This is the code for our EMNLP 2018 paper "Jointly Multiple Events Extraction via Attention-based Graph Information Aggregation"
Event extraction is of practical utility in natural language processing. In the real world, multiple events commonly occur in the same sentence, and extracting them is more difficult than extracting a single event. Previous work that models the associations between events with sequential modeling methods suffers from low efficiency in capturing very long-range dependencies. In this paper, we propose a novel Jointly Multiple Events Extraction (JMEE) framework to jointly extract multiple event triggers and arguments by introducing syntactic shortcut arcs to enhance information flow and attention-based graph convolution networks to model graph information. The experiment results demonstrate that our proposed framework achieves competitive results compared with state-of-the-art methods.
Extracting events from natural language text is an essential yet challenging task for natural language understanding. When given a document, event extraction systems need to recognize event triggers with their specific types and their corresponding arguments with their roles. Technically speaking, as defined by the ACE 2005 dataset (https://catalog.ldc.upenn.edu/ldc2006t06), a benchmark for event extraction Grishman et al. (2005), the event extraction task can be divided into two subtasks, i.e., event detection (identifying and classifying event triggers) and argument extraction (identifying arguments of event triggers and labeling their roles).
In event extraction, it is common for multiple events to exist in the same sentence. Extracting all of the correct events from such sentences is much more difficult than in the one-event-per-sentence case because the various types of events are often associated with each other. For example, in the sentence “He left the company, and planned to go home directly.”, the trigger word left may trigger a Transport (a person left a place) event or an End-Position (a person retired from a company) event. However, if we take the following event triggered by go into consideration, we are more confident in judging it to be a Transport event rather than an End-Position event. Such associations are quite common in the real world: Injure and Die events are more likely to co-occur with Attack events than others, whereas Marry and Born events are less likely to co-occur with Attack events. In our investigation of the ACE 2005 dataset, around 26.2% (1042/3978) of the sentences belong to this category.
Significant efforts have been dedicated to solving this problem. Most of them exploit various features Liu et al. (2016b); Yang and Mitchell (2016); Li et al. (2013); Keith et al. (2017); Liu et al. (2016a); Li et al. (2015), introduce memory vectors and matrices Nguyen et al. (2016), introduce more transition arcs Sha et al. (2018), or keep more contextual information Chen et al. (2015) in sentence-level sequential modeling methods like RNNs and CRFs. Some also seek features in document-level methods Liao and Grishman (2010); Ji and Grishman (2008). However, sentence-level sequential modeling methods suffer from low efficiency in capturing very long-range dependencies, while the feature-based methods require extensive human engineering, which also largely affects model performance. Besides, these methods do not adequately model the associations between events.
An intuitive way to alleviate this problem is to introduce shortcut arcs, represented by linguistic resources like dependency parsing trees, to channel the information flow from a point to its target through fewer transitions. Compared with sequential order, modeling with these arcs often reduces the number of hops needed from one event trigger to another in the same sentence. In Figure 1, for example, there are two events: a Die event triggered by the word killed with four arguments in red and an Attack event triggered by the word barrage with three arguments in blue. We need six hops from killed to barrage according to sequential order, but only three hops according to the arcs in the dependency parsing tree (along the nmod-arc from killed to witnesses, along the acl-arc from witnesses to called, and along the xcomp-arc from called to barrage). These three arcs constitute a shortcut path, channeling the dependency syntactic information flow from killed to barrage with fewer hops. (In a shortcut path that consists of existing arcs, some arcs may reverse their directions. Moreover, the length of the longest path in a tree is never more than the length of a sequence with the same number of nodes, which means that even in the worst case the shortcut path will not perform worse than sequential modeling.)
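The hop comparison can be illustrated with a small breadth-first search. This is only a sketch: the seven-token sentence and its dependency arcs below are hypothetical and are not the graph of Figure 1; the function name `hops` is ours. Arcs are traversed in either direction, as allowed for shortcut paths.

```python
from collections import deque

def hops(edges, start, goal):
    """Breadth-first search over undirected edges; returns the hop count."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)  # arcs may be traversed in reverse
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # goal not reachable

# Hypothetical sentence with tokens indexed 0..6.
sequential = [(i, i + 1) for i in range(6)]                   # word order
dependency = [(0, 3), (3, 5), (5, 6), (0, 1), (1, 2), (3, 4)]  # made-up tree

seq_hops = hops(sequential, 0, 6)  # path along the word sequence
dep_hops = hops(dependency, 0, 6)  # path along dependency arcs
```

In this toy tree the two tokens are six hops apart in word order but three hops apart along the arcs, mirroring the kind of reduction described above.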
In this paper, we propose a novel Jointly Multiple Events Extraction (JMEE) framework by introducing syntactic shortcut arcs to enhance information flow and attention-based graph convolution networks to model the graph information. To implement modeling with the shortcut arcs, we adopt graph convolutional networks (GCNs) Kipf and Welling (2016); Marcheggiani and Titov (2017); Nguyen and Grishman (2018) to learn a syntactic contextual representation of each node from the representation vectors of its immediate neighbors in the graph. We then use these syntactic contextual representations to extract triggers and arguments jointly, with a self-attention mechanism that aggregates information while preserving the associations between multiple events.
We extensively evaluate the proposed JMEE framework on the widely used ACE 2005 dataset to demonstrate its benefits, especially in capturing the associations between events. To summarize, our contributions in this work are as follows:
We propose a novel joint event extraction framework, JMEE, based on syntactic structures, which enhances information flow and alleviates the difficulty of extracting multiple events from the same sentence.
We propose a self-attention mechanism that aggregates information while preserving the associations between multiple events, and show that it is useful in event extraction.
We achieve state-of-the-art performance on the widely used ACE 2005 dataset for event extraction using the proposed model with GCNs and the self-attention mechanism.
Generally, event extraction can be cast as a multi-class classification problem: deciding whether each word in the sentence forms part of an event trigger candidate and whether each entity in the sentence plays a particular role in the event triggered by the candidate triggers. There are two main approaches to event extraction: (i) the joint approach that extracts event triggers and arguments simultaneously as a structured prediction problem, and (ii) the pipelined approach that first performs trigger prediction and then identifies arguments in separate stages. We follow the joint approach, which effectively avoids the error propagation of the pipeline.
Additionally, we extract events at the sentence level for three main reasons. Firstly, in our investigation, we find that the document-level co-occurrence distributions of the 33 event types in the ACE 2005 dataset are relatively similar to the sentence-level co-occurrence distributions. In Figure 2, for example, the blue bars and the orange bars indicate the conditional probability distribution of all the 33 event types in sentences and documents, respectively, where an Attack event appears, while the green bars and the red bars indicate the sentence-level and document-level conditional probability distributions, respectively, where a Meet event appears. As we can see from this figure, the top three event types co-occurring with Attack events are Die, Transport and Injure, while those co-occurring with Meet events are Attack, Transport, and Transfer-Money. Although different types of events have different co-occurrence relationships, the conditional probability distributions at the two levels for the same event type are relatively similar. (We only focus on the top-K co-occurrence relationships because the rest are too sparse for statistical analysis.) Secondly, there are many off-the-shelf sentence-level linguistic resources in the NLP community that can offer analytical information about the shortcut paths of some structures, like dependency parsing trees, AMR parsing graphs, and semantic role labeling structures. Last but not least, we also find that events within the same sentence have more explicit relationships with each other than events in different sentences of a document, which means the associations between two events are easier to capture.
Let $W = w_1, w_2, \dots, w_n$ be a sentence of length $n$ where $w_i$ is the $i$-th token. Similarly, let $E = e_1, e_2, \dots, e_k$ be the entity mentions in the sentence where $k$ is the number of entity mentions. We apply the BIO annotation schema to assign a trigger label $t_i$ to each token $w_i$, as there are triggers that consist of multiple tokens. Once we obtain trigger candidates of certain types from the trigger labels in the BIO annotation schema, we then need to predict the roles (if any) that each entity mention plays in such events.
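Reading trigger candidates off a BIO label sequence can be sketched as follows. The helper name `trigger_candidates` and the label strings are illustrative assumptions; the example labels correspond to the sentence “He left the company, and planned to go home directly.” with both triggers taken to be Transport events.

```python
def trigger_candidates(labels):
    """Return (start, end_exclusive, event_type) spans from BIO labels."""
    spans, start, etype = [], None, None
    for i, label in enumerate(labels + ["O"]):  # sentinel closes a trailing span
        continues = (label.startswith("I-") and start is not None
                     and label[2:] == etype)
        if continues:
            continue
        if start is not None:          # current candidate ends before token i
            spans.append((start, i, etype))
            start, etype = None, None
        if label.startswith("B-"):     # a new candidate begins at token i
            start, etype = i, label[2:]
    return spans

# He left the company , and planned to go home directly .
example_labels = ["O", "B-Transport", "O", "O", "O", "O",
                  "O", "O", "B-Transport", "O", "O", "O"]
example_spans = trigger_candidates(example_labels)
multi_token = trigger_candidates(["B-Attack", "I-Attack", "O"])
```

A candidate is closed as soon as a label that does not continue it is met, which is why multi-token triggers come out as a single span.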
Our JMEE framework consists of the following four modules: (i) word representation module that represents the sentence with vectors, (ii) syntactic graph convolution network module that performs convolution operations by introducing shortcut arcs from syntactic structures, (iii) self-attention trigger classification module that captures the associations between multiple events in a sentence, and (iv) argument classification that predicts the roles each entity mention plays in the event candidates of specific types, as shown in Figure 3.
In the word representation module, each token $w_i$ in the sentence is transformed to a real-valued vector $x_i$ by looking up embedding matrices and concatenating the following vectors:
The word embedding vector of $w_i$: This is obtained by looking up a pre-trained word embedding matrix, GloVe Pennington et al. (2014).
The POS-tagging label embedding vector of $w_i$: This is generated by looking up the randomly initialized POS-tagging label embedding table.
The entity type label embedding vector of $w_i$: Similarly to the POS-tagging label embedding vector of $w_i$, we annotate the entity mentions in a sentence using the BIO annotation schema and transform the entity type labels to real-valued vectors by looking up the embedding table. Note that we use the whole entity extent in the ACE 2005 dataset, which contains overlapping entity mentions, and we sum all the possible entity type label embedding vectors for each token.
The transformation from the token $w_i$ to the vector $x_i$ essentially converts the input sentence $W$ into a sequence of real-valued vectors $X = (x_1, x_2, \dots, x_n)$, which will be fed into later modules to learn more effective representations for event extraction.
Consider an undirected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ as the syntactic parsing tree for sentence $W$, where $\mathcal{V} = \{v_1, v_2, \dots, v_n\}$ and $\mathcal{E}$ are the sets of nodes and edges, respectively. In $\mathcal{V}$, each $v_i$ is the node representing token $w_i$ in $W$. Each edge $(v_i, v_j) \in \mathcal{E}$ is a directed syntactic arc from token $w_i$ to token $w_j$, with the type label $K(w_i, w_j)$. Additionally, to allow information to flow against the direction, we also add the reversed edge $(v_j, v_i)$ with the type label $K'(w_i, w_j)$. Following Kipf and Welling (2016), we also add all the self-loops, i.e., $(v_i, v_i)$ for any $v_i \in \mathcal{V}$. For example, in the dependency parsing tree shown in Figure 1, there are four arcs in the subgraph with only the two nodes “killed” and “witnesses”: the dependency arc with the type label nmod, the reversed dependency arc with the additional type label nmod′, and the two self-loops of “killed” and “witnesses” with the type label loop.
Therefore, in the $k$-th layer of the syntactic graph convolution network module, we can calculate the graph convolution vector $h_v^{(k+1)}$ for node $v$ by:
$$h_v^{(k+1)} = f\Big(\sum_{u \in \mathcal{N}(v)} \big(W_{K(u,v)}^{(k)} h_u^{(k)} + b_{K(u,v)}^{(k)}\big)\Big)$$
where $K(u,v)$ indicates the type label of the edge $(u,v)$; $W_{K(u,v)}^{(k)}$ and $b_{K(u,v)}^{(k)}$ are the weight matrix and the bias for the type label $K(u,v)$, respectively; $\mathcal{N}(v)$ is the set of neighbors of $v$ including $v$ itself (because of the self-loop); and $f$ is the activation function. Moreover, we use the output of the word representation module to initialize the node representations of the first layer of GCNs.
After applying the above two changes, the number of predefined directed arc type labels (let us say, $N$) will be doubled (to $2N + 1$, counting the self-loop label). This means we will have $2N + 1$ sets of parameter pairs $W_{K(u,v)}^{(k)}$ and $b_{K(u,v)}^{(k)}$ for a single layer of GCN. In this work, we use the Stanford Parser Klein and Manning (2003) to generate the arcs in dependency parsing trees for sentences as the shortcut arcs. The current representation contains approximately 50 different grammatical relations, which yields too many parameters for a single layer of GCN and is not compatible with the scale of the existing training data. To reduce the number of parameters, following Marcheggiani and Titov (2017), we modify the definition of the type label $K(w_i, w_j)$ to:
$$K(w_i, w_j) = \begin{cases} \text{along}, & (v_i, v_j) \in \mathcal{E} \\ \text{rev}, & i \neq j \text{ and } (v_j, v_i) \in \mathcal{E} \\ \text{loop}, & i = j \end{cases}$$
where the new $K(w_i, w_j)$ has only three type labels.
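A minimal sketch of building the typed edge set from a list of dependency arcs is given below. The label strings "along", "rev" and "loop" and the function name `typed_edges` are placeholder names chosen for this illustration; arcs are given as hypothetical (head index, dependent index) pairs.

```python
def typed_edges(n_tokens, arcs):
    """Map each (source, target) node pair to one of three direction labels."""
    edges = {}
    for i in range(n_tokens):
        edges[(i, i)] = "loop"          # self-loop for every node
    for head, dep in arcs:
        edges[(head, dep)] = "along"    # original arc direction
        edges[(dep, head)] = "rev"      # reversed arc
    return edges

# Two-node subgraph with a single dependency arc from node 0 to node 1.
two_node_edges = typed_edges(2, [(0, 1)])
```

For a two-node subgraph joined by one arc this yields four typed edges: the arc, its reverse, and two self-loops.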
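Not all types of edges are equally informative for the downstream task, and there is also noise in the generated syntactic parsing structures; we therefore apply gates on the edges to weight their individual importance. Inspired by Dauphin et al. (2017) and Marcheggiani and Titov (2017), we calculate a weight $g_{u,v}^{(k)}$ for each edge $(u,v)$ indicating its importance for event extraction by:
$$g_{u,v}^{(k)} = \sigma\big(V_{K(u,v)}^{(k)} h_u^{(k)} + d_{K(u,v)}^{(k)}\big)$$
where $\sigma$ is the logistic sigmoid function, and $V_{K(u,v)}^{(k)}$ and $d_{K(u,v)}^{(k)}$ are the weight matrix and the bias of the gate. With this additional gating mechanism, the final syntactic GCN computation is formulated as
$$h_v^{(k+1)} = f\Big(\sum_{u \in \mathcal{N}(v)} g_{u,v}^{(k)} \big(W_{K(u,v)}^{(k)} h_u^{(k)} + b_{K(u,v)}^{(k)}\big)\Big)$$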
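One gated graph-convolution layer can be sketched in plain Python as below. This is an illustration under assumptions rather than the released implementation: the gate is taken to be a scalar per edge (a gate weight vector dotted with the source node vector), ReLU is used as the activation, and all names (`gcn_layer`, `W`, `b`, `V`, `d`) and toy values are ours.

```python
import math

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def gcn_layer(h, edges, W, b, V, d):
    """One gated GCN layer.

    h: list of node vectors; edges: {(u, v): label}, self-loops included;
    W[label], b[label]: weight matrix and bias vector per edge label;
    V[label], d[label]: gate weight vector and scalar gate bias per label
    (assumption: one scalar gate per edge).
    """
    dim = len(next(iter(b.values())))
    out = [[0.0] * dim for _ in h]
    for (u, v), label in edges.items():
        pre = sum(a * x for a, x in zip(V[label], h[u])) + d[label]
        gate = 1.0 / (1.0 + math.exp(-pre))              # sigmoid gate
        msg = [m + c for m, c in zip(matvec(W[label], h[u]), b[label])]
        for j in range(dim):
            out[v][j] += gate * msg[j]                   # gated message u -> v
    return [[max(0.0, x) for x in row] for row in out]   # ReLU

labels = ("along", "rev", "loop")
identity = [[1.0, 0.0], [0.0, 1.0]]
W = {k: identity for k in labels}
b = {k: [0.0, 0.0] for k in labels}
V = {k: [0.0, 0.0] for k in labels}   # zero gate weights -> every gate is 0.5
d = {k: 0.0 for k in labels}
toy_edges = {(0, 0): "loop", (1, 1): "loop", (0, 1): "along", (1, 0): "rev"}
toy_out = gcn_layer([[1.0, 0.0], [0.0, 1.0]], toy_edges, W, b, V, d)
```

With identity weights and all gates at 0.5, each node ends up with half of its own vector plus half of its neighbor's.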
Stacking $k$ layers of GCNs can model information within $k$ hops, but sometimes the length of the shortcut path between two triggers is less than $k$. To avoid over-propagating information, we adopt highway units Srivastava et al. (2015), which allow information to flow unimpeded across stacked GCN layers. Typically, highway layers conduct a nonlinear transformation as:
$$t = \sigma\big(W_T h_v^{(k)} + b_T\big)$$
$$\bar{h}_v^{(k)} = g\big(W_H h_v^{(k)} + b_H\big) \odot t + h_v^{(k)} \odot (1 - t)$$
where $\sigma$ is the sigmoid function; $\odot$ is the element-wise product operation; $g$ is a nonlinear activation function; $t$ is called the transform gate and $(1 - t)$ is called the carry gate. Therefore, the input of the $k$-th GCN layer should be $\bar{h}^{(k)}$ instead of $h^{(k)}$.
The GCNs are designed to capture the dependencies along shortcut arcs, while the number of GCN layers limits the ability to capture local graph information. However, in such cases, we find that leveraging local sequential context helps to expand the information flow without increasing the number of GCN layers, which means LSTMs and GCNs may be complementary. Therefore, instead of feeding the word representation directly into the first GCN layer, we follow Marcheggiani and Titov (2017) and apply a bidirectional LSTM (Bi-LSTM) Hochreiter and Schmidhuber (1997) to encode the word representation as:
$$\overrightarrow{p_i} = \overrightarrow{\mathrm{LSTM}}\big(\overrightarrow{p_{i-1}}, x_i\big), \qquad \overleftarrow{p_i} = \overleftarrow{\mathrm{LSTM}}\big(\overleftarrow{p_{i+1}}, x_i\big)$$
and the input of the $i$-th token to the GCNs is $p_i = [\overrightarrow{p_i}, \overleftarrow{p_i}]$, where $[\cdot, \cdot]$ is the concatenation operation. The Bi-LSTM adaptively accumulates and abstracts the context for each token in the sentence.
When taking each token $w_i$ as the current word, we get the representation $D$ of all tokens calculated by the GCNs. Traditional event extraction systems often use max-pooling or one of its variants to aggregate information to each position. However, max-pooling aggregation mechanisms tend to produce similar results after the GCN modules in our framework. For example, if we compute the aggregated vector at each position by max-pooling over the GCN outputs $D_1, D_2, \dots, D_n$, in which $n$ is the sentence length, the resulting vector is identical at every position. Besides, predicting a trigger label for a token should take other possible trigger candidates into consideration. To capture the associations between triggers in a sentence, we design a self-attention mechanism that aggregates information while preserving the associations between multiple events.
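The observation about max-pooling can be checked with a few lines of code. The vectors below are made-up stand-ins for GCN outputs: because the maximum is taken over the same set of token vectors regardless of the current position, every position receives the same aggregated vector.

```python
def max_pool(D):
    """Element-wise maximum over all token vectors in D."""
    return [max(column) for column in zip(*D)]

# Hypothetical GCN outputs for a three-token sentence.
D = [[0.1, 0.9], [0.7, 0.2], [0.4, 0.5]]

# Aggregating "for each position" pools over the same vectors every time.
pooled_per_position = [max_pool(D) for _ in D]
```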
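Given the current token $w_i$, the self-attention score vector and the context vector $C_i$ at position $i$ are calculated as:
$$\mathrm{score} = \mathrm{norm}\big(\exp(W_2 f(W_1 D + b_1) + b_2)\big)$$
$$C_i = \Big[\sum_{j=1}^{n} \mathrm{score}_j \cdot D_j,\; D_i\Big]$$
where $\mathrm{norm}$ means the normalization operation. Then we feed the context vector $C_i$ into a fully-connected network to predict the trigger label in the BIO annotation schema as:
$$\bar{C}_i = f(W_c C_i + b_c)$$
$$y_{t_i} = \mathrm{softmax}(W_t \bar{C}_i + b_t)$$
where $f$ is a non-linear activation and $y_{t_i}$ is the final output of the $i$-th trigger label.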
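The aggregation step can be sketched as follows. Note the substitution: this sketch scores positions with a plain dot product followed by a softmax, as a stand-in for the learned scoring network, so it illustrates only the shape of the computation (normalized scores, a weighted sum over all token vectors, and concatenation with the current token's vector). The function names and toy vectors are assumptions.

```python
import math

def softmax(xs):
    """Normalize exponentiated scores so they sum to one."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_context(D, i):
    """Return (scores, context) for position i.

    Scores come from dot products with D[i] (a stand-in scoring function);
    the context is the score-weighted sum of all vectors concatenated with D[i].
    """
    scores = softmax([sum(a * c for a, c in zip(D[i], D_j)) for D_j in D])
    summary = [sum(s * D_j[k] for s, D_j in zip(scores, D))
               for k in range(len(D[i]))]
    return scores, summary + D[i]

toy_D = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]   # hypothetical token vectors
toy_scores, toy_context = attention_context(toy_D, 0)
```

Unlike max-pooling, the context vector differs from position to position because it always carries the current token's own representation.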
Once we have extracted an entire trigger candidate, that is, once we meet an O label after an I-Type label or a B-Type label, we use the aggregated context vector to perform argument classification on the entity list in the sentence.
For each entity-trigger pair, as both the entity and the trigger candidate are likely to be a subsequence of tokens, we aggregate the context vectors of the subsequences into a trigger candidate vector $T_i$ and an entity vector $E_j$ by average pooling along the sequence length dimension. Then we concatenate them and feed the result into a fully-connected network to predict the argument role as:
$$y_{a_{ij}} = \mathrm{softmax}\big(W_a [T_i, E_j] + b_a\big)$$
where $y_{a_{ij}}$ is the final output of which role the $j$-th entity plays in the event triggered by the $i$-th trigger candidate.
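The pooling and concatenation that form the input of the argument classifier can be sketched as below. Spans are given as hypothetical (start, end-exclusive) token index pairs, and the context vectors are toy values; the classifier itself is omitted.

```python
def mean_pool(vectors):
    """Average a list of equal-length vectors along the sequence dimension."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def pair_features(C, trigger_span, entity_span):
    """Concatenate the averaged trigger vector and the averaged entity vector."""
    trigger_vec = mean_pool(C[trigger_span[0]:trigger_span[1]])
    entity_vec = mean_pool(C[entity_span[0]:entity_span[1]])
    return trigger_vec + entity_vec

# Hypothetical context vectors for a four-token sentence.
C = [[1.0, 1.0], [3.0, 5.0], [2.0, 2.0], [4.0, 0.0]]
features = pair_features(C, trigger_span=(1, 2), entity_span=(2, 4))
```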
When training our framework, if the trigger candidate that we focus on is not a correct trigger, we set all the golden argument labels concerning the trigger candidate to OTHER (not any roles). With this setting, the labels of the trigger candidate will be further adjusted to reach a reasonable probability distribution.
In order to train the networks, we minimize the joint negative log-likelihood loss function. Due to the data sparsity in the ACE 2005 dataset, we adapt our joint negative log-likelihood loss function by adding a bias term as:
$$J(\theta) = -\sum_{p=1}^{N} \Big( \sum_{i=1}^{n_p} I(O) \cdot \log p(y_{t_i} \mid \theta) + \beta \sum_{j=1}^{t_p} \sum_{k=1}^{e_p} \log p(y_{a_{jk}} \mid \theta) \Big)$$
where $N$ is the number of sentences in the training corpus; $n_p$, $t_p$ and $e_p$ are the numbers of tokens, extracted trigger candidates and entities of the $p$-th sentence; $I(O)$ is an indicator function that outputs a fixed positive floating-point number $\alpha$ greater than one if the trigger label is not O, and one otherwise; $\beta$ is also a floating-point hyper-parameter like $\alpha$.
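The effect of the bias term can be sketched for a single sentence as follows. This is our reading of the description above, not the released code: tokens whose gold trigger label is not O are up-weighted by a factor larger than one (called `alpha` here), and the argument term is scaled by a second hyper-parameter (called `beta` here). The probabilities and hyper-parameter values are made up.

```python
import math

def joint_loss(trigger_probs, trigger_gold, arg_probs, alpha, beta):
    """Biased negative log-likelihood for one sentence.

    trigger_probs[i]: probability given to the gold trigger label of token i
    trigger_gold[i]:  gold BIO trigger label of token i
    arg_probs:        probabilities given to the gold role of each
                      trigger-entity pair
    """
    trigger_term = sum((alpha if gold != "O" else 1.0) * math.log(p)
                       for p, gold in zip(trigger_probs, trigger_gold))
    argument_term = sum(math.log(p) for p in arg_probs)
    return -(trigger_term + beta * argument_term)

biased = joint_loss([1.0, 0.5], ["O", "B-Attack"], [0.5], alpha=2.0, beta=1.0)
unbiased = joint_loss([1.0, 0.5], ["O", "B-Attack"], [0.5], alpha=1.0, beta=1.0)
```

Because the non-O token is weighted more heavily, an error on a trigger token costs more than the same error on an O token, which counteracts the dominance of O labels.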
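Table 1: Overall performance compared with state-of-the-art methods. Columns: Trigger Identification (%), Trigger Classification (%), Argument Identification (%), Argument Role (%).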
Dataset, Resources and Evaluation Metric
We evaluate our JMEE framework on the ACE 2005 dataset. The ACE 2005 dataset annotates 33 event subtypes and 36 role classes. Together with the NONE class and the BIO annotation schema, this gives 67 categories in event detection and 37 categories in argument extraction. To comply with previous work, we use the same data split as Ji and Grishman (2008); Liao and Grishman (2010); Li et al. (2013); Chen et al. (2015); Liu et al. (2016b); Yang and Mitchell (2016); Nguyen et al. (2016); Sha et al. (2018). This data split includes 40 newswire articles (881 sentences) for the test set, 30 other documents (1087 sentences) for the development set and the 529 remaining documents (21,090 sentences) for the training set.
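The category counts follow directly from the annotation scheme, as the short calculation below shows; the variable names are ours.

```python
EVENT_SUBTYPES = 33
ROLE_CLASSES = 36

# BIO schema: a B- and an I- label per event subtype, plus one O/NONE label.
trigger_categories = 2 * EVENT_SUBTYPES + 1
# Argument roles: every role class plus the NONE class.
role_categories = ROLE_CLASSES + 1

# Documents in the test, development and training splits.
total_documents = 40 + 30 + 529
```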
We deploy the Stanford CoreNLP toolkit (http://stanfordnlp.github.io/CoreNLP/) to preprocess the data, including tokenization, sentence splitting, POS tagging and generating dependency parsing trees.
Also, we follow the criteria of the previous work Ji and Grishman (2008); Liao and Grishman (2010); Li et al. (2013); Chen et al. (2015); Liu et al. (2016b); Yang and Mitchell (2016); Nguyen et al. (2016); Sha et al. (2018) to judge the correctness of the predicted event mentions.
For all the experiments below, in the word representation module, we use 300 dimensions for the word embeddings and 50 dimensions for each of the other three embeddings, namely the POS-tagging embedding, the positional embedding and the entity type embedding. In the syntactic GCN module, we use a three-layer GCN, a one-layer Bi-LSTM with 220 hidden units, self-attention with 300 hidden units and 200 hidden units for the remaining transformations. We also set the dropout rate to 0.5 and the L2-norm to 1e-8. The batch size in our experiments is 32, and we fix a maximum sentence length in the experiments by padding shorter sentences and cutting off longer ones. These hyperparameters are either randomly searched or chosen empirically when tuning on the development set.
We use ReLU Glorot et al. (2011) as our non-linear activation function. We apply the stochastic gradient descent algorithm with mini-batches and the AdaDelta update rule Zeiler (2012). The gradients are computed using back-propagation. During training, besides the weight matrices, we also fine-tune all the embedding tables.
We compare our performance with the following state-of-the-art methods:
Cross-Event is proposed by Liao and Grishman (2010), which uses document-level information to improve the performance of event extraction;
JointBeam is the method proposed by Li et al. (2013), which extracts events based on structured prediction with manually designed features;
DMCNN is proposed by Chen et al. (2015), which uses dynamic multi-pooling to keep multiple events’ information;
PSL is proposed by Liu et al. (2016b), which uses a probabilistic reasoning model to classify events by using latent and global information to encode the associations between events;
JRNN is proposed by Nguyen et al. (2016), which uses a bidirectional RNN and manually designed features to jointly extract event triggers and arguments;
dbRNN is proposed by Sha et al. (2018), which adds dependency bridges over Bi-LSTM for event extraction.
Table 1 shows the overall performance compared with the above state-of-the-art methods with gold-standard entities. From the table, we can see that our JMEE framework achieves the best scores for both trigger classification and argument-related subtasks among all the compared methods. There is a significant gain in trigger classification and argument role labeling performance, which is 2% higher than the best-reported models. These results demonstrate the effectiveness of our method in incorporating graph convolution and syntactic shortcut arcs.
To evaluate the effect of our framework on alleviating the multiple-event problem, we divide the test data into two parts (1/1 and 1/N) following Nguyen et al. (2016) and Chen et al. (2015) and perform evaluations separately. 1/1 means that a sentence has only one trigger, or that an argument plays a role in only one event in the sentence; otherwise, 1/N is used.
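For triggers, the partition into 1/1 and 1/N can be sketched as below. The sentence records (dictionaries with a "triggers" list) are a hypothetical data layout chosen for this illustration, and sentences without any trigger are simply left out of both parts.

```python
def split_by_trigger_count(sentences):
    """Partition sentences into 1/1 (exactly one trigger) and 1/N (several)."""
    one, many = [], []
    for sentence in sentences:
        count = len(sentence["triggers"])
        if count == 1:
            one.append(sentence)
        elif count > 1:
            many.append(sentence)
    return one, many

toy_sentences = [
    {"id": "s1", "triggers": ["left", "go"]},
    {"id": "s2", "triggers": ["arrested"]},
    {"id": "s3", "triggers": []},
]
one_to_one, one_to_many = split_by_trigger_count(toy_sentences)
```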
Table 2 shows the performance ($F_1$ scores) of JRNN Nguyen et al. (2016), DMCNN Chen et al. (2015), the two baseline models Embedding+T and CNN in Chen et al. (2015), and our framework on the trigger classification subtask and the argument role labeling subtask. Embedding+T uses word embedding vectors and the traditional sentence-level features in Li et al. (2013), while CNN is similar to DMCNN, except that it applies the standard max-pooling mechanism instead of the dynamic multi-pooling mechanism. We can see that our framework significantly outperforms all the other methods, especially in the trigger classification subtask. In the 1/N data split of triggers, our framework is 7.9% better than JRNN, which demonstrates that our method of leveraging syntactic shortcut arcs and a self-attention aggregation mechanism is helpful in alleviating the multiple-event problem.
We use the sentence “police have arrested four people in connection with the killings” as an example to illustrate the features captured by our self-attention aggregation mechanism, by transforming the attention scores into a row-wise heat map in Figure 4. There are two events in the sentence: an Arrest-Jail event triggered by arrested and a Die event triggered by killings. Additionally, the entity police plays an Agent role and the entity four people plays a Person role in the Arrest-Jail event.
As we can see from Figure 4, in the row of arrested, there are relatively strong connections with arrested (itself), four people (its argument) and killings (the other event). In the row of killings, there are also relatively strong connections with killings (itself) and arrested (the other event). Besides, the words police, four and in also have high scores with killings, which may be due to context information propagating through syntactic shortcut arcs.
There are several existing approaches that exploit the associations between events in the event extraction task. Some of them exploit various sentence-level features, such as ranking dependencies McClosky et al. (2011), combinational features of triggers and arguments Li et al. (2013), probabilistic soft logic information Liu et al. (2016b, a), and trigger-specific and relational features Yang and Mitchell (2016); Keith et al. (2017). Others seek features in document-level methods Liao and Grishman (2010); Ji and Grishman (2008); Hong et al. (2011); Reichart and Barzilay (2012); Lu and Roth (2012). The feature-based methods require extensive human engineering, which strongly affects model performance; moreover, the features are learned from unbalanced training data, which is difficult for sparse event types.
There is also a group of deep learning methods using RNNs Nguyen et al. (2016); Sha et al. (2018); Liu et al. (2018) and CNNs Chen et al. (2015); Feng et al. (2016); Nguyen and Grishman (2016) to capture the associations between events. However, sentence-level sequential modeling methods suffer from low efficiency in capturing very long-range dependencies. Besides, these methods do not fully model the associations between events.
This paper presents a novel deep neural Jointly Multiple Events Extraction (JMEE) framework for the task of event extraction, especially for alleviating the multiple-event problem. In our framework, we introduce syntactic shortcut arcs to enhance information flow and adapt the graph convolution network to capture the enhanced representation. Then a self-attention aggregation mechanism is applied to aggregate the associations between events. Besides, we jointly extract event triggers and arguments by optimizing a biased loss function that accounts for the imbalances in the dataset. The experiment results demonstrate the effectiveness of our proposed framework. In the future, we plan to exploit the information of an argument that plays different roles in various events to further improve event extraction.
We would like to thank Yansong Feng, Ying Zeng, Xiaochi Wei, Qian Liu and Changsen Yuan for their insightful comments and suggestions. We also greatly appreciate the comments from the anonymous reviewers, which will help further improve our work. This work is supported by the National Natural Science Foundation of China (No. 61751201 and No. 61602490) and the National Key R&D Plan (No. 2017YFB0803302).
Chen et al. (2015). Event extraction via dynamic multi-pooling convolutional neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, pages 167–176.
Dauphin et al. (2017). In Proceedings of the 34th International Conference on Machine Learning, pages 933–941.
Glorot et al. (2011). In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, pages 315–323.
Nguyen et al. (2016). Joint event extraction via recurrent neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 300–309.
Sha et al. (2018). Jointly extracting event triggers and arguments by dependency-bridge RNN and tensor-based argument interaction. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, pages 5916–5923.