Event Detection (ED), a key subtask of information extraction, aims to detect events of specific types from given text. Specifically, each event in a sentence is marked by a word or phrase called an “event trigger”, which expresses the occurrence of the event. The task of event detection is to detect event triggers in a sentence and classify them into the corresponding event types. Taking Figure 1 as an example, ED is supposed to recognize the event trigger “visited” and classify it into the event type “Meet”.
Syntactic dependency, which expresses the interdependence relationships between words in a sentence, can provide key information for event detection. Syntactic dependency includes syntactic structure information, which indicates the syntactic connection between two words, and syntactic dependency relations, which depict the specific type of syntactic relation between two words. Figure 1 shows an example of syntactic dependency parsing; as it illustrates, the syntactic dependency structure is often represented as a tree. As shown in Figure 1, the words “Putin”, “visited” and “Bush”, connected by the syntactic structure, constitute an event, which means that syntactic structure can provide key evidence for event detection. Furthermore, we argue that the syntactic dependency relations of a word are significant indicators of whether the word is a trigger. For example, in Figure 1, “nsubj”, “dobj” and “nmod” are all trigger-related syntactic relations. Specifically, “nsubj” and “dobj” show that “Putin” and “Bush” are the subject and object of “visited” respectively, and the words connected by “nmod” express when and where the event happened. This implies that the word “visited” is likely to be the trigger of an event. Additionally, according to our statistics on the benchmark ACE2005 dataset, “nsubj”, “dobj” and “nmod” account for 25% of trigger-related syntactic relations (versus an average of 2.5% per relation type). Therefore, it is important to consider syntactic dependency structure and relation labels simultaneously for ED.
Recently, Graph Convolutional Network (GCN) based models [12, 9, 16] have been proposed to introduce syntactic dependency structure to improve the performance of event detection. These models outperform previous sequence-based models [1, 11, 8, 2], which do not use syntactic structure information. However, these GCN-based models ignore the specific syntactic dependency relation labels. To introduce relation labels into GCN, one intuitive approach is to encode each type of syntactic relation with a different relation-specific convolutional filter. However, two technical challenges exist in this approach. The first challenge is parameter explosion, since the number of parameters grows rapidly with the number of relation types. Because the datasets available for event detection are only of moderate size, models with large numbers of parameters easily overfit, which is why existing GCN-based methods for ED [12, 9, 16] ignore the specific syntactic relation labels. The second challenge is the context-free representation of relations. Since a syntactic relation label is encoded in the parameters of a relation-specific convolutional filter, each relation label keeps the same representation across all graphs. In fact, the same relation under different contexts conveys different information. For example, in Figure 1, the “nmod” connected with “ranch” conveys where the event happens, but the “nmod” connected with “November” conveys when the event happens. Preferably, each relation between different word pairs should have a context-aware representation, so that different clues for event detection can be expressed.
In this paper, we propose a model named Relation-Aware Graph Convolutional Network (RA-GCN) to overcome the above challenges simultaneously. To model the relation between words without parameter explosion, we construct a relation-aware adjacency tensor by extending each element of the traditional adjacency matrix to a vector, which is the representation of the corresponding relation. Specifically, each element of the tensor is initialized as a syntactic relation label embedding. Since each type of syntactic relation is distinguished by its label embedding rather than by a GCN filter, the number of parameters is reduced. On this basis, a relation-aware aggregation module is designed to aggregate syntactically connected words through their specific relation labels. Besides, a context-aware relation update module is designed to update relation representations with contextual semantic information, so that each relation between words holds a context-aware representation of its own. These two modules update word and relation representations respectively, and they work in a mutually promoting way.
In summary, our contributions are as follows.
We propose RA-GCN for ED, which introduces specific syntactic relations into GCN and is the first study to exploit syntactic dependency structure and relation labels in GCN simultaneously.
We design a relation-aware aggregation module to aggregate syntactically connected words through their specific relation labels, and a context-aware relation update module to update relation representations with contextual information.
We conduct extensive experiments on the standard ED dataset ACE2005; the results show that RA-GCN achieves a new state-of-the-art. We also conduct qualitative analysis to understand how our model works.
2 Related Works
Early methods [5, 4, 7] treated event detection as a classification problem and relied on elaborately designed lexical and syntactic features; these are called feature-based models. Such approaches depend on discriminative features to train the model, so the feature design strategy strongly influences model performance.
Recent studies have shown that neural network based ED models outperform previous feature-based models.
Convolutional Neural Networks (CNN) were proposed to capture sentence clues without designing complicated features. Recurrent Neural Networks (RNN) were introduced to capture the sequential contextual information of each word for the ED task. Later work exploited event argument information through supervised attention to improve the detection of triggers, and investigated gated multi-level attention and hierarchical tagging to detect multiple events in one sentence simultaneously.
The sequence-based neural network models mentioned above do not take syntactic dependency information into consideration, although it can provide important clues for event detection. Consequently, a novel dependency bridge was added to BiLSTM, which helps exploit syntactic tree structure and sequence structure simultaneously. With the rise of GCN, syntactic structure information can be introduced by constructing a graph of the sentence according to the syntactic connectivity between words. GCN can make good use of the syntactic structure between words in a sentence, which has proved effective for event detection. Subsequent work investigated GCN on syntactic dependency trees and adopted a novel pooling method to detect event triggers; introduced syntactic shortcut arcs to enhance information flow together with attention-based GCN; and proposed a dependency-tree-based GCN model with aggregative attention to combine multi-order word representations from different GCN layers. However, these GCN-based models ignore the specific syntactic relation labels due to the limitation of parameter capacity. In this paper, we propose a model which exploits syntactic relation labels effectively, so that performance can be improved.
In this section, we describe the preliminaries and the details of our method. Figure 2 shows the overall architecture of our proposed model.
3.1 Graph Convolutional Networks
GCN, which can encode graphs, is an extension of the convolutional neural network. Normally, a graph with $n$ nodes can be represented by an adjacency matrix $A \in \mathbb{R}^{n \times n}$, where $A_{ij} = 1$ if an edge exists between node $i$ and node $j$, otherwise $A_{ij} = 0$. The graph convolution operation aims to gather information from neighbor nodes in the graph, and the operation in one GCN layer can be formulated as follows:
$$H^{(l+1)} = \sigma\left(A H^{(l)} W^{(l)}\right)$$
where $A$ is the adjacency matrix expressing connectivity between nodes; $H^{(l)} \in \mathbb{R}^{n \times d}$ is the input node representation matrix, with $n$ the number of nodes and $d$ the input dimension of node representations; $W^{(l)} \in \mathbb{R}^{d \times d_h}$ is a learnable convolutional filter, with $d_h$ the hidden dimension of GCN node representations; and $\sigma$ is a non-linear activation function.
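As a concrete illustration, the layer operation above can be sketched in NumPy (a toy sketch: the adjacency matrix, features, and filter below are made-up values, not the model's trained parameters):

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: gather neighbor features through the adjacency
    matrix A, project them with the filter W, then apply ReLU."""
    return np.maximum(A @ H @ W, 0.0)

# Toy graph: 3 nodes; node 0 is connected to nodes 1 and 2,
# and every node keeps a self-loop.
A = np.array([[1., 1., 1.],
              [1., 1., 0.],
              [1., 0., 1.]])
H = np.eye(3)                # one-hot input features (n = 3, d = 3)
W = np.full((3, 2), 0.5)     # filter mapping d = 3 to d_h = 2
H_out = gcn_layer(A, H, W)   # node 0 pools over all three nodes
```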
Event detection aims to locate and classify the word or phrase that expresses the occurrence of an event into the corresponding event type. This word or phrase is also called the Event Trigger. Following existing works, we formulate event detection as a sequence labeling task. Each word in a sentence is annotated with a tag under the “BIO” schema, which includes “O”, “B-EventType” and “I-EventType”. Tag “O” means that the corresponding word does not trigger any event, and “EventType” denotes a specific type of event, where “B-EventType” means the word is the beginning token of an event trigger and “I-EventType” means an inside token of an event trigger. The “BIO” schema is adopted because some triggers contain several words, such as “take off”. Therefore, the total number of tags is $2N + 1$, where $N$ is the number of predefined event types and “$1$” corresponds to the tag “O”.
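The “BIO” schema above can be illustrated with a small helper (a hypothetical function and trigger-span format, written only to show the tagging convention):

```python
def bio_tags(tokens, triggers):
    """Map trigger spans to BIO tags.  `triggers` maps (start, end)
    token spans (end exclusive) to event type names."""
    tags = ["O"] * len(tokens)
    for (start, end), event_type in triggers.items():
        tags[start] = "B-" + event_type          # beginning token
        for i in range(start + 1, end):
            tags[i] = "I-" + event_type          # inside tokens
    return tags

# Single-word trigger: "visited" triggers a Meet event.
single = bio_tags(["Putin", "visited", "Bush"], {(1, 2): "Meet"})
# Multi-word trigger such as "take off" needs the I- tag.
multi = bio_tags(["planes", "take", "off"], {(1, 3): "Transport"})
```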
3.2 Embedding Layer
Embedding layer aims to transform each word to a real-valued embedding vector, which contains semantic information and entity type information of the word.
Entities in the sentences are annotated with the BIO schema, and we transform each entity type label into a real-valued embedding by looking it up in an entity-type lookup table.
Each word is represented as the concatenation of its word embedding and entity type embedding, namely the input embedding of word $w_i$ is $x_i = [e^w_i; e^t_i] \in \mathbb{R}^{d_w + d_t}$, where $d_w$ and $d_t$ denote the dimensions of the word embedding and entity type embedding respectively.
3.3 BiLSTM Layer
The BiLSTM layer is exploited to capture the contextual information of each word. The operation of an LSTM unit can be formulated as
$$h_t = \mathrm{LSTM}\left(x_t, h_{t-1}\right)$$
where $\mathrm{LSTM}(\cdot)$ denotes the operation of the LSTM cell on the word input embedding $x_t$, and $h_t \in \mathbb{R}^{d_l}$, with $d_l$ the hidden dimension of the LSTM cell. BiLSTM runs LSTMs in the forward and backward directions, capturing the past and future sequential context of a word at each time step. The output of the BiLSTM layer is the concatenation of the bi-directional representations, which is used as the initial input word representation for the RA-GCN layer.
3.4 Relation-Aware GCN
To introduce syntactic structure information, previous GCN-based event detection methods transform each sentence into a graph according to its syntactic dependency parse. Each word in the sentence is regarded as a node of the graph, and a boolean adjacency matrix $A \in \{0, 1\}^{n \times n}$ is constructed for the sentence to represent the syntactic connectivity between words, where $n$ is the number of words in the sentence and $A_{ij} = 1$ if a syntactic relation exists between word $i$ and word $j$, otherwise 0. Syntactic relations are not distinguished from each other in $A$.
To model the relation between words, we extend each element of the adjacency matrix to a multi-dimensional vector. Accordingly, a relation-aware adjacency tensor $T \in \mathbb{R}^{n \times n \times p}$ is constructed, whose element $T_{ij} \in \mathbb{R}^p$ is a relation representation vector of dimension $p$; the $p$ dimensions can also be understood as channels of $T$. The relation-aware adjacency tensor is initialized according to the syntactic relations between words, and a lookup table is introduced to transform each type of syntactic relation label into a real-valued embedding. If a syntactic relation exists between word $i$ and word $j$, $T_{ij}$ is initialized as the corresponding $p$-dimensional embedding obtained from the lookup table, otherwise as a zero vector of dimension $p$. Following previous works [10, 17, 3], we construct the graph of the sentence as undirected, where $T_{ij}$ and $T_{ji}$ are initialized with the same syntactic relation embedding so that the opposite relation is also included. For the ROOT word in the dependency tree, we add a self-loop with a special relation ROOT.
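A minimal sketch of the tensor initialization (the label set, embedding dimension, and helper function are illustrative assumptions, not the paper's exact implementation; in the model the label embeddings are learned):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 4                                            # relation embedding dim
# Hypothetical lookup table mapping relation labels to embeddings.
lookup = {lab: rng.standard_normal(p)
          for lab in ["nsubj", "dobj", "nmod", "ROOT"]}

def build_adjacency_tensor(n, arcs):
    """arcs: (head, dependent, label) triples.  The graph is treated as
    undirected, so T[i, j] and T[j, i] share one label embedding; the
    ROOT word gets a self-loop with the special ROOT relation."""
    T = np.zeros((n, n, p))
    for i, j, lab in arcs:
        T[i, j] = T[j, i] = lookup[lab]
    return T

# "Putin visited Bush": "visited" (index 1) is the ROOT.
arcs = [(1, 0, "nsubj"), (1, 2, "dobj"), (1, 1, "ROOT")]
T = build_adjacency_tensor(3, arcs)
```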
RA-GCN aims to produce an expressive representation for each word. Each layer of RA-GCN contains two parts, a relation-aware aggregation module and a context-aware relation update module, which work in a mutually promoting way. The two modules are described as follows.
3.4.1 Relation-Aware Aggregation Module
The relation-aware aggregation module produces a representation for each word by aggregating syntactically connected words through the relation-aware adjacency tensor $T$. Each element of $T$ denotes a relation representation between words, so relation information is embedded during aggregation. Each dimension of the relation representation is regarded as a channel of the tensor, and RA-GCN aggregates words over each channel separately. The relation-aware aggregation operation is defined as follows:
$$H^{(l+1)} = \frac{1}{p} \sum_{c=1}^{p} \sigma\left(T^{(l)}_{:,:,c}\, H^{(l)} W^{(l)}\right)$$
where $T^{(l)}$ is the relation-aware adjacency tensor from initialization or the previous RA-GCN layer, $T^{(l)}_{:,:,c} \in \mathbb{R}^{n \times n}$ is the $c$-th channel slice of $T^{(l)}$ and $n$ is the number of words in the sentence; $H^{(l)} \in \mathbb{R}^{n \times d}$ is the input word representation, where $d$ is the input dimension of words; $W^{(l)} \in \mathbb{R}^{d \times d_h}$ is a learnable filter, $d_h$ is the hidden dimension of RA-GCN, and $\sigma$ is the ReLU activation function. Average pooling over channels is exploited because it covers information from all channels.
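The channel-wise aggregation with average pooling can be sketched as follows (toy shapes and random inputs; a sketch of the operation, not the trained model):

```python
import numpy as np

def relation_aware_aggregation(T, H, W):
    """Aggregate word representations channel by channel through the
    relation-aware adjacency tensor T (n x n x p), then average-pool
    over the p channels."""
    p = T.shape[2]
    channels = [np.maximum(T[:, :, c] @ H @ W, 0.0)   # ReLU per channel
                for c in range(p)]
    return np.mean(channels, axis=0)                  # shape (n, d_h)

rng = np.random.default_rng(0)
T = rng.random((3, 3, 4))        # n = 3 words, p = 4 channels
H = rng.random((3, 5))           # d = 5 input word dimension
W = rng.random((5, 2))           # shared filter, d_h = 2
H_out = relation_aware_aggregation(T, H, W)
```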
3.4.2 Context-Aware Relation Update Module
To produce context-aware relation representations, we use the adjacent word representations to update the relation representations in the adjacency tensor. The operation is defined as follows:
$$T^{(l+1)}_{ij} = \left[h^{(l+1)}_i ; T^{(l)}_{ij} ; h^{(l+1)}_j\right] W_u$$
where $[\cdot;\cdot;\cdot]$ denotes concatenation; $h^{(l+1)}_i, h^{(l+1)}_j \in \mathbb{R}^{d_h}$ are the representations of word $i$ and word $j$ in the current RA-GCN layer after aggregation; $T^{(l)}_{ij} \in \mathbb{R}^p$ is the relation representation between word $i$ and word $j$, where $p$ is the dimension of the relation representation; and $W_u \in \mathbb{R}^{(2 d_h + p) \times p}$ is a learnable transformation matrix, where $d_h$ is the hidden dimension of RA-GCN. This operation combines contextual semantic information with the syntactic relation embedding, so that the different information behind a relation can be expressed. The updated relation-aware adjacency tensor is fed into the next layer of RA-GCN for relation-aware aggregation.
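A sketch of the update step (shapes follow the definitions above; the tanh non-linearity here is an assumption added for illustration, not stated in the text):

```python
import numpy as np

def relation_update(T, H, W_u):
    """Recompute each relation vector T[i, j] from the concatenation of
    the two word representations and the old relation vector."""
    n, _, p = T.shape
    T_new = np.empty_like(T)
    for i in range(n):
        for j in range(n):
            feat = np.concatenate([H[i], T[i, j], H[j]])  # (2*d_h + p,)
            T_new[i, j] = np.tanh(feat @ W_u)  # tanh is an assumption
    return T_new

rng = np.random.default_rng(0)
n, d_h, p = 3, 2, 4
T = rng.random((n, n, p))                # current adjacency tensor
H = rng.random((n, d_h))                 # words after aggregation
W_u = rng.random((2 * d_h + p, p))       # learnable (2*d_h + p) x p
T_new = relation_update(T, H, W_u)
```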
3.5 Classification Layer
Finally, we feed the representation of each word into a fully-connected network, followed by a softmax operation to compute the distribution over all event labels:
$$p\left(y \mid w_i\right) = \operatorname{softmax}\left(W_o h_i + b_o\right)$$
where $W_o$ transforms the word representation $h_i$ into a feature score for each event label and $b_o$ is a bias term. After softmax, the event label with the largest probability is chosen as the classification result.
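The classification step is a linear layer plus softmax; a minimal sketch with made-up weights:

```python
import numpy as np

def classify(h, W_o, b_o):
    """Score a word representation against all event labels and return
    the softmax distribution plus the argmax label index."""
    scores = h @ W_o + b_o
    exp = np.exp(scores - scores.max())      # numerically stable softmax
    probs = exp / exp.sum()
    return probs, int(np.argmax(probs))

h = np.array([1.0, -1.0])                    # word representation (d_h = 2)
W_o = np.array([[2.0, 0.0, -1.0],            # maps 2-dim repr to 3 labels
                [0.0, 1.0,  1.0]])
b_o = np.zeros(3)
probs, label = classify(h, W_o, b_o)         # scores are [2, -1, -2]
```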
3.6 Bias Loss Function
We adopt a bias loss function to strengthen the influence of EventType labels during training, since the number of “O” tags is much larger than that of EventType tags. The bias loss function is formulated as follows:
$$J(\theta) = -\sum_{s=1}^{N_s} \sum_{j=1}^{n_s} \left( I\left(y^s_j\right) + \alpha \left(1 - I\left(y^s_j\right)\right) \right) \log p\left(y^s_j \mid w^s_j\right)$$
where $N_s$ is the number of sentences and $n_s$ is the number of words in the $s$-th sentence; $I(\cdot)$ is a switching function which equals 0 if the tag of the word is one of the EventType labels, otherwise 1; and $\alpha > 1$ is a bias weight which helps to enhance the influence of the EventType tags.
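A sketch of the bias loss under these definitions (label index 0 standing in for “O”; the probabilities below are toy values):

```python
import numpy as np

def bias_loss(log_probs, gold, o_tag=0, alpha=5.0):
    """Negative log-likelihood where every non-"O" (EventType) tag is
    weighted by the bias alpha, i.e. the weight I(y) + alpha*(1 - I(y))."""
    loss = 0.0
    for lp, y in zip(log_probs, gold):
        weight = 1.0 if y == o_tag else alpha   # I(y) = 1 only for "O"
        loss -= weight * lp[y]
    return loss

# Two words with uniform predicted probabilities over two labels.
log_probs = np.log(np.full((2, 2), 0.5))
loss = bias_loss(log_probs, gold=[0, 1], alpha=5.0)
# "O" word contributes log 2; the EventType word contributes 5 * log 2.
```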
4.1 Dataset and Evaluation Metrics
We conduct experiments on the standard event detection dataset, ACE2005, which contains 599 documents annotated with 33 types of events. Following previous works [7, 1, 11, 8, 2, 16], we use the same 529 documents for training, the same 40 documents for testing, and the remaining 30 documents for development. Syntactic dependency parsing is performed with the Stanford CoreNLP toolkit.
4.2 Hyper-parameter Setting
The hyper-parameters are tuned on the development set. The word embeddings are pretrained on the New York Times corpus with the Skip-gram algorithm, and the dimension of word embeddings is 100. The dimension of entity type embeddings is 25. We set the hidden dimension of BiLSTM to 100 and the hidden dimension of RA-GCN to 150, with relation embeddings of dimension 50. Following previous work, the maximum sentence length is set to 50 by padding shorter sentences and truncating longer ones. We set the batch size to 30 and use SGD for optimization with a learning rate of 0.1. The dropout rate is 0.6 and the L2-norm coefficient is 1e-5. The bias loss parameter $\alpha$ is 5. Additionally, we set RA-GCN to 2 layers, since GCN suffers from over-smoothing, which means node representations become increasingly similar as the number of GCN layers increases.
We select the following models for comparison, which fall into three types: feature-based models, sequence-based neural network models, and GCN-based neural network models.
4.3.1 Feature-based Models
4.3.2 Sequence-based Neural Network Models
1) DMCNN uses a dynamic multi-pooling convolutional network; 2) JRNN employs a bidirectional RNN for the task; 3) ANN-AugAtt uses annotated event argument information with supervised attention, where words describing the Time, Place and Person of an event receive larger attention scores; 4) dbRNN adds dependency arcs with weights to a BiLSTM to make use of tree structure and sequence structure simultaneously; 5) HBTNGMA uses hierarchical and bias tagging networks to detect multiple events in one sentence collectively.
4.3.3 GCN-based Models
1) GCN-ED investigates GCN on syntactic dependency tree structures to improve performance; 2) JMEE uses GCN with self-attention and highway networks to improve the performance of GCN for event detection; 3) MOGANED uses GCN with aggregated attention to combine multi-order word representations from different GCN layers.
RGCN, which models relational data with relation-specific adjacency matrices and convolutional filters, was originally applied to knowledge graph completion. We adapt RGCN to the task of event detection and report the result in Table 1.
4.4 Overall Performance
Table 1 shows the performance of our model and the comparable baseline models. RA-GCN achieves the best performance among all the models above, improving over the best baseline by 1.9% on F1-measure. Besides, we make the following observations:
1) RA-GCN outperforms the best GCN-based models, which demonstrates that syntactic relation labels can provide key information for event detection. 2) RA-GCN outperforms RGCN, and the reasons can be analyzed from two aspects. First, RA-GCN has fewer parameters than RGCN, so the model can be trained better on limited data. Second, context-aware relation representations provide more information for event detection. 3) Comparing RA-GCN with dbRNN, which adds syntactic dependency arcs with scalar weights to a BiLSTM, our model gains improvements on both P and R. The reason is that GCN captures dependency tree structure more effectively, and the multi-dimensional embedding of syntactic relations in RA-GCN can learn more information than the single scalar weight in dbRNN. 4) RA-GCN outperforms all sequence-based neural network models, which demonstrates that reasonable use of syntactic dependency information improves performance.
5.1 Ablation Study
(Table 2 excerpt: the “– RAAM & CARUM” ablation achieves 74.82 F1.)
To study the contribution of the core components of RA-GCN, we design ablation experiments. Based on the experimental results in Table 2, our analysis is as follows. 1) –RAAM: To study whether syntactic labels help improve the performance of RA-GCN, we initialize every element of the relation-aware adjacency tensor with the same representation, which means only the syntactic dependency structure is exploited. The result drops by nearly 2.1% on F1-measure, which demonstrates that syntactic dependency labels provide information that improves the performance of RA-GCN. 2) –MdR: To study whether the multi-dimensional representation of relations enhances the ability to capture information, we set the dimension of the relation representation to 1, which means the relation-aware adjacency tensor is compressed to a matrix. The F1-measure drops by approximately 4%, which demonstrates that a multi-dimensional representation can learn more information than a scalar weight. 3) –CARUM: To study whether context-aware relation representations help improve performance, we remove the context-aware relation update module from RA-GCN. The performance degrades by 2.4%, which illustrates that context-aware relation representations provide more evident information for event detection. 4) –RAAM & CARUM: To study whether “relation” helps GCN work better, we remove the relation-aware aggregation module and the context-aware relation update module simultaneously, which means only a vanilla GCN is used. The performance drops by 2.8%, which illustrates that “relation” helps capture information that a vanilla GCN cannot. 5) –BiLSTM: The BiLSTM before RA-GCN is removed, and the performance drops sharply. This illustrates that BiLSTM captures important sequential information which GCN misses. Therefore, GCN and BiLSTM are complementary to each other for the event detection task.
5.2 Effect of Relation Representation Dimension
Since the performance of RA-GCN drops when the dimension of the relation representation is 1, we investigate the influence of the dimensionality of the relation representation. As shown in Figure 3, F1-measure increases with the dimension of the relation representation at first, but degrades once the dimension grows beyond 50. This phenomenon can be explained as follows. If the dimension of the relation representation is too small, it cannot capture the full information of the relation between words. On the contrary, if the dimension is too large, performance may also be hurt because of overfitting.
5.3 Efficiency Advantage
To show that the RA-GCN architecture models relational data more efficiently, we compare it with the RGCN architecture on two aspects: space consumption and inference speed. According to our statistics, the numbers of parameters of the RA-GCN and RGCN architectures are 2,385,584 and 4,115,184 respectively. Since RA-GCN stores syntactic dependency labels as lookup-table parameters rather than GCN filter parameters, the number of parameters is reduced significantly: the parameter count of RA-GCN is only 57.9% of that of RGCN, giving RA-GCN an advantage in space complexity. Besides, the inference speed of RA-GCN is 7.67 times that of RGCN, which means that RA-GCN is more light-weight. Since RA-GCN is more efficient in both inference speed and space consumption, it has better applicability when modeling relational data.
5.4 Case Study
We use the sentence “Putin last visited Bush at his Texas ranch in November 2001” as an example to illustrate the connection strength between words. Following previous work, we use the length of the relation representation vector in the adjacency tensor as the corresponding connection strength score. As shown in each row of Figure 4, every word has its strongest connection with “visited”, which is exactly the event trigger and the ROOT of the syntactic dependency tree. In the row of “visited”, the strongest connections are with “Putin”, “ranch”, “November” and “Bush”, which shows that “nsubj”, “nmod” and “dobj” are strong relations for the event trigger. The strong connections with the event-related words (“Putin”, “Bush”, “ranch” and “November”) mean that these words contribute more information in relation-aware aggregation when producing the expressive representation of the event trigger “visited”, which provides evident clues for event detection.
6 Conclusion and Future Works
In this paper, we propose a novel model named Relation-Aware Graph Convolutional Network (RA-GCN) for event detection, which exploits syntactic dependency relation labels efficiently and models the relation between words explicitly. Experiments on the standard event detection dataset ACE2005 show that our proposed model outperforms all baseline models. In the future, we will take the direction of syntactic dependencies into account in RA-GCN. Besides, we will attempt to apply the RA-GCN model to relation extraction and other subtasks of information extraction.
References

- Event extraction via dynamic multi-pooling convolutional neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Beijing, China, pp. 167–176.
- (2018) Collective event detection via a hierarchical and bias tagging networks with gated multi-level attention mechanisms. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 1267–1276.
- (2019) Attention guided graph convolutional networks for relation extraction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 241–251.
- (2011) Using cross-entity inference to improve event extraction. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, Oregon, USA, pp. 1127–1136.
- (2008) Refining event extraction through cross-document inference. In Proceedings of ACL-08: HLT, Columbus, Ohio, pp. 254–262.
- (2017) Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR).
- (2013) Joint event extraction via structured prediction with global features. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Sofia, Bulgaria, pp. 73–82.
- (2017) Exploiting argument information to improve event detection via supervised attention mechanisms. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 1789–1798.
- (2018) Jointly multiple events extraction via attention-based graph information aggregation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 1247–1256.
- (2017) Encoding sentences with graph convolutional networks for semantic role labeling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 1506–1515.
- (2016) Joint event extraction via recurrent neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, pp. 300–309.
- Graph convolutional networks with argument-aware pooling for event detection. In Thirty-Second AAAI Conference on Artificial Intelligence.
- (2017) Dynamic routing between capsules. In Advances in Neural Information Processing Systems, pp. 3856–3866.
- (2018) Modeling relational data with graph convolutional networks. In European Semantic Web Conference, pp. 593–607.
- (2018) Jointly extracting event triggers and arguments by dependency-bridge RNN and tensor-based argument interaction. In Thirty-Second AAAI Conference on Artificial Intelligence.
- (2019) Event detection with multi-order graph convolution and aggregated attention. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 5765–5769.
- (2018) Graph convolution over pruned dependency trees improves relation extraction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 2205–2215.