Event extraction (EE), traditionally modeled as detecting trigger words and extracting corresponding arguments from plain text, plays a vital role in natural language processing since it can produce valuable structured information to facilitate a variety of tasks, such as knowledge base construction, question answering, language understanding, etc.
In recent years, with the rising trend of digitalization in domains such as finance, legislation and health, EE has become an increasingly important accelerator for business in those domains. Taking the financial domain as an example, continuous economic growth has produced exploding volumes of digital financial documents, such as the financial announcements of a specific stock market shown in Figure 1, referred to here as Chinese financial announcements (ChFinAnn). While forming a gold mine, such large volumes of announcements call for EE to help people extract valuable structured information hidden in massive plain text, so as to sense emerging risks and find profitable opportunities.
Given the necessity of applying EE to the financial domain, the specific characteristics of financial documents, shared by many other business fields, raise two critical challenges to EE: arguments-scattering and multi-event. The first challenge means that the arguments of one event record may scatter across multiple sentences of the document, while the second means that a document is likely to contain multiple such event records. To illustrate these challenges intuitively, we show a typical ChFinAnn document with two Equity Pledge event records in Figure 2, where ID denotes the sentence index. For the first event, the entity "[SHARE1]" (in this paper, we use "entity" as a general notion that, for brevity, includes named entities, numbers, percentages, etc.) is the correct Pledged Shares at the sentence level (ID 5). However, due to the capital stock increment (ID 7), the correct Pledged Shares at the document level should be "[SHARE2]". Similarly, "[DATE3]" is the correct End Date at the sentence level (ID 9) but incorrect at the document level (ID 10). Moreover, some summarized arguments, such as "[SHARE5]" and "[RATIO]", often occur at the end of the document.
Although a great number of efforts [Ahn2006, Ji and Grishman2008, Liao and Grishman2010, Hong et al.2011, Riedel and McCallum2011, Li et al.2013, Li et al.2014, Chen et al.2015, Yang and Mitchell2016, Nguyen et al.2016, Liu et al.2017, Sha et al.2018, Zhang and Ji2018, Nguyen and Nguyen2019] have been devoted to EE, most of them are based on ACE 2005 (https://www.ldc.upenn.edu/collaborations/past-projects/ace), an expert-annotated benchmark that only tags event arguments within the sentence scope. We refer to such a task as sentence-level EE (SEE), which obviously overlooks the arguments-scattering challenge. In contrast, EE on financial documents, such as ChFinAnn, requires document-level EE (DEE) to handle arguments-scattering, and this challenge becomes much harder when coupled with multi-event. A recent work, DCFEE [Yang et al.2018], attempted to explore DEE on ChFinAnn by employing distant supervision (DS) [Mintz et al.2009] to generate EE data and performing a two-stage extraction: 1) a sequence tagging model for SEE, and 2) a key-event-sentence detection model to detect the key sentence, plus an arguments-completion strategy that pads missing arguments from surrounding sentences for DEE. However, the sequence tagging model for SEE cannot handle multi-event sentences elegantly; furthermore, the context-agnostic arguments-completion strategy fails to address the arguments-scattering challenge effectively.
In this paper, we propose a novel end-to-end solution, Doc2EDAG, to address the unique challenges of DEE. The key idea of Doc2EDAG is to transform the event table into an entity-based directed acyclic graph
(EDAG). The EDAG format transforms the hard table-filling task into several sequential path-expanding sub-tasks that are more tractable. To support efficient EDAG generation, Doc2EDAG encodes entities with document-level contexts and employs a memory mechanism for path expanding. Moreover, to ease DS-based document-level labeling, we propose a novel DEE formalization that removes trigger-words labeling and regards DEE as directly filling event tables based on a document. This design relies on neither a predefined trigger-words set nor heuristics to filter multiple trigger candidates, yet still perfectly matches the ultimate goal of DEE: mapping a document to its underlying event tables.
To evaluate the effectiveness of Doc2EDAG, we conduct experiments on a real-world dataset consisting of large-scale financial announcements. In contrast to the dataset used by DCFEE, where 97% of documents (estimated from their Table 1 as (2×NO.ANN−NO.POS)/NO.ANN) contain just one event record, our data collection is ten times larger, and about 30% of its documents include multiple event records. Extensive experiments demonstrate that Doc2EDAG significantly outperforms state-of-the-art methods when facing DEE-specific challenges.
In summary, our contributions include:
We propose a novel solution, Doc2EDAG, which can directly generate event tables given a document, to address the unique challenges of DEE efficiently.
We reformalize DEE as a task without trigger words to ease DS-based document-level event labeling.
We build a large-scale real-world dataset for DEE with the unique challenges of arguments-scattering and multi-event, on which extensive experiments demonstrate the superiority of Doc2EDAG.
Note that, though we focus on ChFinAnn data in this work, our labeling and modeling strategies are quite general and can benefit many other business domains with similar challenges, such as the extraction of criminal facts and judgments from legal documents, or the identification of disease symptoms and doctor instructions from medical diagnostic reports.
2 Related Work
Recent developments in information extraction have advanced joint models that extract entities and identify structures (relations or events) among them simultaneously. For instance, [Ren et al.2017, Zheng et al.2017, Zeng et al.2018a, Wang et al.2018] focused on jointly extracting entities and inter-entity relations. Meanwhile, in line with the focus of this paper, a few studies aimed at designing joint models for entity and event extraction, including handcrafted-feature-based [Li et al.2014, Yang and Mitchell2016, Judea and Strube2016] and neural-network-based [Zhang and Ji2018, Nguyen and Nguyen2019] models. Nevertheless, these models did not show how to handle argument candidates beyond the sentence scope. [Yang and Mitchell2016] claimed to handle event-argument relations across sentences given well-defined features, which, unfortunately, are nontrivial to obtain.
In addition to the modeling challenge, another big obstacle to democratizing EE is the lack of training data, due to the enormous cost of obtaining expert annotations. To address this problem, some studies attempted to adapt distant supervision (DS) to the EE setting, since DS has shown promising results in automatically generating training data for relation extraction from knowledge bases [Mintz et al.2009]. However, vanilla EE requires trigger words, which are absent from factual knowledge bases. Therefore, [Chen et al.2017, Yang et al.2018] employed either linguistic resources or predefined dictionaries for trigger-words labeling. On the other hand, another recent work [Zeng et al.2018b] showed that directly labeling event arguments without trigger words is also feasible.
We first clarify several key notions: (1) entity mention: an entity mention is a text span that refers to an entity object; (2) event role: an event role corresponds to a predefined field of the event table; (3) event argument: an event argument is an entity that plays a specific event role; (4) event record: an event record corresponds to an entry of the event table and contains several arguments with required roles. For example, Figure 2 shows two event records, where the entity “[PER]” is an event argument with the Pledger role.
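As a concrete reference for these notions, the following Python sketch shows how they might be represented as plain data structures. The class and field names here are our own illustration, not part of any released code:

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class EntityMention:
    text: str      # surface span, e.g. "[PER]"
    sent_id: int   # index of the sentence containing the span
    start: int     # token offsets within that sentence
    end: int       # exclusive end offset

@dataclass
class EventRecord:
    event_type: str
    # maps an event role to its argument entity (None = empty role)
    arguments: Dict[str, Optional[str]] = field(default_factory=dict)

# the example from Figure 2: "[PER]" plays the Pledger role
record = EventRecord("EquityPledge", {"Pledger": "[PER]", "EndDate": None})
```

An event table for one event type is then simply a list of such `EventRecord` entries sharing the same role set.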
To better elaborate and evaluate our proposed approach, we leverage ChFinAnn data in this paper. ChFinAnn documents are firsthand official disclosures of listed companies in the Chinese stock market and cover hundreds of types, such as annual reports and earnings estimates. In this work, we focus on event-related ones that are frequent, influential, and mainly expressed in natural language.
4 Document-level Event Labeling
As a prerequisite to DEE, we first conduct DS-based event labeling at the document level: we map tabular records from an event knowledge base to document text and regard well-matched records as events expressed by that document. Below, we introduce the details of this event labeling with the no-trigger-words design and then propose the corresponding DEE task.
To ensure labeling quality, we impose two constraints on matched records: 1) arguments of predefined key event roles must exist (non-key ones can be empty), and 2) the number of matched arguments must exceed a certain threshold. These constraints are event-specific, and in practice we can tune them to directly ensure labeling quality at the document level. Note that we do not label trigger words explicitly. Besides not affecting the DEE functionality, an extra benefit of this no-trigger-words design is much easier DS-based labeling that relies on neither a predefined trigger-words set nor heuristics to filter multiple potential trigger words. Some previous studies [Zeng et al.2018b] attempted a similar no-trigger-words design, but they only considered the SEE setting and cannot be directly applied to the DEE setting. Moreover, we assign the roles of matched arguments to the matched tokens as token-level entity tags.
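The two matching constraints above can be sketched as a simple filter. All names in this sketch are illustrative; the actual matching pipeline also handles role-specific details omitted here for brevity:

```python
def keep_matched_record(matched_args, key_roles, min_matched):
    """Apply the two document-level labeling constraints to one
    candidate (record, document) match.

    matched_args: dict mapping role -> matched argument, with None
                  when the argument was not found in the document.
    key_roles:    roles that must be non-empty for this event type.
    min_matched:  event-specific threshold on matched arguments.
    """
    # 1) all predefined key event roles must be matched
    if any(matched_args.get(r) is None for r in key_roles):
        return False
    # 2) enough arguments matched overall
    n_matched = sum(v is not None for v in matched_args.values())
    return n_matched >= min_matched

# a toy Equity Freeze match lacking a key role is dropped
args = {"EquityHolder": "A", "FrozeShares": None, "StartDate": "B"}
keep_matched_record(args, ["EquityHolder", "FrozeShares"], 2)  # False
```

Tuning `key_roles` and `min_matched` per event type realizes the precision/recall trade-off mentioned above.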
Novel DEE Task.
We reformalize DEE as directly filling event tables based on a document, which generally requires: 1) event detection, judging whether a document triggers each event type; 2) entity extraction, extracting entity mentions as argument candidates; and 3) event table filling, filling arguments into the tables of triggered events. This formalization differs substantially from vanilla SEE with trigger words and is consistent with the simplified DS-based event labeling above.
The key idea of Doc2EDAG is to transform tabular event records into an EDAG and let the model learn to generate this EDAG based on document-level contexts. For instance, Figure 3 shows the EDAG corresponding to the event example in Figure 2. The general architecture of Doc2EDAG, shown in Figure 4, consists of two key stages: document-level entity encoding (Section 5.1) and EDAG generation (Section 5.2). Before elaborating on each of them, we first describe two preliminary modules: input representation and entity recognition.
In this paper, we denote a document as a sequence of sentences. Formally, after looking up the token embedding table $V \in \mathbb{R}^{|V| \times d_w}$, where $|V|$ is the vocabulary size, we represent a document as a sentence sequence $d = [s_1, \dots, s_{N_s}]$, where each sentence $s_i = [w_{i,1}, \dots, w_{i,N_w}]$ is a sequence of token embeddings, $N_s$ and $N_w$ are the maximum lengths of the sentence sequence and the token sequence, respectively, and $w_{i,j} \in \mathbb{R}^{d_w}$ is the embedding of token $j$ in sentence $i$ with embedding size $d_w$.
Entity recognition is a typical sequence tagging task. We conduct this task at the sentence level, following the classic BI-LSTM-CRF model [Huang et al.2015], which first encodes the token sequence and then adds a conditional random field (CRF) layer to facilitate sequence tagging. The only difference is that we employ the Transformer [Vaswani et al.2017] instead of the original LSTM encoder [Hochreiter and Schmidhuber1997]. The Transformer encodes a sequence of embeddings with multi-headed self-attention to exchange contextual information among them. Due to its superior performance, we employ the Transformer as the primary context encoder in this work and name the module used in this stage Transformer-1. Formally, for each sentence $s_i$, we obtain the encoded sequence $c_i = \text{Transformer-1}(s_i)$, which shares the same embedding size $d_w$ and sequence length $N_w$. During training, we use the roles of matched arguments as entity labels with the classic BIO (Begin, Inside, Other) scheme and wrap $c_i$ with a CRF layer to obtain the entity-recognition loss $\mathcal{L}_{er}$. For inference, we use Viterbi decoding to obtain the best tagging sequence.
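To illustrate the BIO scheme used for the token-level labels, the following toy function (our own sketch, not the CRF tagger itself) converts matched argument spans into per-token tags:

```python
def bio_labels(tokens, spans):
    """Assign BIO entity labels for matched argument spans.

    tokens: the token sequence of one sentence.
    spans:  maps (start, end) token offsets (end exclusive) to the
            event role of the matched argument.
    """
    labels = ["O"] * len(tokens)
    for (start, end), role in spans.items():
        labels[start] = f"B-{role}"            # span beginning
        for i in range(start + 1, end):
            labels[i] = f"I-{role}"            # span continuation
    return labels

bio_labels(["[PER]", "pledged", "shares"], {(0, 1): "Pledger"})
# -> ['B-Pledger', 'O', 'O']
```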
5.1 Document-level Entity Encoding
To address the arguments-scattering challenge efficiently, it is indispensable to leverage global contexts to better identify whether an entity plays a specific event role. Consequently, the document-level entity encoding stage aims to encode extracted entity mentions with such contexts and produce an embedding of size $d_w$ for each entity mention.
Entity & Sentence Embedding.
As an entity mention usually covers multiple tokens of variable length, we first obtain a fixed-size embedding for each mention by applying an attentively weighted average (AWA) module over its covered token embeddings; the AWA module was introduced by [Yang et al.2016] to aggregate multiple token embeddings into a single embedding. For example, given an entity mention covering tokens $j_1$ to $j_2$ of sentence $i$, we feed $[c_{i,j_1}, \dots, c_{i,j_2}]$ into the AWA module to obtain the mention embedding. For each sentence $s_i$, we employ another AWA module over the encoded token sequence $c_i$ to obtain a single sentence embedding. At the end of this stage, both mention and sentence embeddings share the same embedding size $d_w$.
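To make the AWA module concrete, here is a minimal NumPy sketch of attention pooling. Note that the original module of [Yang et al.2016] applies a small feed-forward transformation before the dot product with the context vector; we omit that here for brevity, so this is a simplified sketch rather than the exact module:

```python
import numpy as np

def awa(X, u):
    """Attentively weighted average over row vectors.

    X: (n, d) matrix of token embeddings.
    u: (d,) trainable context vector.
    Returns a single (d,) embedding, a convex combination of the rows.
    """
    scores = X @ u                        # (n,) attention logits
    scores = scores - scores.max()        # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()
    return alpha @ X                      # (d,) weighted average

X = np.array([[1.0, 0.0], [0.0, 1.0]])
u = np.zeros(2)                           # zero logits => uniform attention
awa(X, u)                                 # -> [0.5, 0.5]
```

With a trained `u`, tokens more relevant to the aggregation task receive higher weights instead of the uniform average shown here.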
To enable awareness of document-level contexts, we employ a second Transformer module, Transformer-2, to facilitate information exchange among all entity mentions and sentences. Before feeding them into Transformer-2, we add sentence position embeddings to encode the sentence order. After the Transformer encoding, we employ another AWA module to merge entity mention embeddings that share the same entity surface name into a single embedding. Formally, after this stage, we obtain document-level context-aware mention and sentence representations $H^e \in \mathbb{R}^{N_e \times d_w}$ and $H^s \in \mathbb{R}^{N_s \times d_w}$, respectively, where $N_e$ is the number of distinct entity surface names. These aggregated representations serve the next stage, which fills event tables directly.
5.2 EDAG Generation
After the document-level entity encoding stage, we first obtain the document embedding by applying another AWA module over the sentence tensor $H^s$ and conduct the event-triggering classification for each event type. Afterward, we generate an EDAG for each triggered event.
The EDAG format aims to simplify the hard event-table-filling task into several tractable path-expanding sub-tasks. For each event type, we first manually define an event role order. Then, we transform each event record into a linked list of arguments following this order, where each argument node is either an entity or a special empty argument NA. Next, we merge these linked lists into an EDAG by sharing the same prefix path. Finally, every complete path of the EDAG corresponds to one row of the raw event table. In Figure 4, we show a toy EDAG with three event roles.
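The record-to-EDAG transformation described above amounts to building a prefix-sharing tree over role-ordered argument sequences. A hypothetical Python sketch, with `None` standing for the special empty argument NA:

```python
def build_edag(records, role_order):
    """Merge event records (role -> argument dicts) into a nested-dict
    tree keyed by the given role order; records sharing a prefix of
    arguments share the corresponding path."""
    root = {}
    for rec in records:
        node = root
        for role in role_order:
            arg = rec.get(role)            # None encodes NA
            node = node.setdefault(arg, {})
    return root

records = [
    {"Pledger": "[PER]", "PledgedShares": "[SHARE2]", "Pledgee": "[ORG1]"},
    {"Pledger": "[PER]", "PledgedShares": "[SHARE4]", "Pledgee": "[ORG2]"},
]
edag = build_edag(records, ["Pledger", "PledgedShares", "Pledgee"])
# both records share the "[PER]" prefix node, then branch
assert list(edag) == ["[PER]"] and len(edag["[PER]"]) == 2
```

Each root-to-leaf path of the resulting tree recovers exactly one row of the original event table, as described above.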
When generating the EDAG sequentially, it is crucial to consider both document-level contexts and the entities already in the path. Therefore, we design a memory mechanism that initializes a memory tensor $m$ with the sentence tensor $H^s$ and updates $m$ when expanding the EDAG by appending either the associated entity embedding or a zero-padded one for the NA argument.
When expanding the path for event role $r$, we conduct a binary classification for each entity: expand (1) or not (0). To make the model aware of the current path state, history contexts, and the current role, we first concatenate the memory tensor $m$ and the entity tensor $H^e$, add a trainable event-role-indicator embedding, and encode the result with the third Transformer module, Transformer-3, to facilitate context-aware reasoning. Afterward, we extract the enriched entity tensor $\tilde{H}^e$ from the outputs of Transformer-3 and conduct the path-expanding classification based on $\tilde{H}^e$.
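Putting the path-expanding sub-tasks together, EDAG decoding can be viewed as a recursive expansion in which a binary classifier decides, per role and entity, whether to extend each path. The sketch below is illustrative only: it replaces Transformer-3 with an abstract `expand_fn` and the memory tensor with the plain list of arguments chosen so far:

```python
def generate_edag_paths(entities, role_order, expand_fn):
    """Sketch of sequential path-expanding decoding.

    expand_fn(role, memory, entity) stands in for the Transformer-3
    binary classifier: given the current role, the arguments chosen so
    far (the analogue of the memory tensor), and a candidate entity,
    it returns True when the path should be expanded with that entity.
    """
    records = []

    def recurse(role_idx, memory):
        if role_idx == len(role_order):
            # a complete path corresponds to one event record
            records.append(dict(zip(role_order, memory)))
            return
        role = role_order[role_idx]
        candidates = [e for e in entities if expand_fn(role, memory, e)]
        if not candidates:
            candidates = [None]        # the special empty argument NA
        for ent in candidates:
            recurse(role_idx + 1, memory + [ent])

    recurse(0, [])
    return records
```

A toy classifier that accepts only "A" for the first role but any entity for the second yields two records sharing the "A" prefix, mirroring the branching EDAG of Figure 3.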
For the event-triggering classification, we calculate the cross-entropy loss $\mathcal{L}_{ed}$. During EDAG generation, we also calculate a cross-entropy loss for each path-expanding classification and sum these losses as the final EDAG-generation loss $\mathcal{L}_{dag}$. Finally, we combine them with the entity-recognition loss $\mathcal{L}_{er}$ as $\mathcal{L} = \lambda_1 \mathcal{L}_{er} + \lambda_2 \mathcal{L}_{ed} + \lambda_3 \mathcal{L}_{dag}$, where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are hyper-parameters. Moreover, when calculating $\mathcal{L}_{dag}$, we set $\gamma$ as the negative class weight of the cross-entropy loss to penalize false-positive errors, since such errors can result in a completely wrong path. During training, we use scheduled sampling [Bengio et al.2015] to adaptively switch the inputs of the document-level entity encoding from ground-truth entity mentions to model-recognized ones.
As for the inference, Doc2EDAG first recognizes entity mentions from sentences, then encodes them with document-level contexts, and finally generates an EDAG for each triggered event type to directly fill event tables.
In the remainder of this section, we first detail the experimental setup and then present comprehensive performance comparisons to demonstrate the superiority of Doc2EDAG in handling DEE-specific challenges.
6.1 Experimental Setup
Data Collection with Event Labeling.
We utilize ten years (2008–2018) of ChFinAnn documents (crawled from http://www.cninfo.com.cn/new/index) and human-summarized event knowledge bases to conduct DS-based event labeling, following the steps introduced in Section 4. We focus on five event types: Equity Freeze (EF), Equity Repurchase (ER), Equity Underweight (EU), Equity Overweight (EO) and Equity Pledge (EP); these are major events that must be disclosed to the regulator and can have a huge impact on company value. To ensure labeling quality, we apply the constraints on matched document-record pairs described in Section 4. Moreover, we use character-level tokenization directly to avoid error propagation from Chinese word segmentation tools.
Finally, we obtain a large document collection, ten times larger than that of DCFEE and many times larger than ACE 2005. We divide these documents into train, development, and test sets based on the time order. In Table 1, we show the number of documents and the multi-event ratio for each event type in this dataset. Note that some documents may involve multiple event types.
To verify the quality of DS-based event labeling, we randomly select a set of documents and manually annotate them. Regarding the DS-generated event tables as predictions and the human-annotated ones as ground truth, we evaluate labeling quality with the metric introduced below. Table 2 shows this approximate evaluation: the DS-generated data are quite good, achieving high precision and acceptable recall. In later experiments, we directly use the automatically generated test set for evaluation due to its much broader coverage.
The ultimate goal of DEE is to fill event tables with correct arguments for each role. Therefore, we directly compare the predicted event table with the ground-truth one for each event type. Specifically, for each document and each event type, we repeatedly pick one predicted record and one ground-truth record (at least one of them non-empty) from the associated event tables, without replacement, and compute event-role-specific true-positive, false-positive and false-negative statistics until no record is left. After aggregating these statistics over all evaluated documents, we calculate role-level precision, recall, and F1 scores. As an event type often includes multiple roles, we report the micro-averaged role-level scores as the final event-level metric, which directly reflects the event-table-filling ability.
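The pairing procedure can be sketched as follows. This is our own simplified approximation: each predicted record is greedily paired, without replacement, with the remaining ground-truth record sharing the most arguments, and leftover ground-truth records contribute false negatives:

```python
def role_level_stats(pred_records, gold_records, roles):
    """Accumulate role-level TP/FP/FN over paired records.
    Records are dicts mapping role -> argument (None = empty)."""
    gold = [dict(g) for g in gold_records]
    tp = fp = fn = 0
    for pred in pred_records:
        # pair with the most-overlapping remaining gold record
        overlap = lambda g: sum(
            pred.get(r) is not None and pred.get(r) == g.get(r)
            for r in roles)
        g = max(gold, key=overlap) if gold else {}
        if gold:
            gold.remove(g)
        for r in roles:
            p, t = pred.get(r), g.get(r)
            if p is not None and p == t:
                tp += 1          # correct argument for this role
            elif p is not None:
                fp += 1          # predicted but wrong or spurious
            if t is not None and p != t:
                fn += 1          # gold argument missed
    # unmatched gold records are all misses
    for g in gold:
        fn += sum(v is not None for v in g.values())
    return tp, fp, fn
```

Precision, recall, and F1 then follow from the aggregated counts, e.g. `prec = tp / (tp + fp)`, micro-averaged over all roles of an event type.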
For the input, we cap the number of sentences at $N_s$ and the sentence length at $N_w$. During training, we fix the loss weights $\lambda_1$, $\lambda_2$, $\lambda_3$ and the negative class weight $\gamma$. Besides, we employ the Adam [Kingma and Ba2015] optimizer with a fixed learning rate, train for a bounded number of epochs, and pick the best epoch by the validation score on the development set.
Note that we leave other detailed hyper-parameters, event type specifications, and detailed model structures to Section A. Moreover, the dataset and code will be made publicly available to make the experimental results reported in this paper reproducible and to facilitate further research on DEE.
6.2 Performance Comparisons
As discussed in the related work, the state-of-the-art method applicable to our setting is DCFEE. We follow the implementation described in [Yang et al.2018], but that work does not detail how to handle multi-event sentences. Thus, we develop two versions, DCFEE-O and DCFEE-M: DCFEE-O produces only one event record from one key-event sentence, while DCFEE-M tries to produce multiple possible argument combinations based on the closest relative distance to the key-event sentence. For fairness, the SEE stage of both versions shares the same neural architecture as the entity recognition part of Doc2EDAG. Besides, we further employ a simple decoding baseline of Doc2EDAG, GreedyDec, which greedily fills only one entry of the event table using recognized entity roles, to demonstrate the importance of using document-level contexts.
As Table 3 shows, Doc2EDAG achieves significant improvements over all baselines for all event types, with large F1 gains over DCFEE-O on the EF, ER, EU, EO and EP events. These vast improvements mainly owe to the document-level end-to-end modeling of Doc2EDAG. Moreover, since we work on automatically generated data, direct document-level supervision can be more robust than the extra sentence-level supervision used in DCFEE, which assumes the sentence containing the most event arguments to be the key-event one. This assumption does not hold well for some event types, such as EF, EU and EO, on which DCFEE-O is even inferior to the most straightforward baseline, GreedyDec. Besides, DCFEE-O achieves better results than DCFEE-M, which demonstrates that naively guessing multiple events from the key-event sentence does not work well. Comparing Doc2EDAG with GreedyDec, which attains high precision but low recall, explicitly shows the benefit of document-level end-to-end modeling.
Single-Event vs. Multi-Event.
We divide the test set into a single-event set, containing documents with just one event record, and a multi-event set, containing the rest, to show the extreme difficulty when arguments-scattering meets multi-event. Table 4 shows F1 scores for these scenarios. Although Doc2EDAG still attains the highest extraction performance in all cases, the multi-event set is extremely challenging, as the performance of all models drops significantly. GreedyDec decreases most drastically, since it has no mechanism for generating multiple event records. DCFEE-O decreases less, but remains far behind Doc2EDAG. On the multi-event set, Doc2EDAG with its document-level end-to-end modeling achieves a large average F1 gain over DCFEE-O, the best baseline.
To verify the contributions of the key model components, we conduct ablation tests with two Doc2EDAG variants: 1) -DocEnc, which removes the Transformer module used in document-level entity encoding, and 2) -PathMem, which removes the memory mechanism during EDAG generation. Table 5 compares Doc2EDAG with these two variants. Removing either module causes apparent performance degradation. In particular, the memory mechanism is of prime importance, as removing it results in drastic F1 declines on four event types; the exception is the ER event, whose multi-event ratio is very low on the test set.
Taking the document shown in Figure 2 as an example, Doc2EDAG successfully generates the correct EDAG, as shown in Figure 3, while DCFEE inevitably makes many mistakes even when it identifies key sentences correctly. Moreover, we include three further case studies in the appendix, which intuitively illustrate the superiority of Doc2EDAG.
We propose a novel end-to-end solution, Doc2EDAG, for DEE that can directly operate at the document level. To ease DS-based event labeling, we reformalize a novel DEE task with the no-trigger-words design. With this design, we build a large-scale real-world dataset with the unique challenges of arguments-scattering and multi-event. Furthermore, based on this dataset, we conduct extensive experiments and present comprehensive analyses to illustrate the superiority of Doc2EDAG. Notably, our fairly general labeling and modeling strategies hold vast potential to benefit other domains with similar challenges, such as legislation and health.
- [Ahn2006] David Ahn. The stages of event extraction. In Proceedings of the Workshop on Annotating and Reasoning about Time and Events, 2006.
- [Ba et al.2016] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
- [Bengio et al.2015] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In NIPS, 2015.
- [Chen et al.2015] Yubo Chen, Liheng Xu, Kang Liu, Daojian Zeng, and Jun Zhao. Event extraction via dynamic multi-pooling convolutional neural networks. In ACL-IJCNLP, 2015.
- [Chen et al.2017] Yubo Chen, Shulin Liu, Xiang Zhang, Kang Liu, and Jun Zhao. Automatically labeled data generation for large scale event extraction. In ACL, 2017.
- [Hochreiter and Schmidhuber1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 1997.
- [Hong et al.2011] Yu Hong, Jianfeng Zhang, Bin Ma, Jianmin Yao, Guodong Zhou, and Qiaoming Zhu. Using cross-entity inference to improve event extraction. In ACL, 2011.
- [Huang et al.2015] Zhiheng Huang, Wei Xu, and Kai Yu. Bidirectional lstm-crf models for sequence tagging. arXiv preprint arXiv:1508.01991, 2015.
- [Ji and Grishman2008] Heng Ji and Ralph Grishman. Refining event extraction through cross-document inference. In ACL, 2008.
- [Judea and Strube2016] Alex Judea and Michael Strube. Incremental global event extraction. In COLING, 2016.
- [Kingma and Ba2015] Diederick P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
- [Li et al.2013] Qi Li, Heng Ji, and Liang Huang. Joint event extraction via structured prediction with global features. In ACL, 2013.
- [Li et al.2014] Qi Li, Heng Ji, HONG Yu, and Sujian Li. Constructing information networks using one single model. In EMNLP, 2014.
- [Liao and Grishman2010] Shasha Liao and Ralph Grishman. Using document level cross-event inference to improve event extraction. In ACL, 2010.
- [Liu et al.2017] Shulin Liu, Yubo Chen, Kang Liu, and Jun Zhao. Exploiting argument information to improve event detection via supervised attention mechanisms. In ACL, 2017.
- [Mintz et al.2009] Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. Distant supervision for relation extraction without labeled data. In ACL, pages 1003–1011, 2009.
- [Nguyen and Nguyen2019] Trung Minh Nguyen and Thien Huu Nguyen. One for all: Neural joint modeling of entities and events. In AAAI, 2019.
- [Nguyen et al.2016] Thien Huu Nguyen, Kyunghyun Cho, and Ralph Grishman. Joint event extraction via recurrent neural networks. In NAACL, 2016.
- [Ren et al.2017] Xiang Ren, Zeqiu Wu, Wenqi He, Meng Qu, Clare R Voss, Heng Ji, Tarek F Abdelzaher, and Jiawei Han. Cotype: Joint extraction of typed entities and relations with knowledge bases. In WWW, 2017.
- [Riedel and McCallum2011] Sebastian Riedel and Andrew McCallum. Fast and robust joint models for biomedical event extraction. In EMNLP, 2011.
- [Sha et al.2018] Lei Sha, Feng Qian, Baobao Chang, and Zhifang Sui. Jointly extracting event triggers and arguments by dependency-bridge rnn and tensor-based argument interaction. In AAAI, 2018.
- [Srivastava et al.2014] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. JMLR, 2014.
- [Vaswani et al.2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
- [Wang et al.2018] Shaolei Wang, Yue Zhang, Wanxiang Che, and Ting Liu. Joint extraction of entities and relations based on a novel graph scheme. In IJCAI, 2018.
- [Yang and Mitchell2016] Bishan Yang and Tom M. Mitchell. Joint extraction of events and entities within a document context. In NAACL-HLT, 2016.
- [Yang et al.2016] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. Hierarchical attention networks for document classification. In NAACL-HLT, 2016.
- [Yang et al.2018] Hang Yang, Yubo Chen, Kang Liu, Yang Xiao, and Jun Zhao. Dcfee: A document-level chinese financial event extraction system based on automatically labeled training data. Proceedings of ACL 2018, System Demonstrations, 2018.
- [Zeng et al.2018a] Xiangrong Zeng, Daojian Zeng, Shizhu He, Kang Liu, and Jun Zhao. Extracting relational facts by an end-to-end neural model with copy mechanism. In ACL, 2018.
- [Zeng et al.2018b] Ying Zeng, Yansong Feng, Rong Ma, Zheng Wang, Rui Yan, Chongde Shi, and Dongyan Zhao. Scale up event extraction learning via automatic training data generation. In AAAI, 2018.
- [Zhang and Ji2018] Tongtao Zhang and Heng Ji. Event extraction with generative adversarial imitation learning. arXiv preprint arXiv:1804.07881, 2018.
- [Zheng et al.2017] Suncong Zheng, Feng Wang, Hongyun Bao, Yuexing Hao, Peng Zhou, and Bo Xu. Joint extraction of entities and relations based on a novel tagging scheme. In ACL, 2017.
Appendix A Appendix
We omitted some details in the main body to give readers an overall picture of Doc2EDAG and a better reading experience. This appendix includes those omitted details for interested readers.
Section A.1 introduces event types used in our paper and corresponding preprocessing details when conducting the document-level event labeling.
Section A.2 describes some basic building blocks of our model in detail.
Section A.3 presents a complete hyper-parameter setting to enable the reproducible research.
Section A.4 contains another three case studies.
a.1 Event Type Specifications
The table below gives detailed illustrations of the event types used in our paper, where we mark the key roles that must be non-empty during document-level event labeling. In addition to requiring non-empty key roles, we empirically set the minimum number of matched roles for the EF, ER, EU, EO and EP events to 5, 4, 4, 4 and 5, respectively. While we set these constraints empirically to ensure labeling quality for our data, practitioners in other domains can adjust these configurations freely to fulfill task-specific requirements by trading off precision against recall.
Moreover, when training models, we order event roles by the decreasing ratio of non-empty arguments per role, based on the intuition that more informative (non-empty) arguments in the path history facilitate better identification of subsequent arguments during recurrent decoding; we validate this intuition by comparing against models trained on randomly permuted role orders.
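Computing this role order is straightforward; a hypothetical sketch over DS-labeled records:

```python
def role_order_by_fill_ratio(records, roles):
    """Order event roles by decreasing non-empty-argument ratio over
    the training records, so that more informative roles come first
    during EDAG decoding."""
    def ratio(role):
        n = len(records)
        return sum(r.get(role) is not None for r in records) / max(n, 1)
    return sorted(roles, key=ratio, reverse=True)

recs = [{"Pledger": "A", "EndDate": None}, {"Pledger": "B", "EndDate": "D"}]
role_order_by_fill_ratio(recs, ["EndDate", "Pledger"])  # ['Pledger', 'EndDate']
```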
| Event Type | Event Role | Detailed Explanation |
| --- | --- | --- |
| Equity Freeze (EF) | Equity Holder (key) | the equity holder whose shares are frozen |
| | Froze Shares (key) | the number of shares being frozen |
| | Legal Institution (key) | the legal institution that executes this freeze |
| | Start Date | the start date of this freeze |
| | End Date | the end date of this freeze |
| | Unfroze Date | the date on which these shares are unfrozen |
| | Total Holding Shares | the total number of shares held at disclosing time |
| | Total Holding Ratio | the total ratio of shares held at disclosing time |
| Equity Repurchase (ER) | Company Name (key) | the name of the company |
| | Highest Trading Price | the highest trading price |
| | Lowest Trading Price | the lowest trading price |
| | Closing Date | the closing date of this disclosed repurchase |
| | Repurchased Shares | the number of shares repurchased before the closing date |
| | Repurchase Amount | the repurchase amount before the closing date |
| Equity Underweight (EU) | Equity Holder (key) | the equity holder who conducts this underweight |
| | Traded Shares (key) | the number of shares being traded |
| | Start Date | the start date of this underweight |
| | End Date | the end date of this underweight |
| | Average Price | the average price during this underweight |
| | Later Holding Shares | the number of shares held after this underweight |
| Equity Overweight (EO) | Equity Holder (key) | the equity holder who conducts this overweight |
| | Traded Shares (key) | the number of shares being traded |
| | Start Date | the start date of this overweight |
| | End Date | the end date of this overweight |
| | Average Price | the average price during this overweight |
| | Later Holding Shares | the number of shares held after this overweight |
| Equity Pledge (EP) | Pledger (key) | the equity holder who pledges some shares to an institution |
| | Pledged Shares (key) | the number of shares being pledged |
| | Pledgee (key) | the institution that accepts the pledged shares |
| | Start Date | the start date of this pledge |
| | End Date | the end date of this pledge |
| | Released Date | the date on which these pledged shares are released |
| | Total Pledged Shares | the total number of shares pledged at disclosing time |
| | Total Holding Shares | the total number of shares held at disclosing time |
| | Total Holding Ratio | the total ratio of shares held at disclosing time |
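The role ordering described above can be sketched as follows; the `records` structure and role names here are hypothetical toy examples for illustration, not our dataset format.

```python
from collections import Counter

def order_roles_by_nonempty_ratio(records, roles):
    """Sort event roles by the decreasing ratio of non-empty arguments
    across labeled event records (ties keep the given order)."""
    filled = Counter()
    for rec in records:
        for role in roles:
            if rec.get(role) is not None:
                filled[role] += 1
    return sorted(roles, key=lambda r: filled[r] / len(records), reverse=True)

# hypothetical toy records with two Equity Pledge roles
records = [
    {"Pledger": "A", "EndDate": None},
    {"Pledger": "B", "EndDate": "2019-01-01"},
]
order = order_roles_by_nonempty_ratio(records, ["EndDate", "Pledger"])
# "Pledger" is non-empty in every record, so it comes first
```

Roles with more frequently filled arguments are decoded first, so the path history tends to carry more information when later, sparser roles are identified.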
A.2 Basic Building Blocks
In Section 5 of the paper, we frequently employ two basic building blocks: 1) the attentively weighted average (AWA) module [Yang et al.2016] and 2) the Transformer module [Vaswani et al.2017]. Next, we present details of these two modules.
The AWA module (mentioned in Sections 5.1 and 5.2 of the paper) was employed by [Yang et al.2016] to obtain a sentence embedding from a sequence of word embeddings and to produce a final document embedding from the generated sentence embeddings. In our model, we adopt a similar design to reduce a sequence of embeddings with the same embedding size to a single embedding. Specifically, given a sequence of embeddings $G = [g_1; g_2; \dots; g_l]$, where each $g_i$ is an embedding of size $d_w$ and $l$ is the sequence length, we apply the scaled dot-product attention [Vaswani et al.2017]:
$$\tilde{g} = \mathrm{LayerNorm}(\mathrm{Dropout}(\mathrm{Attention}(u, G, G)))$$
to produce a single embedding $\tilde{g}$, where $u \in \mathbb{R}^{d_w}$ is a trainable parameter, LayerNorm is the layer normalization [Ba et al.2016], and Dropout is an effective technique to avoid overfitting [Srivastava et al.2014].
As described in Section 5 of the paper, we employ four different AWA modules for the following purposes:
getting a single entity mention embedding from a sequence of embeddings of covered tokens;
getting a single sentence embedding from the token embedding sequence;
inducing a single entity embedding from multiple mention embeddings with the same entity name;
obtaining a document embedding from encoded sentence embeddings to conduct the event-triggering classification.
Alternatives to AWA include max pooling and mean pooling. We also tried these two strategies and observed that AWA performed comparably to, and sometimes slightly better than, max pooling, and that both performed slightly better than mean pooling. To keep the focus on our core contributions, we simply adopt AWA and do not discuss this module choice at length. In general, any of these options works, and the pooling-based methods can be faster and have fewer parameters.
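As a minimal sketch of the AWA computation (omitting the Dropout and LayerNorm steps, and using NumPy in place of a deep-learning framework), a length-l sequence of d-dimensional embeddings is reduced to one embedding via scaled dot-product attention against a trainable query vector u:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def awa(G, u):
    """Attentively weighted average: G has shape (l, d) and
    u is a trainable query vector of shape (d,)."""
    d = G.shape[1]
    weights = softmax(G @ u / np.sqrt(d))  # (l,) attention weights
    return weights @ G                     # (d,) single output embedding

# a toy sequence of 3 embeddings of size 4, with a random query
rng = np.random.default_rng(0)
G, u = rng.normal(size=(3, 4)), rng.normal(size=4)
out = awa(G, u)
assert out.shape == (4,)
```

Note that with a zero query the attention weights are uniform and AWA degenerates to mean pooling, which makes the relationship to the pooling alternatives above explicit.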
As for the Transformer module, we mainly follow [Vaswani et al.2017], but have also referred to an excellent guide, "The Annotated Transformer" (http://nlp.seas.harvard.edu/2018/04/03/attention.html).
In our setting, we employ the Transformer as a context encoder to exchange information among multiple inputs, with the following changes:
Transformer-1 (mentioned in Section 5 of the paper) looks up a sentence-level trainable position-embedding table and adds the associated token-positional embeddings to the original token embeddings, since the multi-headed self-attention mechanism of the Transformer is position-agnostic.
Before feeding inputs into Transformer-2 (mentioned in Section 5.2) and Transformer-3 (mentioned in Section 5.3), we add trainable sentence-positional and role-indicator embeddings, respectively, to the input embeddings to make the model aware of the specific encoding task.
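The position-injection step shared by these changes can be sketched as follows; the table size and dimensions are illustrative assumptions, and a real model would learn the table jointly with the encoder rather than fixing a random initialization.

```python
import numpy as np

rng = np.random.default_rng(1)
max_positions, d = 16, 8
# trainable positional-embedding table (randomly initialized here)
pos_table = rng.normal(scale=0.02, size=(max_positions, d))

def add_positional(embeddings):
    """Add the i-th positional embedding to the i-th input embedding,
    so a position-agnostic self-attention encoder can distinguish
    positions in the input sequence."""
    l = embeddings.shape[0]
    return embeddings + pos_table[:l]

x = rng.normal(size=(5, d))
y = add_positional(x)
assert y.shape == (5, d)
```

The sentence-positional and role-indicator embeddings work the same way, only the lookup key changes (sentence index or role index instead of token position).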
Moreover, for the entity-recognition part, we refer readers to [Huang et al.2015] for details on stacking a conditional random field (CRF) layer over the encoded representations.
A.3 Hyper-Parameter Settings
We summarize all hyper-parameters in Table 7 for reproducible research.
| Group | Hyper-parameter | Value |
| --- | --- | --- |
| Input Representation | the maximum sentence number | |
| | the maximum sentence length | |
| | the embedding size | |
| Entity Recognition | the tagging scheme | BIO (Begin, Inside, Other) |
| | the hidden size | (same as the embedding size) |
| Transformer-1 / Transformer-2 / Transformer-3 | the number of layers | 4 |
| | the size of the hidden layer | (same as the embedding size) |
| | the size of the feed-forward layer | |
| Optimization | the optimizer | Adam [Kingma and Ba2015] |
| | the learning rate | |
| | the batch size | 64 (with 32 GPUs) |
| | the training epochs | 100 |
| | the loss reduction type | sum |
| | the dropout probability | |
| | the scheduled-sampling beginning epoch | |
| | the scheduled-sampling ending epoch | |
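The scheduled-sampling rows above can be read as a schedule that switches from teacher-forced to model-predicted path history between a beginning and an ending epoch. One common variant is a linear anneal, sketched below; the linear curve and the example epochs are assumptions for illustration, not a restatement of our exact schedule.

```python
def teacher_forcing_prob(epoch, begin_epoch, end_epoch):
    """Probability of feeding ground-truth (teacher-forced) history,
    linearly annealed from 1.0 to 0.0 between begin_epoch and
    end_epoch (both hypothetical values in this sketch)."""
    if epoch < begin_epoch:
        return 1.0
    if epoch >= end_epoch:
        return 0.0
    return 1.0 - (epoch - begin_epoch) / (end_epoch - begin_epoch)

# before the beginning epoch, always teacher-force;
# halfway through the window, teacher-force half the time
probs = [teacher_forcing_prob(e, 10, 20) for e in (5, 15, 25)]
```

At each decoding step during training, ground-truth history is used with this probability and the model's own predictions otherwise, so the model is gradually exposed to its own errors.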
A.4 Case Studies
In addition to the Equity Pledge example included in the paper, we show another three cases with comprehensive analyses in Figures 5, 6 and 7 for the Equity Overweight, Equity Underweight and Equity Freeze events, respectively, where we color wrongly predicted arguments red and present detailed explanations.