Document-level entity-based extraction (EE) refers to tasks that extract entity-centric information, such as entities and their relations, from unstructured text across multiple sentences. With the rise of big data in recent years, document-level EE is growing in importance, with applications such as understanding clinical reports nye2020understanding, extracting document-level events huang2021document, and building knowledge graphs from journals wu2020extracting. In this work, we focus on two classic tasks of document-level EE: role-filler entity extraction (REE) and relation extraction (RE).
Motivated by these challenges, we propose to formulate REE and RE tasks as template generation. Due to the autoregressive nature of the generative setup, this formulation makes dependencies among the output entities easier to capture than sequence tagging methods do. Moreover, label names are incorporated into the decoder targets, which exploits label semantics not available to the extractive counterparts. Furthermore, for tasks that involve the identification of N-ary relations, this formulation significantly reduces the computational cost of comparing exponentially many combinations of entities. To solve the template generation problem effectively, we propose a generative framework, Cross-attention Guided Template Generation (TempGen), that incorporates a novel copy mechanism into a pre-trained sequence-to-sequence model.
Our contributions can be summarized as follows:
We propose to formulate document-level EE tasks as a template generation problem, which allows our generative framework to effectively capture cross-entity dependencies, better identify entities with label semantics, and avoid the exponential computational complexity of identifying N-ary relations.
We devise a novel copy mechanism based on cross-attention to enable our model to better learn how to copy key information from the input document.
Our approach achieves state-of-the-art results on the MUC-4 role-filler entity extraction task and the SciREX relation extraction task, while being data-efficient compared to previous systems.
This section gives an overview of the two document-level EE tasks we tackled in this work: role-filler entity extraction (REE) and relation extraction (RE).
2.1 Role-filler Entity Extraction
The REE task aims to extract all entities involved in events from the input article du-2020-grit. It differs from the event template extraction task introduced by the MUC-4 dataset muc-1992-message in that only one event template, as opposed to many, is output for each input document. For documents associated with multiple event templates (all events in MUC-4 are of Attack type), the event templates are collapsed into one: systems are required to identify all entities associated with the different events for each role type. An event template consists of a set of pre-defined roles, and each role is filled with zero to many entities, as shown in Figure 2. An entity is characterized by a group of mentions, which are spans of text in the input document.
2.2 Relation Extraction
We focus on end-to-end document-level relation extraction, where systems first extract entities from the input document and then identify the N-ary non-typed relations among the extracted entities. To our knowledge, the SciREX jain-etal-2020-scirex dataset is the only dataset that supports such end-to-end configurations. Thus, we follow the definition of document-level RE in SciREX, which contains binary and 4-ary relation annotations. A binary relation contains two typed entities, and a 4-ary relation contains four typed entities. An entity is represented by a cluster of mentions, similar to the REE task. Systems should first extract salient entities of pre-defined types, where salient entities are the entities needed to describe the results of the corresponding scientific article. Then, binary and 4-ary relations among salient entities are identified. A binary relation example is shown in Figure 2.
3 Proposed Methods
In this section, we first illustrate how the REE and RE tasks can be framed as a template generation problem. This formulation then allows us to capture cross-entity dependencies easily with our proposed generative model, a pre-trained sequence-to-sequence model integrated with a copy mechanism.
3.1 Template Generation Formulation
We frame the REE and RE tasks as a template generation problem, as shown in Figure 2. A template is composed of slot names and slot values. For both tasks, slot names are entity types, and slot values are all entity mentions corresponding to such entity types. Similar to previous works on REE huang-riloff-2011-peeling; du-cardie-2020-document; du-2020-grit, we only generate one template per document without differentiating which event template each entity mention is associated with. In contrast, for RE, we generate multiple templates, each corresponding to a relation. A binary relation can be represented by a template of 2 slots, whereas a 4-ary relation forms a 4-slot template. A relation template consists of typed mentions of the corresponding salient entities. After transforming REE and RE annotations into templates, each template is transformed into a template sequence with special tags delimiting templates, slot names, and slot values.
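Formally, a document of $n$ tokens $x = (x_1, \dots, x_n)$ may correspond to a decoding target of zero to many template sequences $T_1, \dots, T_q$. A template sequence $T$ is characterized by multiple slot sequences $S_1, \dots, S_r$,

$$T = \langle t \rangle \; S_1 \, S_2 \cdots S_r \; \langle /t \rangle \tag{1}$$

A slot sequence $S_j$ is represented by slot names and entities,

$$S_j = \langle n \rangle \; l_j \; \langle /n \rangle \; \langle e \rangle \; m_j \; \langle /e \rangle \tag{2}$$

where $l_j$ is the slot name (a role in REE and an entity type in RE), and $m_j$ is the token sequence that corresponds to one mention randomly sampled from entity $E_j$. Here $\langle t \rangle$, $\langle n \rangle$, and $\langle e \rangle$, together with their closing counterparts, stand for the special tokens that delimit templates, slot names, and entity mentions. Special tokens, such as $\langle n \rangle$ and $\langle e \rangle$, indicate whether a tag-enclosed string is a slot name or an entity mention. In the first row of the REE example from Figure 2, $l_1$ would be PerpInd and $m_1$ would be “group of terrorists”. Using this formulation, the scalability challenge of modeling cross-entity dependencies is alleviated due to the significantly reduced distances between entities in template sequences.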
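The linearization described above can be illustrated with a minimal sketch. The delimiter strings (`<T>`, `<N>`, `<E>` and their closing forms), the function names, and the mention lists are assumptions made for illustration only; they are not necessarily the special tokens or data structures used by TempGen.

```python
import random

# Hypothetical delimiter strings, chosen only for this sketch.
SOT, EOT = "<T>", "</T>"      # template start / end
SON, EON = "<N>", "</N>"      # slot-name start / end
SOE, EOE = "<E>", "</E>"      # entity-mention start / end


def slot_sequence(slot_name, mention):
    """Wrap a slot name and one entity mention in their delimiters."""
    return f"{SON} {slot_name} {EON} {SOE} {mention} {EOE}"


def template_sequence(slots, rng):
    """slots: list of (slot_name, mentions); one mention is sampled per entity."""
    parts = [slot_sequence(name, rng.choice(mentions)) for name, mentions in slots]
    return " ".join([SOT, *parts, EOT])


def decoding_target(templates, rng=None):
    """Concatenate zero or more template sequences into one decoding target."""
    rng = rng or random.Random(0)
    return " ".join(template_sequence(t, rng) for t in templates)


# Illustrative REE template: each entity is a list of its mentions.
ree_template = [
    ("PerpInd", ["group of terrorists"]),
    ("PerpOrg", ["Zarate Wilka Armed Forces of Liberation"]),
]
target = decoding_target([ree_template])
print(target)
```

A document with no templates maps to an empty target, and RE documents would simply pass several 2-slot or 4-slot templates to `decoding_target`.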
3.2 Cross-attention Guided Template Generation
The template generation problem can be broken down into two sub-goals: (1) generating valid template structures while capturing the dependencies between the input document and decoder targets, and (2) ensuring that salient mentions in the input document are correctly identified and outputted by the decoder. To achieve the first sub-goal, we leverage BART lewis-etal-2020-bart, a pre-trained sequence-to-sequence model. The second sub-goal is achieved using a novel copy mechanism incorporated into BART.
Seq2Seq Model for Template Generation
BART lewis-etal-2020-bart is a pre-trained language model that combines bidirectional and auto-regressive transformers. Pre-trained with multiple denoising objectives, BART has demonstrated significant advantages in various text generation tasks, especially summarization lewis-etal-2020-bart. (We also considered the SOTA abstractive summarization LM, PEGASUS zhang2019pegasus, but its GPU memory consumption was too high for us to test.) The template generation problem closely resembles summarization, except that the generated template sequences contain implicit structures. Given its various denoising pre-training objectives, we believe that BART can capture the implicit structure within template sequences, effectively model the dependencies among predicted entities, and produce rich semantics for reasoning over slot names and entities.
Cross-attention guided copy mechanism
To enhance BART’s capabilities to identify salient mentions in the input documents, we incorporate a copy mechanism based on cross-attention. As cross-attentions often imply saliency of input tokens, a naive approach to computing the copy distribution $P_{copy}^t$ at time step $t$ over the input tokens is to take the mean of the last decoder layer’s cross-attention across all heads, as mentioned in xu-etal-2020-self,

$$\alpha_t^i = \mathrm{softmax}\!\left(\frac{(W_s^i s_t)^\top (W_h^i h)}{\sqrt{d_k}}\right), \qquad P_{copy}^t = \frac{1}{H} \sum_{i=1}^{H} \alpha_t^i \tag{3}$$

where $\alpha_t^i$ is the attention scores over input tokens at decoding step $t$ for head $i$, and $H$ is the number of heads. $W_h^i$ and $W_s^i$ are the projection matrices for the encoder and the decoder, and $d_k$ is the per-head dimension. $s_t$ is the decoder hidden states at step $t$, and $h$ denotes the encoder hidden states.
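As a rough illustration of the naive approach, the sketch below averages per-head attention distributions over the input tokens. It assumes the raw cross-attention scores of the last decoder layer are already available as plain lists; the function names and the toy scores are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def naive_copy_distribution(head_scores):
    """head_scores[i][j]: raw cross-attention score of head i for input token j.

    Returns the mean of the per-head attention distributions.
    """
    head_probs = [softmax(row) for row in head_scores]
    n_heads = len(head_probs)
    n_tokens = len(head_probs[0])
    return [sum(p[j] for p in head_probs) / n_heads for j in range(n_tokens)]


# Toy example: 3 heads attending over 3 input tokens at one decoding step.
scores = [[2.0, 0.5, 0.1], [0.0, 0.0, 0.0], [0.3, 1.5, 0.2]]
p_copy = naive_copy_distribution(scores)
print(p_copy)
```

Because every head contributes equally, the uninformative second head (uniform attention) flattens the resulting distribution, which is the kind of noise the following paragraphs aim to filter out.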
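However, recent studies have shown that attention heads are not equally important, and that some heads can be pruned with only a marginal decrease in overall performance voita-etal-2019-analyzing; NEURIPS2019_2c601ad9. We hypothesize that the attention probabilities produced by insignificant attention heads may be noisy. Thus, computing copy distributions without these heads could improve the model’s ability to infer the importance of each token in the input document. Motivated by this hypothesis, we propose TopK Copy, a copy mechanism where only the Top-$k$ important attention heads are used for computing copy distributions. Consider the formulation of multi-head attention, following the notation from ashish-etal-2017-attention:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_H)\, W^O, \qquad \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V) \tag{4}$$

$W_i^Q$, $W_i^K$, and $W_i^V$ are the projection matrices for computing attention. $W^O \in \mathbb{R}^{H d_v \times d_{model}}$ is the matrix that allows interaction between different attention heads, where $H$ is the number of heads. To determine the importance of each attention head, we first transform $W^O$ to dimension $H \times d_v \times d_{model}$ (Equation 5), and then sum over the last two dimensions of the transformed matrix $\tilde{W}^O$ (Equation 6),

$$\tilde{W}^O = \mathrm{reshape}(W^O) \in \mathbb{R}^{H \times d_v \times d_{model}} \tag{5}$$

$$\mathrm{sig}_i = \sum_{j=1}^{d_v} \sum_{l=1}^{d_{model}} \tilde{W}^O_{i,j,l} \tag{6}$$

where $\mathrm{sig}_i$ denotes the significance score for head $i$. We take the attention heads with the Top-$k$ highest significance scores in the last cross-attention layer, and use the mean of the attention probabilities output by these heads as the copy distribution, as shown in Equations 7 and 8,

$$\mathcal{H}_k = \operatorname*{arg\,top\text{-}k}_{i \in \{1, \dots, H\}} \; \mathrm{sig}_i \tag{7}$$

$$P_{copy}^t = \frac{1}{k} \sum_{i \in \mathcal{H}_k} \alpha_t^i \tag{8}$$

where $\alpha_t^i$ is the cross-attention distribution of head $i$ over the input tokens at decoding step $t$.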
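The head-selection step can be sketched as follows. This is a minimal illustration under stated assumptions: the output projection is stored as a list of rows with shape (number of heads × per-head dimension, model dimension), so that each consecutive block of rows belongs to one head, and the significance score is the plain sum described in the text. Frameworks may store this matrix transposed, and the function names and toy values are hypothetical.

```python
def head_significance(w_o, n_heads):
    """Sum the output-projection weights belonging to each head.

    w_o: list of rows, shape (n_heads * d_head, d_model). Rows
    [i * d_head, (i + 1) * d_head) are assumed to belong to head i.
    """
    d_head = len(w_o) // n_heads
    scores = []
    for i in range(n_heads):
        block = w_o[i * d_head:(i + 1) * d_head]
        scores.append(sum(sum(row) for row in block))
    return scores


def topk_copy_distribution(head_probs, significance, k):
    """Average the attention distributions of the k most significant heads."""
    ranked = sorted(range(len(significance)), key=lambda i: significance[i], reverse=True)
    top = ranked[:k]
    n_tokens = len(head_probs[0])
    p_copy = [sum(head_probs[i][j] for i in top) / k for j in range(n_tokens)]
    return p_copy, top


# Toy example: 3 heads, d_head = 2, d_model = 2, 3 input tokens.
w_o = [[0.5, 0.5], [0.5, 0.5],
       [0.0, 0.1], [0.1, 0.0],
       [0.3, 0.3], [0.2, 0.2]]
head_probs = [[0.7, 0.2, 0.1],
              [0.1, 0.1, 0.8],
              [0.5, 0.4, 0.1]]
sig = head_significance(w_o, n_heads=3)
p_copy, kept = topk_copy_distribution(head_probs, sig, k=2)
print(sig, kept, p_copy)
```

In this toy setting the second head has the lowest score and is dropped, so its attention mass on the third token no longer influences the copy distribution.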
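The final probability of a word $w$ is a weighted sum of the vocabulary distribution $P_{vocab}$ computed by BART and the copy distribution $P_{copy}^t$,

$$P(w) = p_{gen}\, P_{vocab}(w) + (1 - p_{gen}) \sum_{j : x_j = w} P_{copy}^t(j) \tag{9}$$

where $p_{gen} \in [0, 1]$ is the generation probability computed by passing the dot product of the mean encoder hidden state $\bar{h}$ and the decoder hidden state $s_t$ at time step $t$ through the sigmoid function,

$$p_{gen} = \sigma\!\left(\bar{h}^\top s_t\right), \qquad \bar{h} = \frac{1}{n} \sum_{j=1}^{n} h_j \tag{10}$$

Using the final probability distribution $P(w)$, we can then compute the loss function as the average negative log likelihood of the target word $w_t^*$ over all $N$ timesteps, following see-etal-2017-get,

$$\mathcal{L} = -\frac{1}{N} \sum_{t=1}^{N} \log P(w_t^*) \tag{11}$$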
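The mixing step and the loss can be sketched in a few lines. This is only an illustration in the style of pointer-generator networks: the function names, the toy vocabulary of four token ids, and the hidden states are assumptions, and a real implementation would operate on batched tensors.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def generation_probability(encoder_states, decoder_state):
    """Sigmoid of the dot product between the mean encoder state and the decoder state."""
    n = len(encoder_states)
    dim = len(decoder_state)
    mean_enc = [sum(h[d] for h in encoder_states) / n for d in range(dim)]
    return sigmoid(sum(m * s for m, s in zip(mean_enc, decoder_state)))


def final_distribution(p_vocab, p_copy, input_ids, p_gen):
    """Mix the vocabulary distribution with copy mass scattered onto input token ids."""
    final = [p_gen * p for p in p_vocab]
    for token_id, p in zip(input_ids, p_copy):
        final[token_id] += (1.0 - p_gen) * p
    return final


def average_nll(step_distributions, target_ids):
    """Average negative log likelihood of the target token over all decoding steps."""
    losses = [-math.log(dist[t]) for dist, t in zip(step_distributions, target_ids)]
    return sum(losses) / len(losses)


# Toy example: vocabulary of 4 ids; the input document is the id sequence [2, 0, 2].
p_vocab = [0.1, 0.2, 0.3, 0.4]
p_copy = [0.5, 0.25, 0.25]
p_gen = generation_probability([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
final = final_distribution(p_vocab, p_copy, input_ids=[2, 0, 2], p_gen=p_gen)
loss = average_nll([final], target_ids=[2])
print(p_gen, final, loss)
```

Token id 2 occurs twice in the input, so it receives the copy mass of both positions, which is why a salient, repeated mention can dominate the final distribution even when the vocabulary distribution is flat.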
4 Experimental Setup
4.1 Dataset and Evaluation Metric
Experiments are conducted on two English datasets: MUC-4 muc-1992-message for the role-filler entity extraction task and SciREX jain-etal-2020-scirex for the binary and 4-ary end-to-end relation extraction tasks. MUC-4 contains 1700 documents, with about 400 tokens per document on average. Documents are annotated with zero to multiple event templates. Following du-2020-grit’s pre-processing, we use a 13:2:2 split of the documents for train, development, and test, respectively. We evaluate the REE task on this dataset using the entity-level metric CEAF-REE du-2020-grit. The metric aligns predicted entities with gold entities using the Kuhn–Munkres algorithm kuhn1955hungarian; munkres1957algorithms, where a predicted entity is considered correct if and only if its corresponding mentions are a subset of the aligned gold entity’s mentions.
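The matching criterion can be made concrete with a small sketch for a single role. It is not the official CEAF-REE scorer: for brevity, an exhaustive search over one-to-one alignments stands in for the Kuhn–Munkres algorithm (both find an optimal alignment, but the exhaustive search is only practical for tiny inputs), and the function names and example entities are illustrative assumptions.

```python
from itertools import permutations


def is_match(pred_mentions, gold_mentions):
    """A predicted entity is correct if its mentions are a non-empty subset of the gold mentions."""
    return bool(pred_mentions) and set(pred_mentions) <= set(gold_mentions)


def max_matches(pred_entities, gold_entities):
    """Best one-to-one alignment score, found by exhaustive search."""
    p, g = len(pred_entities), len(gold_entities)
    best = 0
    if p <= g:
        for perm in permutations(range(g), p):
            score = sum(is_match(pred_entities[i], gold_entities[perm[i]]) for i in range(p))
            best = max(best, score)
    else:
        for perm in permutations(range(p), g):
            score = sum(is_match(pred_entities[perm[j]], gold_entities[j]) for j in range(g))
            best = max(best, score)
    return best


def precision_recall_f1(pred_entities, gold_entities):
    matched = max_matches(pred_entities, gold_entities)
    precision = matched / len(pred_entities) if pred_entities else 0.0
    recall = matched / len(gold_entities) if gold_entities else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Illustrative entities for one role; each entity is a list of mentions.
gold = [["a group of terrorists", "group of terrorists"],
        ["Zarate Wilka Armed Forces of Liberation", "FAL"]]
pred = [["group of terrorists"], ["FAL"], ["La Paz"]]
precision, recall, f1 = precision_recall_f1(pred, gold)
print(precision, recall, f1)
```

Here two of the three predicted entities align with gold entities, while the spurious third prediction lowers precision but not recall.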
The SciREX dataset (https://github.com/allenai/SciREX) consists of scientific articles, with entity, coreference, and relation annotations. With an average token count of about 5700, the articles are significantly longer than the documents in MUC-4. We use the pre-processed data from jain-etal-2020-scirex, which contains 306 documents for training, 66 for validation, and 66 for testing. In contrast to conventional relation extraction datasets, such as ACE05, relations are not typed in SciREX. Hence, the official SciREX evaluator jain-etal-2020-scirex only considers the correctness of predicted entities and entity types in each relation (there are 4 entity types: Material, Metric, Task, and Method). Predicted entities are aligned with gold entities based on mention overlap. When the entities are aligned, predicted relations are aligned with gold relations accordingly. A predicted relation is correct if and only if both the associated entities and the entity types match the aligned gold relation.
4.2 Baselines
We compare our method with the following competitive baseline systems.
NST du-cardie-2020-document builds multi-granularity representations of documents, and uses a gating mechanism to fuse representations of different granularities.
TANL paolini2021structured augments sequential labels with input sentences, allowing it to be applied to various structured prediction tasks. Since the source code of TANL had not been released when we conducted our experiments, we re-implemented it by closely following the method described in paolini2021structured.
GRIT du-2020-grit shares transformer parameters between the encoder and the pointer network decoder, and is the SOTA system for the REE task on the MUC dataset.
DyGIE++ wadden-etal-2019-entity is a span-based multi-task IE framework jointly trained on relation extraction, named entity recognition, and coreference resolution.
SciREX-P jain-etal-2020-scirex is the SOTA framework for end-to-end binary and 4-ary relation extraction on SciREX. The pipeline is composed of 4 components: mention identification, mention clustering, salient entity cluster identification, and relation classification.
In terms of the pre-trained language models used, BERT-base Devlin_2019 is used for NST, DyGIE++, and GRIT. SciREX-P is fine-tuned on SciBERT beltagy-etal-2019-scibert. We replace T5 JMLR:v21:20-074 with BART-base for TANL for a fair comparison with our method.
4.3 Implementation details
The proposed models are optimized using AdamW loshchilov2018decoupled with learning rate 5e-5 and weight decay 1e-5. We used grid search to find the best $k$ for TopK Copy and found that $k = 10$ yields the best overall performance across REE and RE. The maximum input sequence lengths for RE and REE are 1024 and 512, respectively. During inference, all generative models use beam search with a beam width of 4.
5 Results and Analysis
5.1 Main Results
Table 1 summarizes the main results on role-filler entity extraction, binary, and 4-ary relation extraction. TempGen establishes new state-of-the-art scores on all three tasks, outperforming the previous best models by an absolute F1 of 3.26%, 4.8%, and 2.7%. The improvements demonstrate the effectiveness of our approach in formulating document-level EE tasks into template generation tasks. Although TempGen scores the highest F1 across all three tasks, SciREX-P does achieve the highest recall on both RE tasks. This can be explained by the fact that our model can only encode the first 1024 sub-tokens of each SciREX document, which is merely 17% of the average sub-token count per document. This makes it challenging for TempGen to identify relations that lie in the latter 83% of each document. In the future, we can extend BART’s positional embedding matrix to enable TempGen to encode longer documents. Additionally, we set the maximum input sequence length to 512 for TempGen for fairer comparisons with SciREX-P. We obtain F1 scores of 11.94% and 2.18% on binary and 4-ary relation extraction, respectively. This confirms the advantage of our model on the relation extraction tasks.
While TANL performs worse than our model on REE, it still achieves a higher score than GRIT. This suggests that augmenting decoding targets with label names provides useful semantics, whereas adding input documents to decoding targets may not yield better results in the REE task. We also observe that TANL scores extremely low on both RE tasks, where 58% of the binary relations and 26% of the 4-ary relations in the decoding targets are filtered out because they exceed the maximum sequence length of BART. Of the remaining relations, 57% of the binary relations and 78% of the 4-ary relations have at least one entity removed from the decoding targets due to its long distance from the first-appearing entity (see Appendix B for more details), suggesting that TANL’s poor performance on RE tasks is due to the scarcity of gold labels. This indicates that TANL is ill-suited for document-level EE tasks.
| Model | REE | Binary RE | 4-ary RE |
|---|---|---|---|
| w/o TopK Copy | 55.76 | 12.63 | 3.00 |
| numeric slot name | 56.31 | 8.22 | 0.85 |
We observe extremely low performance across all systems on both SciREX tasks, even though TempGen outperforms the baseline systems significantly. This is mainly caused by the characteristics of the SciREX dataset. First, syntactic characteristics specific to scientific journals, such as algorithm blocks, result in unusually long sequences in the SciREX dataset despite best parsing efforts. Second, scientific journals make frequent use of table and figure captions. Since captions are not included as part of the input text, the number of accepted relations decreases drastically.
5.2 Performance Analysis
We conducted ablation studies by replacing the TopK Copy module with other copy mechanisms. Naive Copy refers to computing copy distributions with the attentions from all cross-attention heads. SAGCopy xu-etal-2020-self utilizes encoder self-attention to compute centrality scores for measuring the saliency of each input token. As shown in Table 2, we found that Naive Copy leads to a performance drop on all three tasks, especially on binary and 4-ary relation extraction. That Naive Copy scores even lower than fine-tuning BART alone (i.e., w/o TopK Copy) indicates that copy mechanisms may mislead models into copying incorrect input tokens. A qualitative example of the difference between TopK Copy and Naive Copy, shown in Figure 3, supports our hypothesis. Quantitatively, examining MUC-4 test set predictions, there are 79 cases where TopK Copy corrects the misguidance of Naive Copy, and only 32 cases where TopK Copy introduces new errors. For both REE and RE, adding SAGCopy leads to a performance drop, suggesting that the centrality scores of input tokens may not be an ideal feature for these tasks.
We also experimented with replacing the original slot names with numeric slot names (i.e., converting PerpInd to <ROLE_1>, PerpOrg to <ROLE_2>, etc.). This conversion removes the semantics of slot names in the decoding targets. While little performance drop was observed on the REE task, using numeric slot names resulted in the worst performance on the binary and 4-ary relation extraction tasks, which could be a result of stronger slot dependencies in RE than in REE. In RE, slots are directly semantically related to other slots in each template, whereas slots in REE are relatively independent. This shows that slot name semantics are useful for template generation tasks with strong slot dependencies in each template. Finally, we conducted ablation studies on different variations of templates as decoding targets. Specifically, three variations are tested on the REE task: (1) We merge entities of the same role name into the same “slot” (e.g., transforming the decoding targets from “ PerpInd Alice PerpInd Bob ” to “ PerpInd Alice; Bob ”). (2) Based on (1), all slot names, such as “PerpInd” and “PerpOrg”, are replaced with the same special token “<ROLE>”. (3) We use the same decoding targets as GRIT’s. These three settings achieve test set F1 scores of 56.65, 54.16, and 52.55, respectively. The results suggest that differentiating entities with different entity types helps improve the performance. Furthermore, comparing these with the results in Table 1, we found that GRIT performs better than our system under setting (3), indicating that a pointer network-based model, which has a smaller search space than ours, is more advantageous when using the same decoding targets.
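The target variations discussed above amount to simple transformations of a template's (slot name, mention) pairs. The sketch below illustrates the numeric-slot-name conversion and variations (1) and (2); the function names, the `role_order` argument, and the third example slot are assumptions for illustration, and the delimiter tokens are omitted.

```python
def numeric_slot_names(slots, role_order):
    """Replace each slot name with <ROLE_i>, where i is its 1-based position in role_order."""
    mapping = {name: f"<ROLE_{i}>" for i, name in enumerate(role_order, start=1)}
    return [(mapping[name], mention) for name, mention in slots]


def merge_same_role(slots):
    """Variation (1): group mentions that share a role into one slot, joined by '; '."""
    merged = {}
    for name, mention in slots:
        merged.setdefault(name, []).append(mention)
    return [(name, "; ".join(mentions)) for name, mentions in merged.items()]


def anonymize_roles(slots):
    """Variation (2): apply (1), then replace every slot name with the same <ROLE> token."""
    return [("<ROLE>", value) for _, value in merge_same_role(slots)]


slots = [("PerpInd", "Alice"), ("PerpInd", "Bob"), ("PerpOrg", "FAL")]
numeric = numeric_slot_names(slots, role_order=["PerpInd", "PerpOrg"])
merged = merge_same_role(slots)
anonymous = anonymize_roles(slots)
print(numeric, merged, anonymous)
```

Only the numeric and anonymized variants discard label semantics; the merged variant keeps the role names but shortens the target.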
Impact of the Amount of Training Data
To test the data efficiency of our approach, we compared TempGen and TempGen - TopK Copy with GRIT on the REE task using different amounts of MUC training data. As seen in Figure 5, both TempGen and TempGen - TopK Copy outperform GRIT across all settings, with a slightly larger performance margin in low-resource settings. This indicates that our approach is more data-efficient than the previous SOTA system on REE.
Impact of Cross-attention Heads
Figure 4 shows our model’s change in performance conditioned on various values of $k$ in the TopK Copy mechanism. Consistent with our results in Section 5.1, we see that removing some of the cross-attention heads (12 → 10) can lead to a performance gain, since the noise brought by unimportant attention heads is filtered out. However, we noticed a drop in performance across all three tasks for lower values of $k$, suggesting that beneficial cross-attention heads are removed. Interestingly, performance drops immediately as $k$ decreases below 10, suggesting that only a small portion of the cross-attention heads are unimportant. The trend is consistent with NEURIPS2019_2c601ad9’s results, where pruning cross-attention heads beyond a certain extent can easily result in a performance drop. Additionally, the model with no copy mechanism outperforms the model that uses only a few attention heads, suggesting that copy distributions obtained from insufficiently informative cross-attentions can mislead the model.
5.3 Qualitative Analysis
The following qualitative analysis provides intuition for our model’s ability to capture dependencies across entities and utilize slot name semantics.
To validate our approach’s capability to capture cross-entity dependencies, we considered binary relations on SciREX where at least one of the associated entities is involved in multiple relations. The dependencies among entities are better captured by the model that predicts fewer unlikely relations. Comparing the test set outputs of TempGen and SciREX-P, we see that 13131 errors made by SciREX-P are corrected by our model, which only introduces 604 errors. This result demonstrates the strength of TempGen in modeling cross-entity dependencies.
Importance of Label Semantics
Comparing the test set predictions between TempGen and GRIT on the MUC-4 REE task, we see that our approach better distinguishes confusing entities such as Victim and Target entities. As shown in the example in Figure 6, GRIT incorrectly predicts the two victims, “Miguel Soler Rodrigues” and “Martha Luz Lopez”, as Target entities. It also misidentifies “El Espectador”, a newspaper company, as a victim of the attack. In contrast, TempGen is able to recognize the roles of the two victims. Even though it’s not an exact match, the predicted Target entity had a correctly identified role type with similar semantic meaning compared to the gold label.
6 Inference time comparison
As discussed earlier, TempGen can significantly reduce the exponential computational complexity of document-level N-ary relation identification. To illustrate this, we compared the inference time of TempGen with that of two other systems, TANL and SciREX-P, on the SciREX 4-ary RE task. As shown in Figure 7, TempGen is around 39 times faster than SciREX-P at inference. TANL also runs much faster than SciREX-P, but is still around 4 times slower than TempGen. This is because TANL generates the entire input document in addition to entity and relation labels, producing sequences much longer than TempGen’s.
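To give a sense of why enumerating candidate relations becomes expensive, the sketch below counts the 4-ary candidates a classification-based pipeline would have to score if it considered every combination of one entity per type. Both that enumeration strategy and the entity counts are assumptions made for illustration; this text does not describe how SciREX-P builds its candidates.

```python
from math import prod

# The four SciREX entity types; one entity per type per candidate is an assumption.
ENTITY_TYPES = ("Material", "Metric", "Task", "Method")


def candidate_tuples(entities_per_type):
    """Number of 4-ary candidates when choosing one entity of each type."""
    return prod(entities_per_type[t] for t in ENTITY_TYPES)


# Hypothetical counts of extracted entities in one document.
counts = {"Material": 5, "Metric": 4, "Task": 3, "Method": 10}
n_candidates = candidate_tuples(counts)

# With n entities of every type, the candidate count grows as n ** 4.
growth = [n ** len(ENTITY_TYPES) for n in (5, 10, 20)]
print(n_candidates, growth)
```

A generative model sidesteps this enumeration because it emits only the relations it predicts, one template at a time.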
7 Number of Parameters
Figure 8 shows the number of parameters of different models. GRIT, which has the same size as BERT-base, has the fewest parameters among all models. DyGIE++ and SciREX-P have slightly more parameters than GRIT due to the additional linear layers for constructing classifiers. The two generative models, TANL and TempGen, have the most parameters, owing to the larger vocabulary size (30522 → 50265), the larger positional embedding matrix (512 → 1024), and the cross-attention modules in BART-base.
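The contribution of the two embedding matrices can be estimated with simple arithmetic. The sketch below assumes a hidden size of 768 for both base models and uses the row counts quoted above; it ignores the cross-attention modules and any implementation-specific offsets.

```python
HIDDEN_SIZE = 768  # hidden size of both BERT-base and BART-base


def embedding_params(rows, hidden=HIDDEN_SIZE):
    """Number of parameters in an embedding matrix with the given number of rows."""
    return rows * hidden


# Extra parameters from the larger vocabulary (30522 -> 50265 entries).
extra_vocab = embedding_params(50265) - embedding_params(30522)

# Extra parameters from the larger positional embedding matrix (512 -> 1024 positions).
extra_positions = embedding_params(1024) - embedding_params(512)

print(extra_vocab, extra_positions)
```

Under these assumptions the vocabulary accounts for roughly 15 million additional parameters, while the positional embeddings add fewer than half a million.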
8 Related Works
In the following sections, we first discuss a few important works on the REE task and the document-level RE task. Then, we discuss a few works that use a similar sequence generation approach for various document-level IE tasks.
8.1 Role-filler Entity Extraction
Document-level REE has been explored in recent works using a variety of model architectures. du2020document formulates the task as a sequence tagging problem, and trains layered classifiers as sequence readers on multiple granularities. In contrast, GRIT du-2020-grit formulates the problem as sequence generation, and employs a single transformer layer whose parameters are shared between encoder and decoder to enrich semantics in the shared parameters. A pointer selection network is used for the final layer of decoding.
8.2 Document-level Relation Extraction
Due to long-term dependencies that often span hundreds of tokens, capturing entity relations has proven to be a challenging task. One approach constructs a document-level graph from sentence encodings, then extracts entity relations from edge representations in the graph christopoulou-etal-2019-connecting. Other works, such as jia-etal-2019-document, layer classifiers in a pipeline architecture to obtain hierarchical representations of N-ary relations.
8.3 IE as Sequence Generation
Recently, there has been an increasing number of works framing information extraction tasks as a sequence generation problem. zeng-etal-2018-extracting framed triple extraction as a sequence generation task and adopted an RNN-based model with copy mechanisms. To encourage the faithfulness of the extracted triplets, ye2020contrastive designed a triplet contrastive training objective. These works focus on sentence-level triplet extraction, while our work extracts role-filler entities and entity relations at the document level. li2021documentlevel; hsu2021event formulate the document-level event argument extraction task as a conditional generation problem by providing an event ontology. However, their work cannot be applied to REE or RE due to the lack of an ontology for role-filler entities and relations. du-2020-grit relied on a pointer-network-based decoder 10.5555/2969442.2969540 to extract event role-filler entities, with the parameters of BERT Devlin_2019 shared between the encoder and the decoder. Nevertheless, their method cannot incorporate role labels, whereas our approach can take advantage of the label semantics.
paolini2021structured uses a very similar generative approach, which constructs decoder targets by inserting text markers and labels around entity mentions in the input sentence. The key idea is that augmenting the decoder targets with the original input sentence and labels provides stronger semantics to the model. Unfortunately, modeling cross-entity dependencies remains a challenge, as entities are further apart in their decoding targets. We instead transform annotations into template sequences as decoding targets, where distances between entities are significantly shortened. Thus, our approach alleviates the scalability challenge of capturing cross-entity dependencies at the scale of documents. Additionally, our decoder targets are significantly shorter, allowing the non-truncated decoder targets to fit in pre-trained language models. In contrast, for their method, the gold decoder targets are guaranteed to be longer than the corresponding input document. Since the length of the input is often greater than the maximum sequence length of pre-trained language models for document-level EE, a large portion of the gold labels will be skipped using paolini2021structured’s method.
We have proposed TempGen, a framework that frames document-level REE and RE tasks as a template generation task. A copy mechanism that takes the top-$k$ important cross-attentions as copy distributions is incorporated into BART for capturing key information in the input document. Experimental results on MUC-4 and SciREX showed that TempGen outperforms prior approaches on role-filler entity extraction and end-to-end document-level relation extraction tasks. Under different amounts of training data, TempGen demonstrates robustness across all settings, while being advantageous in lower-resource regimes.
We appreciate insightful feedback from PLUSLab members and the anonymous reviewers. This research was sponsored by the Intelligence Advanced Research Projects Activity (IARPA), via Contract No. 2019-19051600007. The views and conclusions of this paper are those of the authors and do not reflect the official policy or position of IARPA or the US government.
Appendix A REE Performance Breakdown
Table 3 demonstrates the per-role performance comparison between TempGen and other baselines. We observe that:
TempGen achieves the best precision across all roles.
Except for PerpInd, TempGen obtains substantial improvements in F1 over other baselines.
While TempGen has higher precision over GRIT in extracting PerpInd entities, it scores slightly lower in recall, leading to worse F1 performance.
Appendix B TANL Decoding Target Formulation
In this section, we illustrate how we formulate the TANL paolini2021structured decoding targets for REE and RE. The formulation for REE is simple due to its similarity to the NER task. We produce REE decoding targets exactly the same way as how NER decoding targets are formed in paolini2021structured. Given the REE example in Figure 2, the corresponding TANL decoding target is:
Two U.S. mormon missionaries – aged 19 and 21 – were shot to death last night by [a group of terrorists| PerpInd] from the [Zarate Wilka Armed Forces of Liberation| PerpOrg] (FAL) … blew up the [lines| Target] providing power to La Paz, … the U.S. citizens – [Todd Ray Wilson Burdenson| Victim] and Jeffrey Brent Ball – … they were killed with two bursts of [machinegun| Weapon] fire …
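A target of this shape can be produced by wrapping each annotated mention in place. The sketch below is a simplified illustration rather than the TANL implementation: it wraps only the first occurrence of each mention, skips overlapping spans, and uses hypothetical function and variable names.

```python
def tanl_ree_target(text, annotations):
    """Wrap the first occurrence of each annotated mention as "[mention | role]".

    annotations: list of (mention, role) pairs. Mentions that do not occur
    in the text are ignored, and overlapping spans are skipped.
    """
    spans = []
    for mention, role in annotations:
        start = text.find(mention)
        if start != -1:
            spans.append((start, start + len(mention), role))

    out, cursor = [], 0
    for start, end, role in sorted(spans):
        if start < cursor:  # overlaps a span that was already wrapped
            continue
        out.append(text[cursor:start])
        out.append(f"[{text[start:end]} | {role}]")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


text = ("were shot to death last night by a group of terrorists from the "
        "Zarate Wilka Armed Forces of Liberation (FAL)")
annotations = [("a group of terrorists", "PerpInd"),
               ("Zarate Wilka Armed Forces of Liberation", "PerpOrg")]
target = tanl_ree_target(text, annotations)
print(target)
```

Since the whole input is reproduced, the target is always longer than the input, which is the property that makes this formulation hard to fit into a fixed maximum sequence length for long documents.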
As for RE, we follow how paolini2021structured handles nested entities and multiple relations, but we made a small modification on decoding targets. Since SciREX does not contain relation type annotation, we use the related entities’ types as the relation type in the decoding targets. With their formulation, the decoding target is created by inserting each relation annotation around the first-appearing entity in the input document. Take the RE instance in Figure 2 as an example. The corresponding TANL decoding target would be:
Introduction: [Natural language inference | Task | Method = aESIM] (NLI) is an important and significant task in natural language processing (NLP)…
Method: We … denote the modified ESIM as aESIM …
Experiments: The [accuracy | Metric | Material = Quora] (ACC) of each method is measured by the commonly used precision score … It also achieved 88.01% on Quora …
Appendix C Hardware and Software configurations
Appendix D Implementation Details
We conducted a grid search over candidate learning rates using TempGen w/o TopK Copy on the MUC-4 REE task. The best learning rate, 5e-5, is fixed for all other experiments. Models are trained for 150 epochs for REE and binary RE experiments, and 50 epochs for 4-ary RE experiments. To reproduce our results, please follow the README.md file in https://github.com/PlusLabNLP/TempGen. The weights of the trained models are also included for reproduction purposes.