Towards Effective Multi-Task Interaction for Entity-Relation Extraction: A Unified Framework with Selection Recurrent Network

02/15/2022
by   An Wang, et al.

Entity-relation extraction aims to jointly solve named entity recognition (NER) and relation extraction (RE). Recent approaches use either one-way sequential information propagation in a pipeline manner or two-way implicit interaction with a shared encoder. However, they still suffer from poor information interaction due to the gap between the different task forms of NER and RE, raising a controversial question whether RE is really beneficial to NER. Motivated by this, we propose a novel and unified cascade framework that combines the advantages of both sequential information propagation and implicit interaction. Meanwhile, it eliminates the gap between the two tasks by reformulating entity-relation extraction as unified span-extraction tasks. Specifically, we propose a selection recurrent network as a shared encoder to encode task-specific independent and shared representations and design two sequential information propagation strategies to realize the sequential information flow between NER and RE. Extensive experiments demonstrate that our approaches can achieve state-of-the-art results on two common benchmarks, ACE05 and SciERC, and effectively model the multi-task interaction, which realizes significant mutual benefits of NER and RE.

1 Introduction

Entity-relation extraction is a fundamental problem in Information Extraction (IE). It aims to both identify named entities and extract relations between them. This problem can be decomposed into two sub-tasks: named entity recognition (NER) Sang and De Meulder (2003); Zhang et al. (2004); Ratinov and Roth (2009) and relation extraction (RE) Zelenko et al. (2003); Bunescu and Mooney (2005). Figure 1 shows an example of entity-relation extraction.

Early works Sang and De Meulder (2003); Florian et al. (2004) typically adopt a pipeline framework by solving each sub-task with sequential modules. They first recognize all the entities in a sentence and then perform relation classification for each entity pair. These traditional pipeline methods ignore the interaction between NER and RE.

Some recent works adopt a cascading pipeline Yu et al. (2020); Wei et al. (2020) which can develop a unified label space by reformulating NER and RE as closer sub-tasks: first extract subject entities, then extract corresponding object entities. A fatal shortcoming of these methods is that they cannot deal with out-of-triple entities (examples shown in Figure 1) which do not appear in any relation.

Current approaches aim to jointly solve the two sub-tasks to take advantage of the benefits of their inter-task correlation. Specifically, some of them cast NER and RE as a joint table filling problem Miwa and Sasaki (2014); Gupta et al. (2016); Zhang et al. (2017); Tran and Kavuluru (2019); Wang and Lu (2020); Wang et al. (2020b, 2021). This allows NER and RE to be performed in one stage through implicit multi-task interaction, realized by shared feature space. However, these works ignore the natural sequential information propagation between NER and RE.

Moreover, PURE Zhong and Chen (2021) revisits pipeline methods and realizes the sequential information flow between NER and RE through inserting typed entity markers into the input text of the RE model, which improves the performance of RE. They empirically find that using a shared encoder to model implicit interaction does not improve but rather impairs the performance of NER. We attribute the reason to the fact that they did not consider the contradictory information in the shared features caused by the natural gap between NER and RE. In contrast, Yan et al. (2021) considers this problem and proposes a partition filter network (PFN) to improve the joint table-filling methods by separating task-specific independent information and shared information from the shared representations. They claim the opposite finding that through implicit interaction, RE contributes non-negligibly to NER. However, they still find that RE disturbs the extraction of out-of-triple entities which in essence do not correspond to any relations, resulting in the overall lower NER results compared with Zhong and Chen (2021). Motivated by these findings, we seek to explore more effective multi-task interaction to alleviate the potential negative impacts of RE and make RE really beneficial to NER.

In this study, we propose a novel and unified cascade framework which eliminates the gap between different task forms. We treat entity-relation extraction as a multi-turn span extraction problem that includes entity extraction, subject extraction and subject-oriented object extraction. We employ a separate task model for each turn. In this way, we regard NER as entity extraction and decompose RE into subject extraction and subject-oriented object extraction. For each task, we employ a BERT-based embedder Devlin et al. (2019) and a shared multi-task encoder to obtain token representations in the sentence. Then we feed these representations into separate task models to perform span extraction.

Towards effective multi-task interaction between NER and RE, we consider two kinds of interactions: implicit interaction and sequential information propagation. When modeling implicit interaction, to prevent irrelevant information from hurting model performance, we propose a selection recurrent network (SRN) encoder to learn task-specific independent and shared representations for the three tasks. For sequential information propagation, we aim to maintain the natural information flow between NER and RE by incorporating the output information from NER into the input of RE. To this end, we propose two strategies for fusing such prior information in our three-task framework: early fusion and late fusion. The early fusion strategy fuses the prior information into the subsequent task models by inserting subject and entity type markers into the input sentence, but it incurs a large computational cost because the text must be encoded separately for each task. The late fusion strategy solves this issue by reusing the same encoded representation for all the tasks and propagating the prior information as an intermediate entity/subject embedding.

Figure 1: Examples of entity-relation extraction. primal-dual algorithm and motion segments are in-triple entities connected in USED-FOR relation. classical motion segmentation and optical flow are out-of-triple entities that do not correspond to any relations.

We evaluate our method on two public datasets: ACE05 Walker et al. (2006) and SciERC Luan et al. (2018). Experimental results demonstrate that our method outperforms previous state-of-the-art models on both benchmarks. In particular, it outperforms PURE Zhong and Chen (2021) on RE by 1.5%/3.1% and outperforms PFN Yan et al. (2021) on NER by 0.8%/2.4% on ACE05/SciERC respectively. We also perform detailed analysis experiments to verify whether our framework can really achieve the mutual benefits of NER and RE. We observe that maintaining sequential information flow is especially crucial to the performance of both NER and RE. Moreover, compared with PFN Yan et al. (2021), which also considers implicit interaction, our method achieves remarkable improvement on NER, particularly for out-of-triple entity extraction. This also confirms that our sequential information propagation with the unified task form can eliminate the negative impacts RE imposes on NER. The above observations support the superiority of our framework in the effective interaction of NER and RE.

2 Related Work

Entity-relation extraction consists of two sub-tasks: named entity recognition (NER) and relation extraction (RE). Traditionally, early pipeline methods Sang and De Meulder (2003); Florian et al. (2004); Chan and Roth (2011) address it in two separate steps where NER and RE models are trained separately, which ignores the interaction between NER and RE. Recently, Zhong and Chen (2021) proposed to fuse entity information to the RE model, realizing the sequential interaction between NER and RE. Their result reflects the direct benefits of NER to RE, which is reasonable because NER reduces the learning burden of the RE model. However, their method does not consider the potential benefits that RE may contribute to NER.

Another branch of work proposes cascade methods Yu et al. (2020); Wei et al. (2020). These methods decompose relation triple extraction into subject entity extraction and object entity extraction, both formulated as span extraction. This leads to a unified label space. However, these works cannot deal with out-of-triple entity extraction. Our framework is partially motivated by this task setting, but we additionally incorporate an entity extraction model to enable the extraction of out-of-triple entities.

Both pipeline methods and decomposition-based methods suffer from error propagation and exposure bias problems. To address this, recent works seek to jointly solve the extraction of entities and relations. Some of them Miwa and Sasaki (2014); Gupta et al. (2016); Zhang et al. (2017); Tran and Kavuluru (2019); Wang and Lu (2020); Wang et al. (2020b, 2021) cast NER and RE as a table filling problem. They attempt to capture the implicit interaction between NER and RE by sharing the same encoder and weights for simultaneous prediction of entities and relations. However, they do not consider the possible contradiction in the shared information between entity and relation extraction due to the different task forms, which may harm the performance of individual tasks. Yan et al. (2021) alleviated this issue with a partition filter encoder to model two-way interaction between NER and RE. The encoder partitions the shared representation into two task-specific representations for NER and RE along with one shared representation. However, the partition operation only applies to the two-task situation, which limits the design of more sub-tasks. In contrast, we propose a selection recurrent network which can handle the case of more than two sub-tasks and work well for our proposed task formulation.

In our work, we successfully combine the advantages of both implicit joint interaction and sequential interaction while alleviating their weaknesses.

3 Methodology

3.1 Problem Formulation

Given a sentence consisting of tokens , the first aim is named entity recognition (NER), to extract all the entities from . An entity is denoted as where and are the start and end token of an entity and denotes the entity type of data . The second objective is relation extraction (RE), to identify relations between the recognized entities. We use to represent a relation triple, which means entities and have the relation of relation type . We regard as the subject and as the object in a relation triple.

Because some relation triples share the same entity as subjects, we model relations as functions that map subjects to objects given a specific relation type , instead of performing relation classification directly. Therefore, our framework reformulates NER and RE as a multi-turn span extraction problem, including the following turns: entity extraction (EE), subject extraction (SE) and subject-oriented object extraction (SOE). We solve these three sub-tasks in a cascading manner. After obtaining subjects via SE and their corresponding objects through SOE, we transform them into relation triples .
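The final step of the cascade, turning SE and SOE outputs into relation triples, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the `Entity` and `Triple` containers, the function name `assemble_triples`, and the choice to drop subject or object spans that EE did not recognize are all assumptions made for the example.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    start: int  # index of the first token of the span
    end: int    # index of the last token of the span (inclusive)
    type: str   # entity type label


class Triple(NamedTuple):
    subject: Entity
    relation: str
    obj: Entity


def assemble_triples(entities, subjects, objects_by_subject):
    """Combine EE, SE and SOE outputs into relation triples.

    entities: list of Entity produced by entity extraction (EE).
    subjects: list of (start, end) spans produced by subject extraction (SE).
    objects_by_subject: dict mapping a subject span to a list of
        (relation_type, (start, end)) pairs produced by SOE.
    """
    by_span = {(e.start, e.end): e for e in entities}
    triples = []
    for s_span in subjects:
        subj = by_span.get(s_span)
        if subj is None:
            continue  # assumption: ignore subjects that EE did not recognize
        for relation, o_span in objects_by_subject.get(s_span, []):
            obj = by_span.get(o_span)
            if obj is not None:
                triples.append(Triple(subj, relation, obj))
    return triples
```

Entities that never appear as a subject or an object are still returned by EE, which is how out-of-triple entities are kept in this formulation.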

3.2 Overview of the Framework

We propose a unified cascade entity-relation framework which eliminates the gap between different task forms by using unified span extraction. It consists of three modules for EE, SE, and SOE. In all the modules, we first employ the same pre-trained BERT Devlin et al. (2019) encoder to obtain the contextual representations of the tokens in the sentence. Then we propose an SRN (Selection Recurrent Network) encoder which models implicit interaction and learns how to separate task-specific independent information and inter-task shared information in the token representations. We then feed these task-specific representations from SRN to three span extraction models to obtain entity spans, subject spans, and the corresponding object spans, respectively.

To maintain the sequential information flow between the sub-tasks, we also design two strategies for fusing the output information predicted by preceding span extraction modules into subsequent modules. To be more specific, when performing subject extraction, we make the module aware of the entity information extracted in EE. During SOE, the module is fed with both the entity information and subject information. We describe the details of sequential information propagation strategies in Section 3.4. The overview of our framework with the two strategies is illustrated in Figure 3.

3.3 Selection Recurrent Network

Figure 2: The architecture of SRN. It is an LSTM-like network that encodes sequence information with cell states and hidden states which store intermediate information. At each time step, we perform selection operation to select task-relevant information from the intermediate information and calculate task-specific and task-shared representations as the inputs of task-specific modules.

We first introduce our proposed selection recurrent network (SRN) which is essential to learning the implicit interaction among multiple tasks. SRN is a novel RNN-based feature encoder. As shown in Figure 2, this network is designed to jointly extract task-specific independent information and shared information from token representations. Similar to the standard LSTM, for each time step , in addition to a hidden state , we use an additional cell state to store intermediate memories. We use a forget gate to control the forgetting operation for history cell state , and use an output gate to generate hidden state from the current cell state . We calculate the current state and filtered history state with two gates as follows:

(1)
(2)

where is the input token representation for the current time step and indicates sigmoid function. After obtaining and , we perform selection operations to generate three different filtered memories which store intra-task information and four different shared memories that store inter-task information.

Firstly, we calculate three selection master gates , , for three sub-tasks. Master gates will select task-related neurons from state representations. An element of these three gates means whether the corresponding neuron belongs to the specific task. Similarly to computing forget gates and output gates, we generate the three master gates from history hidden state and input .

(3)

Then, based on the master gates, we calculate four shared gates , , , and three independent task gates , , . For example, represents shared gate for entity extraction and object extraction, while represents shared gate for all tasks. Each element in the four shared gates indicates whether the corresponding neuron is shared by multiple tasks, while each element of these independent gates means whether the corresponding neuron only belongs to the specific task.

(4)
(5)
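The bodies of Equations (4) and (5) are not reproduced above, so the following sketch only illustrates one plausible way to derive four shared gates and three independent gates from three master gates. The product-based composition used here is an assumption made for illustration and is not confirmed by the text; the function name `split_gates` and the keys `e`, `s`, `o` (entity, subject, object) are likewise hypothetical.

```python
def split_gates(m_e, m_s, m_o):
    """Derive shared and independent gates from three master gates.

    m_e, m_s, m_o: lists of per-neuron master gate values in [0, 1] for
    entity, subject and object extraction.

    Assumption (not taken from the paper): a neuron's membership in a
    group of tasks is the product of the master gates of the tasks in
    the group and of (1 - gate) for the tasks outside the group.
    """
    names = ("e", "s", "o", "es", "eo", "so", "eso")
    gates = {name: [] for name in names}
    for a, b, c in zip(m_e, m_s, m_o):
        gates["e"].append(a * (1 - b) * (1 - c))   # entity only
        gates["s"].append((1 - a) * b * (1 - c))   # subject only
        gates["o"].append((1 - a) * (1 - b) * c)   # object only
        gates["es"].append(a * b * (1 - c))        # entity + subject
        gates["eo"].append(a * (1 - b) * c)        # entity + object
        gates["so"].append((1 - a) * b * c)        # subject + object
        gates["eso"].append(a * b * c)             # all three tasks
    return gates
```

Under this assumed composition, the seven gate values of a neuron never sum to more than one, so each neuron's capacity is divided among the task groups instead of being counted several times.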

As illustrated in Figure 2, we also compute the master gates, shared gates and independent task gates for the filtered history state following exactly the same procedure as Equations (3), (4), and (5). Hence we obtain , , , , , , .

Then, we compute the independent memories and shared memories to store the task-specific and shared information in the token representation to achieve flexible interaction among three sub-tasks.

(6)
(7)

Information stored in independent memories is inaccessible for other tasks. Instead, information stored in shared memories can be leveraged by different tasks. For example, stores the shared information between entity extraction and subject extraction. stores the shared information of all three tasks.

Combining filtered memories and shared memories, we can update the cell state and hidden state for the next time step:

(8)
(9)

Meanwhile, we generate task-specific independent token representations , , and shared token representations , , and from the corresponding memories:

(10)

3.4 Sequential Information Propagation

(a) Early fusion
(b) Late fusion
Figure 3: An overview of our framework with early fusion and late fusion strategies. Different color indicators represent different processes in the sequential framework. During inference, prior entity information and prior subject information for the subject and object models are obtained from the output of the entity and subject models, as shown in this figure. During training, however, we directly use ground-truth entity label information as the prior information, as described in Section C. In figure (a), “text” is the original input sentence, “text*” is text with inserted entity markers, and “text**” is text with entity markers and subject markers. In figure (b), the entity, subject, and object models share the same text representation for “text” encoded by BERT and SRN.

To maintain the sequential information flow between different tasks, we design two strategies to fuse the entity information in previous modules, namely, early fusion and late fusion.

3.4.1 Early Fusion

Inspired by PURE Zhong and Chen (2021), for the early fusion method, we insert pre-defined text markers into the input text right before and after the entities or the given subject. We define text markers as [S:S] and [S:E] for the start and end of a subject respectively, then insert them before and after the subject span to highlight the subject. By inserting pre-defined subject text markers, the model will not be confused by multiple potential subjects in the sentence. To highlight the entity with type , we define text markers as [_S] and [_E] for the start and end of an entity span respectively, then insert them before and after the entity span. With these text markers inserted in the input text, each of the span-extraction modules can easily understand the entity/subject information predicted by previous modules. As shown in Figure 3 (a), the input sentences are updated with new text markers following the sequential prediction of the three task modules. For each module, only the task-specific token representation is obtained from the SRN encoder. For example, for entity extraction, token has one independent representation and three shared representations , , . After concatenating these representations, we get the final task-specific representation . In the same way, we get , . We then perform span extraction on these representations to obtain entity spans, subject spans, and object spans, respectively.
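The marker insertion described above can be sketched as a small token-level function. The subject markers [S:S] and [S:E] come from the text; the concrete typed-marker spelling `[TYPE_S]` / `[TYPE_E]`, the nesting order (subject markers outermost), the function name `insert_markers`, and the example entity types are assumptions made for illustration.

```python
def insert_markers(tokens, entities, subject=None):
    """Return a new token list with entity and subject markers inserted.

    tokens: list of str.
    entities: list of (start, end, type) with inclusive token indices.
    subject: optional (start, end) span highlighted with [S:S] / [S:E].
    """
    before = {}  # token index -> markers emitted before that token
    after = {}   # token index -> markers emitted after that token
    for start, end, etype in entities:
        before.setdefault(start, []).append(f"[{etype}_S]")
        after.setdefault(end, []).insert(0, f"[{etype}_E]")
    if subject is not None:
        s_start, s_end = subject
        # Assumption: subject markers wrap around the typed entity markers.
        before.setdefault(s_start, []).insert(0, "[S:S]")
        after.setdefault(s_end, []).append("[S:E]")
    marked = []
    for i, token in enumerate(tokens):
        marked.extend(before.get(i, []))
        marked.append(token)
        marked.extend(after.get(i, []))
    return marked


tokens = ["primal-dual", "algorithm", "for", "motion", "segments"]
entities = [(0, 1, "Method"), (3, 4, "Task")]  # types are illustrative
text_star = insert_markers(tokens, entities)               # input of the subject module
text_star2 = insert_markers(tokens, entities, (0, 1))      # input of the object module
```

Because the marked sequence differs for every candidate subject, each one has to be re-encoded, which is the cost that motivates the late fusion strategy.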

3.4.2 Late Fusion

The weakness of the early fusion method is that we need to encode the sentence again for each turn of subject extraction and object extraction, because each time the input sentence is modified by inserting different text markers. To reduce the time expense, we propose an alternative late fusion strategy. The key idea is to reuse the outputs of the encoder so that text embedding and encoding are not repeated. As shown in Figure 3 (b), towards this end, we assign each token a subject tag to denote whether this token belongs to a subject span and meanwhile assign an entity tag to denote whether this token belongs to a certain type of entity or does not belong to any entity span. We construct the entity-type embedding table as , where is the dimension of the entity type embedding. We construct the subject embedding table as . Next, given the information of entity and subject prediction, we can look up the two embedding tables to obtain corresponding entity embedding and subject embedding for each token . We concatenate the task-specific representations from the SRN encoder with the entity embeddings as the input of the subject extraction module. For object extraction, we concatenate SRN outputs with both entity embeddings and subject embeddings.

(11)

Finally, we will perform span extraction given these input representations to obtain entity/subject/object spans.
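The lookup-and-concatenate step of late fusion can be sketched with plain Python lists standing in for tensors. The function and argument names, the "O" tag for tokens outside any entity, and the toy embedding values are assumptions for illustration; in a real model the tables would be learned parameters.

```python
def late_fusion_inputs(subj_repr, obj_repr, entity_tags, subject_tags,
                       entity_table, subject_table):
    """Build the inputs of the subject and object modules under late fusion.

    subj_repr, obj_repr: per-token task-specific vectors from the shared
        encoder (lists of lists of floats), computed once per sentence.
    entity_tags: per-token entity tag (an entity type, or "O" for none).
    subject_tags: per-token subject tag (1 inside the subject span, else 0).
    entity_table, subject_table: dicts mapping a tag to its embedding.
    """
    subject_inputs, object_inputs = [], []
    for h_s, h_o, e_tag, s_tag in zip(subj_repr, obj_repr,
                                      entity_tags, subject_tags):
        e_emb = entity_table[e_tag]
        s_emb = subject_table[s_tag]
        # Subject module: encoder output + entity embedding.
        subject_inputs.append(h_s + e_emb)
        # Object module: encoder output + entity and subject embeddings.
        object_inputs.append(h_o + e_emb + s_emb)
    return subject_inputs, object_inputs


entity_table = {"O": [0.0, 0.0], "Method": [1.0, 0.0]}  # toy values
subject_table = {0: [0.0], 1: [1.0]}                    # toy values
subj_in, obj_in = late_fusion_inputs(
    [[0.1], [0.2]], [[0.3], [0.4]],
    ["Method", "O"], [1, 0],
    entity_table, subject_table,
)
```

Only the cheap embedding lookups change between candidate subjects; the encoder outputs are reused as they are.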

3.5 Span Extraction Model

Entity span extraction, subject span extraction, and object span extraction are all performed by a span extraction model with a unified architecture inspired by the span selection in machine reading comprehension Li et al. (2020). Specifically, we first adopt two identical binary sequence classifiers to detect the start and end positions of entities/subjects/objects respectively by assigning each token a binary tag (0/1) that indicates whether the current token corresponds to a start or end position of a target span. Then, we use a token-pair classifier to match the detected start and end positions as a span.
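The decoding side of this span extraction scheme can be sketched as below. The classifiers themselves are abstracted away: `start_probs` and `end_probs` stand for the outputs of the two binary sequence classifiers and `match_score` for the token-pair classifier. The 0.5 threshold and all names are assumptions for illustration.

```python
def decode_spans(start_probs, end_probs, match_score, threshold=0.5):
    """Decode spans from start/end tags and a start-end matching score.

    start_probs, end_probs: per-token probabilities of being a span start
        or a span end.
    match_score: callable (i, j) -> probability that start i and end j
        delimit the same span.
    """
    starts = [i for i, p in enumerate(start_probs) if p > threshold]
    ends = [j for j, p in enumerate(end_probs) if p > threshold]
    spans = []
    for i in starts:
        for j in ends:
            # A valid span cannot end before it starts.
            if i <= j and match_score(i, j) > threshold:
                spans.append((i, j))
    return spans


# Toy matcher that only accepts two-token spans.
spans = decode_spans(
    [0.9, 0.1, 0.8, 0.2],
    [0.1, 0.7, 0.1, 0.9],
    lambda i, j: 0.9 if j - i == 1 else 0.1,
)
```

Matching start and end candidates explicitly, instead of pairing each start with the nearest end, lets the model return nested or overlapping spans.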

4 Experiments

4.1 Experiment Settings

4.1.1 Datasets

We evaluate our methods on two standard entity-relation extraction benchmarks: ACE05 Walker et al. (2006) and SciERC Luan et al. (2018). ACE05 is collected from various domains, including newswire and online forums. SciERC is collected from AI paper abstracts and contains annotations of scientific entities and their relations.

4.1.2 Evaluation metrics

We evaluate our models under strict evaluation protocol following previous work Zhong and Chen (2021); Yan et al. (2021) and use micro F1 measure as the evaluation metric. For NER, a predicted entity is considered as a correct prediction if its span boundaries and the predicted entity type are both correct. For RE, a predicted relation is considered as correct only if the boundaries and entity type of both subject and object spans are correct and the predicted relation type is correct.
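The strict matching criterion and the micro F1 measure can be sketched as a set comparison. The tuple layouts and the function name `micro_f1` are assumptions for illustration: an entity is a `(start, end, type)` tuple and a relation is a `(subject_entity, relation_type, object_entity)` tuple, so a relation only matches when both typed entity spans and the relation type agree.

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over a corpus under exact (strict) matching.

    gold, pred: lists with one set per sentence. Each set holds hashable
        items, e.g. (start, end, type) for entities or
        ((s_start, s_end, s_type), relation, (o_start, o_end, o_type))
        for relations. An item counts as correct only on exact equality.
    """
    true_positives = sum(len(g & p) for g, p in zip(gold, pred))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    precision = true_positives / n_pred if n_pred else 0.0
    recall = true_positives / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold_entities = [{(0, 1, "Method"), (3, 4, "Task")}]
pred_entities = [{(0, 1, "Method"), (3, 4, "Method")}]  # right span, wrong type
score = micro_f1(gold_entities, pred_entities)
```

In the example the second prediction has the right boundaries but the wrong type, so it counts as both a false positive and a missed gold entity.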

4.2 Main Results

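| Method | Embedder | ACE05 NER | ACE05 RE | SciERC NER | SciERC RE |
|---|---|---|---|---|---|
| Structured Perceptron Li and Ji (2014) | - | 80.8 | 49.5 | - | - |
| SPTree Miwa and Bansal (2016) | Word2vec | 83.4 | 55.6 | - | - |
| Multi-turn QA Li et al. (2019) | Bert-large | 84.8 | 60.2 | - | - |
| SPE Wang et al. (2020a) | SPE | 87.2 | 63.2 | - | - |
| SPE Wang et al. (2020a) | SciBERT | - | - | 68.0 | 34.6 |
| Table-Sequence Wang and Lu (2020) | ALBERT | 89.5 | 64.3 | - | - |
| PURE Zhong and Chen (2021) | ALBERT | 89.7 | 65.6 | - | - |
| PURE Zhong and Chen (2021) | SciBERT | - | - | 66.6 | 35.6 |
| PFN Yan et al. (2021) | ALBERT | 89.0 | 66.8 | - | - |
| PFN Yan et al. (2021) | SciBERT | - | - | 66.8 | 38.4 |
| Cascade-SRN (Late fusion) | ALBERT | 89.4 | 65.9 | - | - |
| Cascade-SRN (Late fusion) | SciBERT | - | - | 68.6 | 36.7 |
| Cascade-SRN (Early fusion) | ALBERT | 89.8 | 67.1 | - | - |
| Cascade-SRN (Early fusion) | SciBERT | - | - | 69.2 | 38.7 |

Table 1: Main results (%): F1 scores on test splits of ACE05 and SciERC. Results of PURE are reported in single-sentence setting for fair comparison.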

4.2.1 Results and analysis

Table 1 compares our methods with previous results. We report F1 scores in the single-sentence setting. Cascade-SRN with the late fusion strategy shows competitive performance on NER and RE compared to previous methods, while Cascade-SRN with the early fusion strategy outperforms all the baselines. We attribute the better F1 score of early fusion to the fact that it makes the BERT-based embedder and the SRN encoder aware of the prior information, which gives the model a larger learning space. Because the method with early fusion achieves higher performance, we regard it as our default model. Compared with the previous best pipeline method PURE Zhong and Chen (2021), our model improves RE F1 by 3.1% on SciERC and 1.5% on ACE05, and improves NER F1 by 2.6% on SciERC but only 0.1% on ACE05. This shows that compared with the pipeline model with a separate encoder, our method mainly improves RE performance. Compared with the previous best joint method PFN Yan et al. (2021), our model improves NER F1 by 0.8% on ACE05 and 2.4% on SciERC, and improves RE F1 by 0.3% on both ACE05 and SciERC.

Combining these observations, we find that our method achieves more balanced performance on NER and RE compared with previous pipeline and joint methods. Because previous pipeline methods do not model the mutual interaction between NER and RE, they do not leverage NER information to help RE prediction. Meanwhile, the previous joint method cannot deal with the dynamic between in-triple and out-of-triple entity prediction, so importing RE information sometimes even hurts the final NER score. Our method averts these two drawbacks and successfully makes RE information beneficial to entity prediction.

4.3 Ablation Study

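| Ablation | Settings | NER | RE |
|---|---|---|---|
| Encoder | SRN | 69.2 | 38.7 |
| Encoder | BiSRN | 69.1 | 36.8 |
| Encoder | BiLSTM | 68.2 | 36.5 |
| Text Marker | Entity+Subject | 69.2 | 38.7 |
| Text Marker | Subject | 66.2 | 35.6 |
| Training strategy | golden entity | 69.2 | 38.7 |
| Training strategy | golden subject | 68.1 | 33.7 |
| Joint Vs. Independent | Joint | 69.2 | 38.7 |
| Joint Vs. Independent | Independent | 67.5 | - |

Table 2: Ablation study results.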

| Model | SciERC In-triple (P / R / F1) | SciERC Out-of-triple (P / R / F1) | ACE05 In-triple (P / R / F1) | ACE05 Out-of-triple (P / R / F1) |
|---|---|---|---|---|
| PFN Yan et al. (2021) | 78.0 / 71.1 / 74.4 | 38.9 / 61.7 / 47.8 | 95.9 / 92.1 / 94.0 | 85.8 / 86.9 / 86.3 |
| Cascade-SRN (late fusion) | 82.1 / 71.7 / 76.5 | 39.5 / 66.2 / 49.5 | 95.7 / 91.9 / 93.8 | 86.2 / 87.8 / 87.0 |
| Cascade-SRN (early fusion) | 82.0 / 70.3 / 75.7 | 44.5 / 61.5 / 51.6 | 95.4 / 92.3 / 93.8 | 86.5 / 88.1 / 87.3 |

Table 3: Entity prediction results on different entity groups. Entities are split into two groups: In-triple and Out-of-triple based on whether they appear in relation triples or not.

In this section, we perform ablation studies on the SciERC dataset to analyze the effects of our method with respect to different aspects: Encoder, Text Marker, Training Strategy, and Joint Vs. Independent. Table 2 reports the ablation study results.

4.3.1 Encoder

We replace our SRN encoder with Bidirectional SRN or BiLSTM. Our SRN encoder is a unidirectional encoder which only runs in the forward direction. Bidirectional SRN includes two SRN encoders, one in the forward direction and the other in the backward direction. We observe that replacing unidirectional SRN with Bidirectional SRN does not bring a performance boost. When we replace SRN with BiLSTM, the experiment results show an obvious drop in NER and RE performance. This supports the effectiveness of the SRN model and the importance of filtering irrelevant information in the shared representation.

4.3.2 Text Marker

In our early fusion strategy for sequential information propagation, we insert text markers into the input sentence of each span extraction module. Here we aim to validate the effects of entity markers which maintain the sequential information flow between NER and RE. We compare two settings: Entity+Subject Marker and Subject Marker. In the former setting, we insert entity type markers into the input sentences of the subject module, and for the input of the object module, both entity markers and subject markers are inserted. This is the default setting in our method. In the latter setting, we feed the original sentence without entity markers into the subject module and only insert subject markers into the input sentence of the object module. This setting removes the information propagation from NER to RE. We observe that the first setting outperforms the second one by 3.0% on NER F1 and 3.1% on RE F1, supporting the effectiveness of our fusion strategy and the importance of sequential information propagation between NER and RE. Interestingly, without this one-way information flow, the NER performance also decreases substantially. A possible reason is that without entity information, the RE (subject and object) modules have to do the extra work of entity extraction implicitly by themselves, which can disturb the training of the NER module since the gradients are back-propagated through a shared encoder. Note that we cannot remove subject markers, because our object module is subject-oriented.

4.3.3 Training Strategy

When training object extraction modules, our default setting is to regard all ground-truth entities in the training dataset as candidate subjects, as described in Section C. Here we test the performance of using only ground-truth subjects. Results in Table 2 show that only using golden subjects degrades the RE F1 score by 5%. This suggests that using all gold entities can alleviate the exposure bias problem of the object module by enabling the model to learn how to identify incorrect subject spans.
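The difference between the two training settings can be sketched as the way training examples for the object module are built. This is an illustrative reading of the description above, not the authors' code: the function name, the span-only entity representation, and the convention that a non-subject candidate receives an empty target list are assumptions.

```python
def build_soe_examples(gold_entities, gold_triples, use_all_entities=True):
    """Build (candidate_subject, targets) pairs for the object module.

    gold_entities: list of entity spans, e.g. (start, end) tuples.
    gold_triples: list of (subject_span, relation_type, object_span).
    use_all_entities: if True, every gold entity is a candidate subject
        ("golden entity" setting); if False, only true subjects are used
        ("golden subject" setting).
    """
    objects_of = {}
    for subj, relation, obj in gold_triples:
        objects_of.setdefault(subj, []).append((relation, obj))
    if use_all_entities:
        candidates = list(gold_entities)
    else:
        candidates = [e for e in gold_entities if e in objects_of]
    # Candidates that are not real subjects get an empty target list, so
    # the module also sees examples where nothing should be extracted.
    return [(c, objects_of.get(c, [])) for c in candidates]


entities = [(0, 1), (3, 4), (6, 7)]
triples = [((0, 1), "USED-FOR", (3, 4))]
all_examples = build_soe_examples(entities, triples)
subject_only = build_soe_examples(entities, triples, use_all_entities=False)
```

With all gold entities as candidates, the module is trained on negative subjects as well, which matches the kind of input it receives at inference time when the subject module makes mistakes.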

4.3.4 Joint Vs Independent

To explore whether RE helps NER, we compare the joint model with the independent NER model. Using the same model architecture, we test the performance of a single entity extraction module without jointly training subject extraction and object extraction. We observe that the joint model outperforms the single independent model on NER F1 score, proving the contribution of the RE information to NER.

4.4 Analysis

4.4.1 Analysis on entity prediction

To understand why our methods achieve better NER results than previous joint methods, especially PFN Yan et al. (2021) which also models inter-task interaction, we follow Yan et al. (2021) to design comparison experiments on NER by splitting entities into two groups: in-triple entities and out-of-triple entities. In-triple entities appear in relation triples in the dataset, while out-of-triple entities have no relations with other entities. For ACE05, 64% of the entities in the test split are out-of-triple while the rest are in-triple. For SciERC, 22% of the entities in the test set are out-of-triple while the others are in-triple.
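The grouping used in this analysis can be sketched in a few lines. Entities are represented here by their surface strings from Figure 1 for readability; the function name and this representation are assumptions for illustration.

```python
def split_by_triple_membership(entities, triples):
    """Split entities into in-triple and out-of-triple groups.

    entities: list of hashable entity identifiers.
    triples: list of (subject, relation_type, object) using the same
        identifiers as `entities`.
    """
    in_triple_ids = set()
    for subj, _relation, obj in triples:
        in_triple_ids.add(subj)
        in_triple_ids.add(obj)
    in_triple = [e for e in entities if e in in_triple_ids]
    out_of_triple = [e for e in entities if e not in in_triple_ids]
    return in_triple, out_of_triple


entities = [
    "primal-dual algorithm",
    "motion segments",
    "classical motion segmentation",
    "optical flow",
]
triples = [("primal-dual algorithm", "USED-FOR", "motion segments")]
in_triple, out_of_triple = split_by_triple_membership(entities, triples)
```

Precision, recall and F1 are then computed separately on each group, as reported in Table 3.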

As shown in Table 3, for out-of-triple entities, our model with the early fusion strategy outperforms PFN by 3.8% F1 on SciERC and 1.0% F1 on ACE05. As for in-triple results, our methods outperform PFN on SciERC and achieve close performance on ACE05. This demonstrates the superiority of our model in handling out-of-triple entities and the effectiveness of our multi-task interaction.

4.4.2 Analysis on fusion strategies

| Strategy | NER | RE | Speed (sentence/s) |
|---|---|---|---|
| early fusion | 69.2 | 38.7 | 9.04 |
| late fusion | 68.6 | 36.7 | 295.04 |

Table 4: Comparison of F1 performance (%) and inference speed on SciERC test sets. The speed is measured on a single NVIDIA GeForce 2080Ti GPU with a batch size of 32.

As shown in Table 4, we evaluate the inference speed of the two sequential information propagation strategies in our methods. We run inference with the early fusion strategy and the late fusion strategy on the SciERC test set with batch size 32 and SciBERT as the pre-trained embedder. We observe that the model with the late fusion strategy is about 32.6 times faster than early fusion (295.04 vs. 9.04 sentences per second). This is because the late fusion strategy reuses the embedder and encoder results, which saves a lot of computation. Meanwhile, the model with early fusion achieves better performance by sacrificing time efficiency.

5 Conclusion

In this paper, we propose a unified cascade framework for entity-relation extraction by reformulating it as three span extraction tasks, which unifies the task form and eliminates the gap between NER and RE. To achieve effective multi-task interaction, we propose a selection recurrent network which models implicit interaction among three sub-tasks and design two prior information fusing strategies to realize explicit sequential information flow between NER and RE. We conduct extensive experiments on two benchmarks to verify the effectiveness of our framework. We also conduct ablation studies to explore how different factors impact the final performance and extensive analysis experiments to understand the reasons for our improvements. Lastly, our framework successfully averts the weaknesses of previous approaches in modeling the interaction between NER and RE, and realizes the mutual benefits of the two tasks.

References

  • R. Bunescu and R. Mooney (2005) A shortest path dependency kernel for relation extraction. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pp. 724–731. Cited by: §1.
  • Y. S. Chan and D. Roth (2011) Exploiting syntactico-semantic structures for relation extraction. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 551–560. Cited by: §2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT (1), Cited by: Appendix E, §1, §3.2.
  • R. Florian, H. Hassan, A. Ittycheriah, H. Jing, N. Kambhatla, X. Luo, H. Nicolov, and S. Roukos (2004) A statistical model for multilingual entity detection and tracking. Technical report IBM THOMAS J WATSON RESEARCH CENTER YORKTOWN HEIGHTS NY. Cited by: §1, §2.
  • P. Gupta, H. Schütze, and B. Andrassy (2016) Table filling multi-task recurrent neural network for joint entity and relation extraction. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 2537–2547. Cited by: §1, §2.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: Appendix E.
  • Q. Li and H. Ji (2014) Incremental joint extraction of entity mentions and relations. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 402–412. Cited by: Table 1.
  • X. Li, J. Feng, Y. Meng, Q. Han, F. Wu, and J. Li (2020) A unified MRC framework for named entity recognition. In ACL, Cited by: Appendix B, §3.5.
  • X. Li, F. Yin, Z. Sun, X. Li, A. Yuan, D. Chai, M. Zhou, and J. Li (2019) Entity-relation extraction as multi-turn question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 1340–1350. Cited by: Table 1.
  • Y. Luan, L. He, M. Ostendorf, and H. Hajishirzi (2018) Multi-task identification of entities, relations, and coreference for scientific knowledge graph construction. In EMNLP, Cited by: Appendix D, §1, §4.1.1.
  • M. Miwa and M. Bansal (2016) End-to-end relation extraction using LSTMs on sequences and tree structures. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1105–1116. Cited by: Table 1.
  • M. Miwa and Y. Sasaki (2014) Modeling joint entity and relation extraction with table representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1858–1869. Cited by: §1, §2.
  • L. Ratinov and D. Roth (2009) Design challenges and misconceptions in named entity recognition. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009), pp. 147–155. Cited by: §1.
  • E. F. Sang and F. De Meulder (2003) Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. arXiv preprint cs/0306050. Cited by: §1, §1, §2.
  • T. Tran and R. Kavuluru (2019) Neural metric learning for fast end-to-end relation extraction. arXiv preprint arXiv:1905.07458. Cited by: §1, §2.
  • C. Walker, S. Strassel, J. Medero, and K. Maeda (2006) ACE 2005 multilingual training corpus. Linguistic Data Consortium, Philadelphia 57, pp. 45. Cited by: Appendix D, §1, §4.1.1.
  • J. Wang and W. Lu (2020) Two are better than one: joint entity and relation extraction with table-sequence encoders. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1706–1721. Cited by: Appendix E, §1, §2, Table 1.
  • Y. Wang, C. Sun, Y. Wu, J. Yan, P. Gao, and G. Xie (2020a) Pre-training entity relation encoder with intra-span and inter-span information. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1692–1705. Cited by: Table 1.
  • Y. Wang, C. Sun, Y. Wu, H. Zhou, L. Li, and J. Yan (2021) UniRE: a unified label space for entity relation extraction. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 220–231. Cited by: §1, §2.
  • Y. Wang, B. Yu, Y. Zhang, T. Liu, H. Zhu, and L. Sun (2020b) TPLinker: single-stage joint extraction of entities and relations through token pair linking. In Proceedings of the 28th International Conference on Computational Linguistics, pp. 1572–1582. Cited by: §1, §2.
  • Z. Wei, J. Su, Y. Wang, Y. Tian, and Y. Chang (2020) A novel cascade binary tagging framework for relational triple extraction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 1476–1488. Cited by: §1, §2.
  • Z. Yan, C. Zhang, J. Fu, Q. Zhang, and Z. Wei (2021) A partition filter network for joint entity and relation extraction. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 185–197. Cited by: Appendix B, Appendix E, Appendix F, Appendix F, Appendix G, Appendix H, §1, §1, §2, §4.1.2, §4.2.1, §4.4.1, Table 1, Table 3.
  • B. Yu, Z. Zhang, X. Shu, T. Liu, Y. Wang, B. Wang, and S. Li (2020) Joint extraction of entities and relations based on a novel decomposition strategy. In ECAI 2020, pp. 2282–2289. Cited by: §1, §2.
  • D. Zelenko, C. Aone, and A. Richardella (2003) Kernel methods for relation extraction. Journal of Machine Learning Research 3 (Feb), pp. 1083–1106. Cited by: §1.
  • L. Zhang, Y. Pan, and T. Zhang (2004) Focused named entity recognition using machine learning. In Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 281–288. Cited by: §1.
  • M. Zhang, Y. Zhang, and G. Fu (2017) End-to-end neural relation extraction with global optimization. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 1730–1740. Cited by: §1, §2.
  • Z. Zhong and D. Chen (2021) A frustratingly easy approach for entity and relation extraction. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 50–61. Cited by: Appendix E, §1, §1, §2, §3.4.1, §4.1.2, §4.2.1, Table 1.

Appendix A Appendix

Appendix B Span Extraction Model

Entity span extraction, subject span extraction, and object span extraction are all performed with a span extraction model of unified architecture, inspired by span selection in machine reading comprehension Li et al. (2020). Specifically, we first adopt two binary sequence classifiers with identical architecture to detect the start and end positions of entities/subjects/objects, respectively, by assigning each token a binary tag (0/1) that indicates whether the token is a start or end position of a target span. Then, we use a token-pair classifier to match the detected start and end positions into spans. Here, we take entity extraction as an example to explain the process of span extraction.

Span Start/End Prediction
Figure 4: An example of span extraction output for entity extraction. We combine outputs of different entity types into one output space in this figure for ease of illustration. In the proposed span extraction model, two kinds of binary sequence classifiers are leveraged to predict the start and end sequence while a matching classifier is used to predict the start-end match table. P is short for Person. I is short for Institution. C is short for City. O is short for Others.

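Given the task-specific representation $h^{e}_{i}$ of each token $x_i$, we use a token-level classifier composed of a linear layer and a sigmoid function, which calculates the probability of the labels of tokens for each entity type $k$:

$$p^{start}_{i,k} = \sigma\left(W^{start}_{k} h^{e}_{i} + b^{start}_{k}\right), \qquad p^{end}_{j,k} = \sigma\left(W^{end}_{k} h^{e}_{j} + b^{end}_{k}\right) \tag{12}$$

where $W^{start}_{k}$, $b^{start}_{k}$, $W^{end}_{k}$, and $b^{end}_{k}$ are trainable parameters of the classifier, and $p^{start}_{i,k}$ and $p^{end}_{j,k}$ represent the probability of classifying token $x_i$ and $x_j$ as the start and end token of an entity span of type $k$, respectively.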

Start-End Matching

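One sentence may include multiple entities of the same entity type, so the classifiers may predict multiple start tokens and multiple end tokens. We additionally develop a matching model to match each start token with its end token. The design of the matching model is similar to the table filling method Yan et al. (2021), as shown in Figure 4. For each token pair $(x_i, x_j)$, we concatenate the token-level entity features $h^{e}_{i}$ and $h^{e}_{j}$, then feed them into a fully-connected layer with a sigmoid function to calculate the probability of tokens $x_i$ and $x_j$ being the start and end position of the same $k$-type entity:

$$p^{match}_{i,j,k} = \sigma\left(W^{match}_{k} \left[h^{e}_{i}; h^{e}_{j}\right] + b^{match}_{k}\right) \tag{13}$$

where $[\cdot;\cdot]$ denotes concatenation and $W^{match}_{k}$ and $b^{match}_{k}$ are the trainable parameters of the fully-connected layer.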

Subject and object extraction use the same model as the entity extraction described above. The subject model yields start, end, and start-end match predictions for subjects; the object model yields the corresponding start, end, and match predictions for objects, given a relation type and a subject.
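The two-step procedure of this appendix (start/end prediction followed by start-end matching) can be sketched as a small decoding routine for one entity type. This is a minimal illustration, not the authors' implementation: the function name `decode_spans`, the 0.5 decision threshold, and the toy probabilities are all assumptions.

```python
from typing import List, Tuple


def decode_spans(
    start_prob: List[float],
    end_prob: List[float],
    match_prob: List[List[float]],
    threshold: float = 0.5,  # assumed decision threshold, not given in the text
) -> List[Tuple[int, int]]:
    """Return (start, end) token index pairs for one entity type."""
    # Step 1: binary start/end tagging of every token.
    starts = [i for i, p in enumerate(start_prob) if p > threshold]
    ends = [j for j, p in enumerate(end_prob) if p > threshold]
    # Step 2: keep a (start, end) pair only if the token-pair matcher accepts it.
    spans = []
    for i in starts:
        for j in ends:
            if j >= i and match_prob[i][j] > threshold:
                spans.append((i, j))
    return spans


# Toy 5-token sentence with two same-type entities: tokens [0, 1] and token [3].
start_prob = [0.9, 0.1, 0.2, 0.8, 0.1]
end_prob = [0.1, 0.7, 0.1, 0.9, 0.2]
match_prob = [[0.0] * 5 for _ in range(5)]
match_prob[0][1] = 0.95
match_prob[3][3] = 0.90
match_prob[0][3] = 0.05  # both boundaries are detected, but the pair is rejected

spans = decode_spans(start_prob, end_prob, match_prob)
print(spans)
```

The matcher is what separates the two entities here: without it, start 0 and end 3 would also form a candidate span.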

Appendix C Training and Inference

During training, the overall loss function is the sum of three losses, one each for the entity, subject, and object extraction modules. The entity extraction loss consists of three parts, which represent the loss of entity start, entity end, and start-end match, respectively. The subject and object modules are trained in the same way with their own losses.

(14)

where the start label is a binary label indicating whether a token is the start of an entity span of a given type, the end label indicates whether a token is the end of such a span, and the match label is the start-end matching label for a token pair. Similarly, we can compute the subject extraction loss and the object extraction loss, and the overall training objective to be minimized is the sum of the three span extraction losses:

(15)

where the weighting hyper-parameters lie in the range 0-1 and balance the effects of the start/end/matching losses of each span extraction module.
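The structure of this objective can be sketched as follows. The sketch assumes binary cross-entropy for each of the start, end, and match parts, and a single set of weights shared by all three modules; neither choice is spelled out in the text above, and the names `bce`, `span_loss`, `overall_loss`, `alpha`, `beta`, and `gamma` are hypothetical.

```python
import math
from typing import Sequence


def bce(probs: Sequence[float], labels: Sequence[int], eps: float = 1e-12) -> float:
    """Mean binary cross-entropy between probabilities and 0/1 labels (assumed loss)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


def span_loss(start, end, match, alpha=1.0, beta=1.0, gamma=1.0) -> float:
    """Weighted start/end/match loss of one module; each argument is (probs, labels)."""
    return alpha * bce(*start) + beta * bce(*end) + gamma * bce(*match)


def overall_loss(entity, subject, obj, **weights) -> float:
    """Sum of the entity, subject, and object span extraction losses."""
    return (
        span_loss(*entity, **weights)
        + span_loss(*subject, **weights)
        + span_loss(*obj, **weights)
    )


# Toy (probs, labels) triples; the match table is flattened into a list.
toy = (([0.9, 0.1], [1, 0]), ([0.2, 0.8], [0, 1]), ([0.7], [1]))
module_loss = span_loss(*toy)
total = overall_loss(toy, toy, toy)
print(round(module_loss, 4), round(total, 4))
```

Setting one of the weights to 0 removes the corresponding part from every module, which is a quick way to check how much each part contributes.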

For training the subject module and object module, we only consider the ground-truth entity labels as the input. In order to mitigate the exposure bias problem when training the object module, we regard all ground-truth entities (including those that are not subjects) in the training data as candidate subjects and perform object extraction for each of them.

During inference, the entity extraction module first predicts the entities in the input sentence, then the predicted entity information is fused into the subject module to predict all candidate subjects. Finally, the object module extracts objects for these predicted subjects one by one, while being aware of the entity and subject information predicted by prior modules.
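The inference order described above can be sketched as a three-stage cascade. The three model arguments are placeholder callables with hypothetical signatures, and the toy stubs below only stand in for trained modules.

```python
from typing import Callable, Dict, List, Tuple

Span = Tuple[int, int]
Triple = Tuple[Span, str, Span]


def cascade_inference(
    tokens: List[str],
    entity_model: Callable[[List[str]], List[Span]],
    subject_model: Callable[[List[str], List[Span]], List[Span]],
    object_model: Callable[[List[str], List[Span], Span], Dict[str, List[Span]]],
) -> Tuple[List[Span], List[Triple]]:
    # Stage 1: predict entities from the sentence alone.
    entities = entity_model(tokens)
    # Stage 2: predict candidate subjects, given the predicted entities.
    subjects = subject_model(tokens, entities)
    # Stage 3: extract objects for each subject, one subject at a time,
    # with access to both the entity and the subject predictions.
    triples: List[Triple] = []
    for subj in subjects:
        for relation, objects in object_model(tokens, entities, subj).items():
            for obj in objects:
                triples.append((subj, relation, obj))
    return entities, triples


# Toy stubs standing in for trained modules (illustrative values only).
tokens = ["BERT", "is", "used", "for", "NER"]
entities, triples = cascade_inference(
    tokens,
    entity_model=lambda toks: [(0, 0), (4, 4)],
    subject_model=lambda toks, ents: [(0, 0)],
    object_model=lambda toks, ents, s: {"USED-FOR": [(4, 4)]} if s == (0, 0) else {},
)
print(entities, triples)
```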

Appendix D Datasets

Dataset #Train #Dev #Test |E| |R|
ACE05 10051 2424 2050 7 6
SciERC 1861 275 551 6 7
Table 5: Statistics of datasets. #Train, #Dev and #Test are the numbers of sentences in each split; |E| and |R| are the numbers of entity and relation types.

We evaluate our methods on two standard entity-relation extraction benchmarks: ACE05 Walker et al. (2006) and SciERC Luan et al. (2018). ACE05 is collected from various domains, including newswire and online forums. SciERC is collected from AI paper abstracts and contains annotations of scientific entities and their relations. The statistics of the datasets are given in Table 5.

We do not evaluate on another common dataset, ACE04, because it does not provide official train/val/test splits, which easily renders the evaluation results incomparable across previous works. Moreover, the data distribution of ACE04 overlaps with that of ACE05.

Appendix E Implementation details

We leverage the pre-trained language model BERT Devlin et al. (2019) as the text embedder in our methods. Following previous work Wang and Lu (2020); Zhong and Chen (2021); Yan et al. (2021), the versions we use are albert-xxlarge-v1 for ACE05 and scibert-scivocab-uncased for SciERC. The training batch size and learning rate are set to 8 and 1e-5, respectively. We set the hidden state dimension of the SRN encoder to 128. The model parameters are optimized with the Adam optimizer Kingma and Ba (2014) for 100 epochs. To prevent gradient explosion, gradient clipping is applied during training. The model checkpoint is validated at an interval of 100 steps and selected based on the best sum of NER F1 score and RE F1 score on the validation set. All experiments are conducted on a single Tesla-P100 GPU or a single NVIDIA GeForce 2080Ti GPU.
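The checkpoint selection rule above (best sum of NER F1 and RE F1 on the validation set) can be sketched in a few lines. The function name `select_checkpoint` and the validation numbers are made up for illustration and are not results from the paper.

```python
from typing import List, Tuple

# One record per validation run: (training step, NER F1, RE F1).
Record = Tuple[int, float, float]


def select_checkpoint(history: List[Record]) -> Record:
    """Pick the record with the highest NER F1 + RE F1 on the validation set."""
    return max(history, key=lambda rec: rec[1] + rec[2])


# Hypothetical validation history, evaluated every 100 steps.
history = [(100, 60.0, 30.0), (200, 66.0, 35.5), (300, 67.0, 34.0)]
best = select_checkpoint(history)
print(best)
```

Under this rule the step-300 checkpoint loses despite its higher NER F1, because its combined score (101.0) is below that of step 200 (101.5).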

Appendix F Error Analysis of Relation Prediction

Figure 5: Distribution of six types of relation extraction errors on SciERC test set.

Our methods have similar RE performance to PFN Yan et al. (2021). We explore the specific differences by analyzing the error types for relation extraction. We split these errors into six groups: (1) Relation type error (RTE): only the relation type of the predicted relation triple is wrong. (2) Relation not found (RNF): a gold relation is not included in the predicted relation triples. (3) Entity type error (ETE): the entity type of the subject or object in the predicted relation triple is wrong. (4) Entity not found (ENF): the subject or object of a gold relation is not extracted by the NER model. (5) Relation match error (RME): the subject and object of the predicted relation triple are correct entities, but they do not have the predicted relation. (6) Entity span error (ESE): the subject or object of the predicted relation is a wrong entity span.

Figure 5 presents the distribution of the six error types on the SciERC test set for PFN Yan et al. (2021), Cascade-SRN (late fusion) and Cascade-SRN (early fusion). Compared with PFN and Cascade-SRN (late fusion), Cascade-SRN (early fusion) has fewer RNF and RME errors. These two error types are directly related to the quality of relation extraction, which supports the effectiveness of our framework and the early fusion strategy. However, the early fusion strategy makes the embedder and encoder aware of prior entity information for the subject and object modules, so our models become more sensitive to entity prediction results, leading to more ESE errors.

Appendix G Selection Recurrent Network for Two Subtasks

Figure 6: Architecture of SRN for two subtasks.

In this section, we present the detailed implementation of the Selection Recurrent Network (SRN) for two subtasks (entity extraction and relation extraction). We calculate the current state and the filtered history state with the forget and output gates as in Equations 1 and 2.

After obtaining the current state and the filtered history state, we perform a selection operation to generate two filtered memories, which store intra-task information, and one shared memory, which stores inter-task information.

First, we calculate two master gates, one for each subtask. The master gates select task-related neurons from the state representation: each element of a master gate indicates whether the corresponding neuron belongs to the specific task. Like the forget gate and the output gate, the two master gates are generated from the history hidden state and the input representation.

(16)

Then, based on these two master gates, we calculate one shared gate and two independent task gates.

(17)

To model interaction between two subtasks, information stored in independent memory is inaccessible for other tasks. Instead, information stored in shared memory can be leveraged by different tasks.

(18)

Combining the independent memories and the shared memory, we update the cell state and the hidden state for the next time step as in Equations 8 and 9.

Meanwhile, we generate the task-specific independent representations and the shared representation from the corresponding memories:

(19)

After we obtain specific representations for all tokens in the sentence, we leverage these representations to achieve entity extraction and relation extraction.
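To make the gate splitting in this appendix concrete, the following sketch shows one way two master gates could be turned into one shared gate and two independent gates. The element-wise product and difference used here are assumptions chosen for illustration, not the formulas of Equations 16 to 18, and the names `split_gates` and `apply_gate` are hypothetical. Hard 0/1 gates are used so the result is easy to read; learned gates would take values between 0 and 1.

```python
from typing import List, Tuple

Vector = List[float]


def split_gates(m_e: Vector, m_r: Vector) -> Tuple[Vector, Vector, Vector]:
    """Split two master gates into (shared, entity-only, relation-only) gates.

    Assumption: a neuron is shared when both master gates select it, and
    independent when only one of them does.
    """
    shared = [a * b for a, b in zip(m_e, m_r)]
    only_e = [a - s for a, s in zip(m_e, shared)]
    only_r = [b - s for b, s in zip(m_r, shared)]
    return shared, only_e, only_r


def apply_gate(gate: Vector, state: Vector) -> Vector:
    """Element-wise gating of a state vector."""
    return [g * x for g, x in zip(gate, state)]


m_e = [1.0, 1.0, 0.0, 0.0]  # master gate of the entity task
m_r = [0.0, 1.0, 1.0, 0.0]  # master gate of the relation task
state = [0.5, 1.0, 2.0, 3.0]

shared, only_e, only_r = split_gates(m_e, m_r)
entity_memory = apply_gate(only_e, state)    # visible to the entity task only
relation_memory = apply_gate(only_r, state)  # visible to the relation task only
shared_memory = apply_gate(shared, state)    # visible to both tasks
print(entity_memory, relation_memory, shared_memory)
```

In this toy case neuron 0 is entity-only, neuron 1 is shared, neuron 2 is relation-only, and neuron 3 is selected by neither task.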

Difference with PFN Yan et al. (2021)

Despite the similar objective of separating task-specific and shared information in the token representations, SRN differs from PFN in the following aspects: (1) The selection mechanism in SRN uses simple gates to select the neurons relevant to specific tasks, whereas the partition operation in PFN sorts neurons to aggregate those related to the same task, so that PFN is only practicable for two-task learning. Our selection operation does not need to sort the neurons, which lets our model scale to multiple tasks without impairing effectiveness. (2) PFN adds the independent memory and the shared memory to obtain the final task-specific representation. In contrast, SRN directly utilizes the independent memory and leaves it to the downstream task models to decide how to use the task-specific and shared information from the representations. This allows more flexible use of the SRN outputs. (3) Inspired by the LSTM model, we incorporate a forget gate and an output gate. Compared to PFN, the forget gate allows our model to learn a forgetting mechanism. Besides, in PFN the hidden state is just the cell state passed through a tanh function, while we add an output gate that enhances the expressive ability of our model.

Appendix H Analysis of Encoder

Model Entity Relation
PFN 66.8 38.4
Table-SRN 67.8 38.3
Table 6: Comparison results of PFN and SRN under a two-task joint framework.

SRN has a similar function to the PFN Yan et al. (2021) encoder; the main difference is that PFN uses a partition operation while SRN uses a selection operation. The flexibility of the selection operation enables SRN to encode more than two sub-tasks. We perform experiments to check whether replacing PFN with SRN hurts performance in the two-sub-task setting. We develop a Table-SRN model, which only replaces the PFN encoder of the joint table-filling framework designed by Yan et al. (2021) with SRN. As shown in Table 6, under the same entity-relation extraction framework, the model using the SRN encoder increases the NER F1 score by 1.0 point (67.8 vs. 66.8) and has similar RE performance to PFN (38.3 vs. 38.4). The results suggest that SRN not only scales better but also improves NER encoding while keeping RE performance on par with the original encoder.