Recurrent Interaction Network for Jointly Extracting Entities and Classifying Relations

05/01/2020 · Kai Sun et al. · Beihang University, uOttawa

Named entity recognition (NER) and relation extraction (RE) are two fundamental tasks in natural language processing applications. In practice, these two tasks often need to be solved simultaneously. Traditional multi-task learning models implicitly capture the correlations between NER and RE. However, there exist intrinsic connections between the outputs of NER and RE. In this study, we argue that an explicit interaction between the NER model and the RE model will better guide the training of both models. Based on the traditional multi-task learning framework, we design an interactive feature encoding method to capture the intrinsic connections between the NER and RE tasks. In addition, we propose a recurrent interaction network to progressively capture the correlation between the two models. Empirical studies on two real-world datasets confirm the superiority of the proposed model.


1 Introduction

Named entity recognition (NER) and relation extraction (RE) are two crucial tasks for information extraction from textual data. NER aims to extract all entities in a sentence, and RE aims to classify the relation between given entities. In practice, both tasks are often required to be solved simultaneously. Consider the sentence John was born in Sheffield which is a city of England as an example. The goal of joint entity and relation extraction is to identify all the relational triples it expresses, namely (Sheffield, birth_place_of, John) and (England, contains, Sheffield). This joint task plays a vital role in extracting structured knowledge from unstructured text, which is deemed important for several applications, including knowledge base construction Komninos and Manandhar (2017); Deng et al. (2019); Nathani et al. (2019).

The simplest approach to this joint task is a pipeline that first extracts all entities in the sentence and then classifies the relation between all entity pairs Zelenko et al. (2003); Zhou et al. (2005); Chan and Roth (2011). However, pipeline-based approaches ignore the correlation between the NER and RE tasks and may suffer from error propagation.

Considering the close correlation between the NER and RE tasks, many recent studies have focused on joint relation extraction and named entity recognition. Multi-task learning (MTL) techniques Collobert and Weston (2008) have been exploited Miwa and Bansal (2016); Zeng et al. (2018); Fu et al. (2019) to capture the correlation between NER and RE and improve joint extraction performance. These MTL-based models implicitly model the correlation of NER and RE via a shared common representation. It is worth noting that the output of RE helps the prediction of NER and vice versa.

Take the sentence Reynolds had been in CBS for ten years as an example. If we know that the relation company_of holds between CBS and Reynolds, then CBS and Reynolds are very likely to be an Organization and a Person, respectively, and very unlikely to both be Persons. On the other hand, if we know that CBS and Reynolds are an Organization and a Person, then the relation between them cannot be nationality_of. In this study, we regard NER and RE as dual tasks and introduce an interaction model between the outputs of NER and RE to explicitly leverage their correlation to guide the training of both models.

(A) MTL (B) Interactive MTL
Figure 1: (A) A generalized multi-task model. (B) An interaction-augmented multi-task model. The shared representation of the input is fed into two task-specific models, which correspond to NER and RE in this study.

As shown in Figure 1 (A), most previous works jointly extract entities and relations in a multi-task learning framework that focuses on learning shared layers to extract common features for NER and RE. The learned common features are then fed into two independent modules for NER and RE. If the shared layers compute sufficient statistics of the input for predicting both outputs, the two task modules are sufficiently expressive, and the data supports their learning, there is no need for further interaction. However, this is not always the case, especially when the two outputs correlate with each other. An interaction between the two task modules can then be introduced, as shown in Figure 1 (B). Although a single interaction step may be of limited expressiveness, multiple interactions allow each component to be less expressive while the overall model remains sufficiently expressive.

Following this motivation, we propose a recurrent interaction network (RIN) to capture the correlations between the NER and RE dual tasks. Specifically, we present a method to learn dual-task interaction features that represent the "degree of alignment" of NER and RE on each word. We further introduce a recurrent structure to progressively refine the predictions of NER and RE based on the learned dual-task interaction features. Empirical studies on the NYT and WebNLG datasets achieve new state-of-the-art performance and confirm the effectiveness of the presented RIN. A further experiment that introduces a pre-trained BERT Devlin et al. (2019) model as the sentence encoder shows a significant performance gain over the BiLSTM encoder, which suggests the utility of BERT in joint entity and relation extraction.

2 Related Work

Extracting relational facts from raw text is one of the most important tasks in natural language processing. In the earlier relation extraction (RE) task, the goal is to classify the relation between two given entities into one of the pre-defined relations. Most researchers adopted sequence-based models and attention mechanisms to encode each word of the sentence and distill a vector representation, which is then passed to a classifier Zhang et al. (2015); Shen and Huang (2016); Wang et al. (2016). Some other studies also incorporate the dependency grammar information of the sentence into the encoder to achieve better representations and classification accuracy Xu et al. (2015); Miwa and Bansal (2016); Zhang et al. (2018); Guo et al. (2019). These methods, while effective, are limited to relation extraction with given entities.

A more challenging task is to extract all relational facts from an arbitrary sentence that is not accompanied by marked entities. Recently, deep neural network based joint models have been exploited to unite the NER and RE problems. Zheng et al. (2017) propose a tagging strategy that transfers this task into a sequence prediction problem, where the labels of relation and entity type are shared in a common space. This joint model is capable of extracting entities and relations simultaneously. However, it fails to handle the case where more than one relation exists between two entities. Zeng et al. (2018) propose an end-to-end sequence-to-sequence model that detects a relational triple by first decoding the relation and then decoding the two entities of the relation. However, the number of relational triples that can be extracted from a sentence is limited to a predefined constant, and the model cannot extract entities consisting of multiple words. A more recent study Wei et al. (2019) addresses these limitations by transferring the task into subject tagging and relation-specific object tagging. A two-level framework is presented in which the low-level tagging module recognizes all possible subjects and the high-level tagging module identifies all possible objects for each relation.

It is worth noting that the above-mentioned works seldom consider the implicit constraints and connections between NER and RE. Multi-task learning techniques have been introduced to implicitly model the interaction between NER and RE. Fu et al. (2019) follow the generalized MTL framework and exploit a Bi-RNN and a GCN to extract both sequential and regional dependency word features of the sentence. The shared features are then fed into two independent classifiers for RE and NER, respectively. As discussed in the Introduction, the explicit correlation is also a potential constraint for improving the learning of both the NER and RE models.

He et al. (2019) proposed an interactive multi-task learning network for jointly extracting aspects and classifying their sentiment. Both sub-tasks are regarded as sequence prediction problems, which differs from our joint NER and RE setting. In addition, the linear transformation used for the interaction between the outputs of the sub-tasks is not sufficient to model the interaction between the models.

Figure 2: Overview of RIN. Two feature extractors derive a relation-specific feature and an entity-specific feature from the sentence embedding, which are fed into the relation extraction module and the entity recognition module, respectively. INT encodes the interaction information between the two sub-tasks.

3 Problem Statement

In this section, we formally describe the problem. Given a set of pre-defined relation types $\mathcal{R}$ and a sentence $S = \{w_1, \ldots, w_n\}$ of $n$ words, the problem is to extract all relational triples expressed in the sentence. A single relational triple is written as $(w_i, r, w_j)$, where the relation $r \in \mathcal{R}$ and the entity words $w_i, w_j \in S$. In the case where a phrase of multiple words forms an entity, we denote the entity by the beginning word of the entity phrase. Note that one word, and even the same entity pair, may be involved in multiple relational triples, and the sequential order of the two words in a triple matters. From a probabilistic point of view, we predict the probability $p^{r}_{ij}$ that the relational triple $(w_i, r, w_j)$ holds. When the relation is more likely to hold than not, i.e., $p^{r}_{ij} > 0.5$, we extract it.
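To make this formulation concrete, the following minimal sketch (plain Python with toy values; the variable names and scores are ours, not the authors') shows how pairwise triple probabilities are thresholded at 0.5 to produce extracted triples for the example sentence from the Introduction.

```python
# Toy illustration of the extraction decision p > 0.5 (hypothetical scores).
words = ["John", "was", "born", "in", "Sheffield", "which", "is",
         "a", "city", "of", "England"]

# scores[(i, r, j)]: predicted probability that relation r holds between
# head word w_i and tail word w_j (indices into `words`).
scores = {
    (4, "birth_place_of", 0): 0.93,   # (Sheffield, birth_place_of, John)
    (10, "contains", 4): 0.88,        # (England, contains, Sheffield)
    (4, "contains", 10): 0.07,        # order matters: this direction is rejected
}

# A triple is extracted when it is more likely to hold than not.
triples = [(words[i], r, words[j]) for (i, r, j), p in scores.items() if p > 0.5]
print(triples)
# [('Sheffield', 'birth_place_of', 'John'), ('England', 'contains', 'Sheffield')]
```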

4 Model

In this section, we describe our model. First, we introduce the recurrent interaction network (RIN). Next, we present the NER and RE modules. Finally, we show the input and training objective of our model. The framework of RIN is shown in Figure 2.

4.1 Recurrent Interaction Network

As discussed above, the output of RE helps the prediction of NER and vice versa. Based on this assumption, we aim to model the interaction of NER and RE and feed the interaction result back to refine the predictions of both tasks. Assume that for each word $w_i$ of the given sentence $S$, we have extracted a relation-specific feature vector $\mathbf{h}^{r}_i$ and an entity-specific feature vector $\mathbf{h}^{e}_i$ from its word embedding $\mathbf{x}_i$. All the feature vectors of the words in $S$ make up the corresponding sentence-level feature matrices. The NER module predicts the entity label distribution $\mathbf{p}^{e}_i$ based on the entity-specific features, and the RE module predicts the relation probabilities $p^{r}_{ij}$ based on the relation-specific features. The key idea behind our model is to encode the interaction among the word embedding and the sub-task results into an interaction feature $\mathbf{t}_i$, and then to update the task features based on $\mathbf{t}_i$. The feature $\mathbf{t}_i$ is supposed to contain information about the "alignment" of the NER and RE results on each word of the sentence.

We introduce the interaction (INT) module to extract the interaction information. For each word $w_i$ with word embedding $\mathbf{x}_i$, the INT module learns an interaction feature vector $\mathbf{t}_i$ from the sub-task results according to the following calculation:

$\mathbf{p}^{r}_i = \big[\,\max_{j} p^{r_1}_{ij};\ \ldots;\ \max_{j} p^{r_m}_{ij}\,\big]$   (1)
$\mathbf{u}_i = \big[\,\mathbf{p}^{r}_i;\ \mathbf{p}^{e}_i;\ \mathbf{x}_i\,\big]$   (2)
$\mathbf{t}_i = \mathrm{ReLU}\big(\mathbf{W}_{t}\,\mathbf{u}_i + \mathbf{b}_{t}\big)$   (3)

where $[\,\cdot\,;\,\cdot\,]$ denotes the concatenation operation, $\mathrm{ReLU}$ is the ReLU activation function, $m = |\mathcal{R}|$ is the number of relation types, and $\mathbf{W}_{t}, \mathbf{b}_{t}$ are learnable model parameters. In the calculation of $\mathbf{t}_i$, we consider the probability of word $w_i$ participating in each relation with some other word, the possibility of $w_i$ being an entity, and the word embedding itself. By combining these three kinds of information in the INT module, we aim to learn a feature that conveys information about the alignment of NER and RE on word $w_i$. The interaction features of all words in the sentence make up the interaction feature matrix $\mathbf{T}$.
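As a rough illustration only, the following PyTorch sketch implements one plausible reading of the INT module described above; the max-pooling over partner words in Eq. (1) and all module and dimension names are our assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class Interaction(nn.Module):
    """Sketch of the INT module: fuse per-word relation evidence, entity-label
    distribution, and word embedding into an interaction feature."""
    def __init__(self, num_relations, num_entity_labels, embed_dim, int_dim):
        super().__init__()
        self.proj = nn.Linear(num_relations + num_entity_labels + embed_dim, int_dim)

    def forward(self, rel_probs, ent_probs, word_emb):
        # rel_probs: (batch, n, n, num_relations) pairwise relation probabilities
        # ent_probs: (batch, n, num_entity_labels) per-word entity label distribution
        # word_emb:  (batch, n, embed_dim) word embeddings
        # Summarize each word's involvement in every relation with any other word
        # (max-pooling over the partner word is an assumption).
        rel_summary = rel_probs.max(dim=2).values           # (batch, n, num_relations)
        fused = torch.cat([rel_summary, ent_probs, word_emb], dim=-1)
        return torch.relu(self.proj(fused))                 # interaction features T
```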

We employ two separate gated recurrent units (GRUs) to update the task features $\mathbf{h}^{r}_i$ and $\mathbf{h}^{e}_i$ based on the interaction feature $\mathbf{t}_i$. Taking the update of the relation-specific task features as an example, the new task feature $\hat{\mathbf{h}}^{r}_i$ of a word is obtained from the interaction feature of that word according to the following calculation:

$\mathbf{z}_i = \sigma\big(\mathbf{W}_{z}\,[\mathbf{t}_i;\ \mathbf{h}^{r}_i] + \mathbf{b}_{z}\big)$   (4)
$\mathbf{g}_i = \sigma\big(\mathbf{W}_{g}\,[\mathbf{t}_i;\ \mathbf{h}^{r}_i] + \mathbf{b}_{g}\big)$   (5)
$\tilde{\mathbf{h}}^{r}_i = \tanh\big(\mathbf{W}_{h}\,[\mathbf{t}_i;\ \mathbf{g}_i \odot \mathbf{h}^{r}_i] + \mathbf{b}_{h}\big)$   (6)
$\hat{\mathbf{h}}^{r}_i = (1 - \mathbf{z}_i) \odot \mathbf{h}^{r}_i + \mathbf{z}_i \odot \tilde{\mathbf{h}}^{r}_i$   (7)

where $[\,\cdot\,;\,\cdot\,]$ is the concatenation operation, $\sigma$ is the sigmoid function, and $\odot$ is the element-wise product. $\mathbf{W}_{z}, \mathbf{W}_{g}, \mathbf{W}_{h}$ and the corresponding bias terms are learnable model parameters. The update of the entity-specific task features $\mathbf{h}^{e}_i$ is analogous, using a separate GRU with its own parameters.

The updating process can be run for $K$ rounds. In the $k$-th updating round, relations are predicted based on the current relation-specific features and entities are labeled based on the current entity-specific features; the interaction feature is then extracted and used to update both sets of task features. We believe that by updating the task features in this recurrent way, the predictions of NER and RE are progressively refined over multiple updating rounds. We also conduct experiments with different numbers of updating rounds to verify this assumption. Finally, after the $K$-th updating round, we use the refined representations $\mathbf{h}^{r}_i$ and $\mathbf{h}^{e}_i$ for the final NER and RE predictions.
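The refinement loop itself can be sketched as below (PyTorch; all names are ours, and the re_head, ner_head and int_module callables are assumed to produce the shapes noted in the comments, e.g. the INT sketch above and the head sketches in the next two subsections).

```python
import torch.nn as nn

def refine(h_rel, h_ent, word_emb, re_head, ner_head, int_module,
           gru_rel: nn.GRUCell, gru_ent: nn.GRUCell, rounds: int = 2):
    """Sketch of RIN's recurrent refinement: predict, interact, update, repeat.

    h_rel, h_ent: (batch, n, d) relation-/entity-specific task features.
    word_emb:     (batch, n, embed_dim) contextual word embeddings.
    """
    batch, n, d = h_rel.shape
    for _ in range(rounds):
        rel_probs = re_head(h_rel)           # (batch, n, n, num_relations)
        ent_probs = ner_head(h_ent)          # (batch, n, 5)
        t = int_module(rel_probs, ent_probs, word_emb)   # (batch, n, int_dim)
        flat_t = t.reshape(-1, t.size(-1))
        # Each word position is treated as an independent GRU step: the
        # interaction feature is the input, the task feature is the state.
        h_rel = gru_rel(flat_t, h_rel.reshape(-1, d)).view(batch, n, d)
        h_ent = gru_ent(flat_t, h_ent.reshape(-1, d)).view(batch, n, d)
    # The refined features feed the final NER and RE predictions.
    return h_rel, h_ent
```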

4.2 Named Entity Recognition

The NER module recognizes all the entities in the sentence based on the entity-specific features $\mathbf{h}^{e}_i$. As one entity can consist of multiple words, we formalize this problem as tagging each word with an entity label from (Begin, Inside, End, Single, Out). When a word is tagged with a Begin label, it is the beginning word of a detected entity. More specifically, the NER module classifies each word into one of these five labels. The probability distribution $\mathbf{p}^{e}_i$ of word $w_i$ over the five labels is calculated from the entity feature $\mathbf{h}^{e}_i$ as follows:

$\mathbf{p}^{e}_i = \mathrm{softmax}\big(\mathbf{W}_{e}\,\mathbf{h}^{e}_i + \mathbf{b}_{e}\big)$   (8)

where $\mathbf{W}_{e}$ and $\mathbf{b}_{e}$ are learnable model parameters.
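A minimal sketch of such a tagging head in PyTorch (names and dimensions are ours, not the authors') follows.

```python
import torch
import torch.nn as nn

# The five word-level entity labels; e.g. "New York" -> Begin, End; "Sheffield" -> Single.
ENTITY_LABELS = ["Begin", "Inside", "End", "Single", "Out"]

class EntityTagger(nn.Module):
    """Sketch of the NER head: a linear layer plus softmax over the five labels."""
    def __init__(self, feat_dim):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, len(ENTITY_LABELS))

    def forward(self, h_ent):
        # h_ent: (batch, n, feat_dim) entity-specific features
        return torch.softmax(self.classifier(h_ent), dim=-1)
```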

4.3 Relation Extraction

The RE module extracts all the relational triples in the given sentence based on the relation-specific features $\mathbf{h}^{r}_i$. Following Fu et al. (2019), we consider all relations between all word pairs in the sentence. For a word pair $(w_i, w_j)$ and a considered relation $r$, relation extraction is cast as a binary classification problem. Specifically, the RE module calculates the probability $p^{r}_{ij}$ that the relation holds; if the relation is more likely to hold than not, i.e., $p^{r}_{ij} > 0.5$, we extract it. The classifier is defined as

$\mathbf{v}_{ij} = \mathrm{ReLU}\big(\mathbf{W}_{1}\,[\mathbf{h}^{r}_i;\ \mathbf{h}^{r}_j] + \mathbf{b}_{1}\big)$   (9)
$p^{r}_{ij} = \sigma\big(\mathbf{w}^{r\top}_{2}\,\mathbf{v}_{ij} + b^{r}_{2}\big)$   (10)

where $[\,\cdot\,;\,\cdot\,]$ is the concatenation operation, $\mathrm{ReLU}$ is the ReLU activation function, and $\sigma$ is the sigmoid activation function. $\mathbf{W}_{1}, \mathbf{b}_{1}, \mathbf{w}^{r}_{2}, b^{r}_{2}$ are learnable model parameters. Note that, different from Fu et al. (2019), we exploit a sigmoid activation function rather than a softmax function in Eq. (10). Since there may exist more than one relation between the same word pair $(w_i, w_j)$, the softmax formulation of Fu et al. (2019) cannot address this overlapping problem.
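A hedged PyTorch sketch of such a word-pair scorer (our own naming; the exact layer sizes are assumptions) is shown below. The key point is the per-relation sigmoid, which allows several relations to hold for the same word pair.

```python
import torch
import torch.nn as nn

class RelationScorer(nn.Module):
    """Sketch of the RE head: score every (head word, tail word, relation)
    combination and apply a sigmoid so that relations can overlap."""
    def __init__(self, feat_dim, hidden_dim, num_relations):
        super().__init__()
        self.pair = nn.Linear(2 * feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, num_relations)

    def forward(self, h_rel):
        # h_rel: (batch, n, feat_dim) relation-specific features
        n = h_rel.size(1)
        head = h_rel.unsqueeze(2).expand(-1, -1, n, -1)   # (batch, n, n, feat_dim)
        tail = h_rel.unsqueeze(1).expand(-1, n, -1, -1)   # (batch, n, n, feat_dim)
        pair = torch.relu(self.pair(torch.cat([head, tail], dim=-1)))
        # Sigmoid rather than softmax: several relations may hold for one pair.
        return torch.sigmoid(self.out(pair))              # (batch, n, n, num_relations)
```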

4.4 Input of Model

The whole model takes the embedding of the given sentence as input and further extracts the relation-specific and entity-specific features. The embedding matrix of a sentence can be formed by looking up each word in a pre-trained word embedding matrix. To further encode contextual information into the word embeddings, a BiLSTM can be trained over the pre-trained word embeddings of each sentence. Alternatively, we can utilize the commonly used pre-trained BERT model to obtain the sentence embedding from the sentence words. The parameters of the BiLSTM or BERT encoder are learned together with the rest of the model.

After obtaining the representation of the sentence, we feed it into two separate linear transformation modules to get the task-specific features $\mathbf{h}^{r}_i$ and $\mathbf{h}^{e}_i$ for each word. The relation-specific feature of word $w_i$ is extracted from its word embedding $\mathbf{x}_i$ according to the following transformation:

$\mathbf{h}^{r}_i = \mathrm{ReLU}\big(\mathbf{W}_{r}\,\mathbf{x}_i + \mathbf{b}_{r}\big)$   (11)

where $\mathrm{ReLU}$ is the ReLU activation function and $\mathbf{W}_{r}, \mathbf{b}_{r}$ are learnable model parameters. The entity-specific feature $\mathbf{h}^{e}_i$ of each word is extracted with a similar transformation from $\mathbf{x}_i$ using separate model parameters.
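The input pipeline can be sketched as follows (PyTorch; the layer sizes and names are our assumptions). A pre-trained BERT encoder could replace the embedding-plus-BiLSTM stack to produce the contextual representations.

```python
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    """Sketch of the input pipeline: contextual embeddings from a BiLSTM over
    (optionally pre-trained) word embeddings, then two task-specific projections."""
    def __init__(self, vocab_size, embed_dim, hidden_dim, feat_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)       # may be pre-trained
        self.bilstm = nn.LSTM(embed_dim, hidden_dim // 2,
                              bidirectional=True, batch_first=True)
        self.to_rel = nn.Linear(hidden_dim, feat_dim)           # relation-specific
        self.to_ent = nn.Linear(hidden_dim, feat_dim)           # entity-specific

    def forward(self, token_ids):
        # token_ids: (batch, n) word indices
        x, _ = self.bilstm(self.embed(token_ids))               # (batch, n, hidden_dim)
        h_rel = torch.relu(self.to_rel(x))                      # Eq. (11)-style projection
        h_ent = torch.relu(self.to_ent(x))
        return x, h_rel, h_ent
```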

4.5 Training Objective

The training loss of the whole RIN model comprises two parts: the relation extraction loss and the named entity recognition loss. Assume that for each word $w_i$, $\mathbf{y}^{e}_i$ is the one-hot ground-truth entity label and $\mathbf{p}^{e}_i$ is the predictive distribution over the five labels obtained from the NER module after the $K$-th round; the entity recognition loss on one word is the cross entropy between the true one-hot label and the predictive distribution:

$\mathcal{L}_{e}(w_i) = -\,\mathbf{y}^{e\top}_i \log \mathbf{p}^{e}_i$   (12)

Assume that for each relation triple $(w_i, r, w_j)$, $y^{r}_{ij}$ is the ground-truth label, taking value $1$ if the relation holds and $0$ otherwise, and $p^{r}_{ij}$ is the probability that the relation holds obtained from the RE module after the $K$-th round. Then the relation extraction loss on one relation triple is the binary cross entropy between the true label and the prediction:

$\mathcal{L}_{r}(w_i, r, w_j) = -\big[\,y^{r}_{ij}\log p^{r}_{ij} + (1 - y^{r}_{ij})\log(1 - p^{r}_{ij})\,\big]$   (13)

The total loss over all words and relation triples for all sentences is then calculated as follows.

$\mathcal{L} = \sum_{S}\Big(\sum_{w_i \in S} \mathcal{L}_{e}(w_i) + \sum_{w_i, w_j \in S}\sum_{r \in \mathcal{R}} \mathcal{L}_{r}(w_i, r, w_j)\Big)$   (14)

With a gradient-based algorithm, we minimize the total loss over all model parameters to achieve good performance on both the NER and RE tasks.
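A compact sketch of this objective in PyTorch (assuming the probability tensors produced by the sketches above; summation over all terms mirrors Eq. (14)) might look like the following.

```python
import torch
import torch.nn.functional as F

def rin_loss(ent_probs, ent_labels, rel_probs, rel_labels):
    """Sketch of the training objective: word-level cross entropy for NER plus
    binary cross entropy over all (word, word, relation) triples for RE."""
    # ent_probs:  (batch, n, 5) predicted label distributions after the last round
    # ent_labels: (batch, n) gold label indices
    # rel_probs:  (batch, n, n, num_relations) predicted probabilities
    # rel_labels: (batch, n, n, num_relations) gold 0/1 indicators
    ner_loss = F.nll_loss(torch.log(ent_probs + 1e-12).flatten(0, 1),
                          ent_labels.flatten(), reduction="sum")
    re_loss = F.binary_cross_entropy(rel_probs, rel_labels.float(),
                                     reduction="sum")
    return ner_loss + re_loss
```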

NYT WebNLG
Model Precision Recall F1 Precision Recall F1
NovelTagging Zheng et al. (2017) 62.4 31.7 42.0 52.5 19.3 28.3
MultiDecoder  Zeng et al. (2018) 61.0 56.6 58.7 37.7 36.4 37.1
GraphRel Fu et al. (2019) 63.9 60.0 61.9 44.7 41.1 42.9
SeqtoSeq+RL Zeng et al. (2019) 77.9 67.2 72.1 63.3 59.9 61.6
HBT Wei et al. (2019) 89.7 85.4 87.5 89.5 88.0 88.8
BiLSTM (d) 79.0 77.4 78.2 84.9 86.3 85.6
BiLSTM_s (d) 78.1 78.0 78.1 85.0 85.8 85.4
RIN (d, ) 81.0 82.0 81.5 86.7 86.0 86.3
RIN (d, ) 81.1 83.2 82.1 86.0 88.0 87.0
RIN (d, ) 82.2 82.6 82.4 86.1 87.6 86.8
RIN (BERT, ) 88.5 86.5 87.5 89.1 90.3 89.7
RIN (BERT, ) 88.4 87.1 87.8 90.0 90.3 90.1
Table 1: Performance comparison of different models on the benchmark datasets. Average results over 5 runs are reported. $K$ is the number of updating rounds. The best performance is bold-typed.

5 Experiment

In this section, we conduct experiments to evaluate our model on two public datasets, NYT Riedel et al. (2010) and WebNLG Gardent et al. (2017). The NYT dataset was originally produced by a distant supervision method and comes with a set of predefined relation types. The WebNLG dataset was created for natural language generation (NLG) tasks and adapted by Zeng et al. (2018) for the relational triple extraction task; it also contains a set of predefined relation classes. For a fair comparison, we directly use the preprocessed datasets provided by Zeng et al. (2018). For both datasets, we follow the evaluation setting used in previous works: an extracted relational triple (subject, relation, object) is regarded as correct only if the relation and the heads of both the subject and the object are correct. We report Precision, Recall and F1-score for all the compared models. The statistics of the datasets are summarized in Table 2; a small sketch of this triple-level scoring is given after the table.

Dataset Train Dev Test
NYT 56195 5000 5000
WebNLG 5019 500 703
Table 2: Distribution of splits on NYT and WebNLG
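For clarity, here is a small self-contained sketch of the triple-level scoring described above (plain Python; since the paper matches triples on the relation and the head words of the subject and object, the triples passed in are assumed to already be keyed that way).

```python
def triple_prf(gold_triples, pred_triples):
    """Micro precision, recall and F1 over extracted (subject, relation, object)
    triples, matched exactly."""
    gold, pred = set(gold_triples), set(pred_triples)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy usage with the first case from the case study (Table 3):
gold = [("Europe", "/location/location/contains", "Denmark"),
        ("Europe", "/location/location/contains", "Norway")]
pred = [("Europe", "/location/location/contains", "Denmark")]
print(triple_prf(gold, pred))   # (1.0, 0.5, 0.666...)
```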

5.1 Implementation Details

For a fair comparison with previous work, we use the pre-trained 100-dimensional word embeddings provided by Zeng et al. (2018), as well as part-of-speech (POS) embeddings. We concatenate the word and POS embeddings and learn a BiLSTM representation for each word. We randomly drop out a fraction of the neurons in the input layer. The model is trained with the same batch size on both datasets, and we use the Adam optimizer for all datasets. To compare with the SOTA model HBT Wei et al. (2019), which exploits the pre-trained BERT Devlin et al. (2019) model to initialize word embeddings, we follow their work and use the same pre-trained BERT model, [BERT-Base, Cased] 111https://storage.googleapis.com/bert_models/2018_10_18/cased_L-12_H-768_A-12.zip. In the BERT-initialized setting, the model is trained with dataset-specific batch sizes on NYT and WebNLG, again with the Adam optimizer. The code for our model is found on XXX.

5.2 Performance Comparison

We now show the results on the NYT and WebNLG datasets. As baselines, we include BiLSTM and BiLSTM_s. In BiLSTM, the encoder output is fed directly into the NER and RE modules for the final predictions. In BiLSTM_s, the encoder output is first fed into the two task-specific feature extractors, and their outputs are fed into the NER and RE modules for the final predictions. We also compare with several recent models, including the sequential model NovelTagging Zheng et al. (2017), the encoder-decoder based models MultiDecoder Zeng et al. (2018) and Seq2Seq+RL Zeng et al. (2019), the dependency based model GraphRel Fu et al. (2019), and the hierarchical binary tagging framework HBT Wei et al. (2019). The result of Seq2Seq+RL is taken from Zeng et al. (2019), and the other baseline results are taken from previous works.

Result Discussion (100d). Table 1 shows the performance of the different models. In the setting of 100-dimensional word embeddings, RIN consistently outperforms all previous models. Notably, even the baseline models BiLSTM and BiLSTM_s significantly surpass Seq2Seq+RL and MultiDecoder, revealing the superiority of word-pair based methods over encoder-decoder based methods. Note that GraphRel shows comparatively low F1 scores of 61.9 and 42.9 on the two datasets. As discussed above, by using a softmax function in the RE prediction, GraphRel cannot address the cases where more than one relation exists between two entities, which may be the main reason for its low performance. We also find that RIN significantly outperforms BiLSTM and BiLSTM_s on both datasets even with few updating rounds; this improvement demonstrates the effectiveness of the interactive updating mechanism used in RIN. With larger numbers of updating rounds, RIN achieves the best F1 performance on the WebNLG and NYT datasets, with F1 scores significantly improved over the smallest-round setting. The performance improvement achieved by increasing the number of rounds demonstrates the effectiveness of the recurrent structure in the model.

Result Discussion (BERT). In the BERT setting, the F1 performance of RIN is further improved and surpasses the non-BERT models by a large margin. With the smaller number of updating rounds, RIN exceeds the best non-BERT RIN by 5.1 F1 points on the NYT dataset and is competitive with the SOTA model HBT on both datasets. With the larger number of updating rounds, RIN surpasses HBT by 0.3 and 1.3 F1 points on the two datasets. These results show that incorporating BERT significantly improves the performance of RIN.

Figure 3: Curves of F1-score, Precision, and Recall for BiLSTM_s and RIN over different numbers of updating rounds.

5.3 Impact of Updating Rounds

In this section, we conduct experiments on the NYT and WebNLG datasets to show the performance of RIN for different numbers of updating rounds $K$. To evaluate the effectiveness of the GRU, we also present the performance of a variant using a vanilla RNN cell Hochreiter and Schmidhuber (1997). The results are shown in Figure 3.

It can be seen that both RIN (RNN) and RIN (GRU) significantly outperform BiLSTM_s in F1 once the interactive updating is applied. From the F1 curve of RIN (GRU) on the NYT dataset, we also find that as the number of updating rounds increases, the F1 performance increases to an extent; in particular, RIN (GRU) keeps improving over several rounds. This progressive improvement in F1 verifies our original assumption that performance is improved by the recurrent structure.

It can also be seen that RIN (GRU) consistently outperforms RIN (RNN) on both datasets, and the optimal round for RIN (GRU) comes later than that for RIN (RNN). As shown in the first subfigure, the F1 performance of RIN (RNN) only increases in the first two rounds, while the F1 performance of RIN (GRU) keeps increasing for five rounds. A similar phenomenon can also be found on WebNLG. Considering that the gating mechanism used in the GRU is designed to prevent long-term memories from being overwritten by short-term ones, RIN (GRU) is more adept than RIN (RNN) at leveraging historical updating information to adjust the updating process. From this perspective, RIN (GRU) is more expressive than RIN (RNN).

Case 1: A cult of victimology arose and was happily exploited by clever radicals among Europe’s Muslims, especially certain religious leaders like Imam Ahmad Abu Laban in Denmark and Mullah Krekar in Norway. Golden: Europe, Denmark, Norway
(Europe, /location/location/contains, Denmark)
(Europe, /location/location/contains, Norway)
BiLSTM_s: Europe, Denmark, Norway
(Europe, /location/location/contains, Denmark)
RIN: Europe, Denmark, Norway
(Europe, /location/location/contains, Denmark)
(Europe, /location/location/contains, Norway)
Case 2: Scott (No rating, 75 minutes) Engulfed by nightmares, blackouts and the anxieties of the age, a Texas woman flees homeland insecurity for a New York vision quest in this acute, resourceful and bracingly ambitious debut film. Golden: Scott, New York
(York, /location/location/contains, Scott)
BiLSTM_s: Texas, New York
(York, /location/location/contains, Scott)
RIN: Scott, New York
(York, /location/location/contains, Scott)
Table 3: Case study for RIN and BiLSTM_s. The entities and relational triples are marked by blue and orange.
Model RE NER
BiLSTM (d) 78.2 87.3
BiLSTM_s (d) 78.1 87.6
RIN (d, ) 81.5 90.1
RIN (d, ) 82.4 90.9
Table 4: F1 performance of NER and RE on the NYT dataset.

We also show the F1 performance of NER and RE on the NYT dataset. The results are presented in Table 4. From the table, we find that the performance of both NER and RE improves over BiLSTM and BiLSTM_s even with the smaller number of updating rounds, and improves further as the number of rounds increases. These results verify our argument that explicit interaction can enhance the performance on both sides.

5.4 Ablation Study

In this section, we perform an ablation study on RIN. The ablated models are: (1)-(3) RIN with one of the three inputs of the INT module removed, namely the sub-task relation predictions, the sub-task entity predictions, or the word embedding; (4) RIN with the INT module removed entirely, which is equivalent to BiLSTM_s; (5) RIN with the two GRUs replaced by two separate linear transformations for updating the relation-specific and entity-specific features. All ablated models are run with the 100-dimensional pre-trained word embeddings and a fixed number of updating rounds. The results are shown in Table 5.

Model NYT WebNLG
RIN (full) 81.5 86.3
RIN, ablation (1) 78.4 85.5
RIN, ablation (2) 80.9 85.7
RIN, ablation (3) 81.2 85.9
RIN, ablation (4) 78.1 85.4
RIN, ablation (5) 77.8 84.9
Table 5: F1 performance of the different ablated models, numbered as in the text.

We find that the performance of RIN deteriorates as critical components are removed. Specifically, the variant without the INT module, ablation (4), underperforms the full RIN on both datasets, confirming the importance of modeling the interaction between NER and RE. From the performance on the NYT dataset, we also find that ablations (1)-(3) underperform the full RIN while still outperforming ablation (4), indicating that all three kinds of input information play a role in learning the interaction feature. Notice that removing the word embedding causes only a marginal F1 drop compared with removing the relation or entity predictions, which suggests that the two sub-task predictions together play the pivotal role in providing "alignment" information for the interaction feature. From the performance of ablation (5), we observe that directly using two linear transformations to update the task features hurts the performance: the F1 score drops by 3.7 and 1.4 points on NYT and WebNLG compared with the full RIN. This observation confirms that the learned interaction feature plays the key role in refining the predictions.

5.5 Case Study

In this section, we conduct a case study from NYT comparing RIN and BiLSTM_s. From the first case in Table 3, we observe that BiLSTM_s misses the relational triple (Europe, /location/location/contains, Norway), while RIN extracts all the relational triples in the sentence. Although BiLSTM_s correctly extracts all the entities in the sentence, including Norway, it cannot leverage the prediction state of NER to refine its RE without interaction. In contrast, RIN captures this "alignment" information and correctly extracts the relational triple containing the entity Norway.

From the second case in Table 3, we observe that both RIN and BiLSTM_s correctly extract the relational triple (York, /location/location/contains, Scott). However, BiLSTM_s mistakenly identifies Texas as an entity, while RIN correctly extracts the entity Scott, which is involved in the relation /location/location/contains. This suggests that RIN is capable of leveraging the prediction state of RE to refine its NER and tends to extract as entities the words that participate in relational triples.

6 Conclusion

This paper studies the joint entity and relation extraction problem. Existing multi-task learning based models implicitly characterize the commonalities and differences between the two tasks via shared representations. We argue that an explicit interaction between these two tasks can improve the performance on both sides. In this study, we present a recurrent interaction network to capture the intrinsic connection between the two sub-tasks. Specifically, features that represent the interaction between NER and RE are encoded into a distributed representation, and a recurrent module progressively accumulates these dependencies. Empirical studies on two publicly available datasets confirm the effectiveness of the presented model.

References

  • Y. S. Chan and D. Roth (2011) Exploiting syntactico-semantic structures for relation extraction. In The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference, 19-24 June, 2011, Portland, Oregon, USA, pp. 551–560. External Links: Link Cited by: §1.
  • R. Collobert and J. Weston (2008) A unified architecture for natural language processing: deep neural networks with multitask learning. In Machine Learning, Proceedings of the Twenty-Fifth International Conference (ICML 2008), Helsinki, Finland, June 5-9, 2008, pp. 160–167. External Links: Link, Document Cited by: §1.
  • Y. Deng, Y. Xie, Y. Li, M. Yang, N. Du, W. Fan, K. Lei, and Y. Shen (2019) Multi-task learning with multi-view attention for answer selection and knowledge base question answering. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019, pp. 6318–6325. External Links: Link, Document Cited by: §1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pp. 4171–4186. External Links: Link Cited by: §1, §5.1, §5.2.
  • T. Fu, P. Li, and W. Ma (2019) GraphRel: modeling text as relational graphs for joint entity and relation extraction. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pp. 1409–1418. External Links: Link Cited by: §1, §2, §4.3, Table 1, §5.2.
  • C. Gardent, A. Shimorina, S. Narayan, and L. Perez-Beltrachini (2017) Creating training corpora for NLG micro-planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pp. 179–188. External Links: Link, Document Cited by: §5.
  • Z. Guo, Y. Zhang, and W. Lu (2019) Attention guided graph convolutional networks for relation extraction. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pp. 241–251. External Links: Link Cited by: §2.
  • R. He, W. S. Lee, H. T. Ng, and D. Dahlmeier (2019) An interactive multi-task learning network for end-to-end aspect-based sentiment analysis. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pp. 504–515. External Links: Link Cited by: §2.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §5.3.
  • A. Komninos and S. Manandhar (2017) Feature-rich networks for knowledge base completion. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 2: Short Papers, pp. 324–329. External Links: Link, Document Cited by: §1.
  • M. Miwa and M. Bansal (2016) End-to-end relation extraction using lstms on sequences and tree structures. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers, External Links: Link Cited by: §1, §2.
  • D. Nathani, J. Chauhan, C. Sharma, and M. Kaul (2019) Learning attention-based embeddings for relation prediction in knowledge graphs. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pp. 4710–4723. External Links: Link Cited by: §1.
  • S. Riedel, L. Yao, and A. McCallum (2010) Modeling relations and their mentions without labeled text. In Machine Learning and Knowledge Discovery in Databases, European Conference, ECML PKDD 2010, Barcelona, Spain, September 20-24, 2010, Proceedings, Part III, pp. 148–163. External Links: Link, Document Cited by: §5.
  • Y. Shen and X. Huang (2016) Attention-based convolutional neural network for semantic relation extraction. In COLING 2016, 26th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, December 11-16, 2016, Osaka, Japan, pp. 2526–2536. External Links: Link Cited by: §2.
  • L. Wang, Z. Cao, G. de Melo, and Z. Liu (2016) Relation classification via multi-level attention cnns. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers, External Links: Link Cited by: §2.
  • Z. Wei, J. Su, Y. Wang, Y. Tian, and Y. Chang (2019) A novel hierarchical binary tagging framework for joint extraction of entities and relations. CoRR abs/1909.03227. External Links: Link, 1909.03227 Cited by: §2, Table 1, §5.1, §5.2.
  • Y. Xu, L. Mou, G. Li, Y. Chen, H. Peng, and Z. Jin (2015) Classifying relations via long short term memory networks along shortest dependency paths. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pp. 1785–1794. External Links: Link Cited by: §2.
  • D. Zelenko, C. Aone, and A. Richardella (2003) Kernel methods for relation extraction. J. Mach. Learn. Res. 3, pp. 1083–1106. External Links: Link Cited by: §1.
  • X. Zeng, S. He, D. Zeng, K. Liu, S. Liu, and J. Zhao (2019) Learning the extraction order of multiple relational facts in a sentence with reinforcement learning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 367–377. External Links: Link, Document Cited by: Table 1, §5.2.
  • X. Zeng, D. Zeng, S. He, K. Liu, and J. Zhao (2018) Extracting relational facts by an end-to-end neural model with copy mechanism. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pp. 506–514. External Links: Link, Document Cited by: §1, Table 1, §5.1, §5.2, §5.
  • S. Zhang, D. Zheng, X. Hu, and M. Yang (2015) Bidirectional long short-term memory networks for relation classification. In Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation, PACLIC 29, Shanghai, China, October 30 - November 1, 2015, External Links: Link Cited by: §2.
  • Y. Zhang, P. Qi, and C. D. Manning (2018) Graph convolution over pruned dependency trees improves relation extraction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pp. 2205–2215. External Links: Link Cited by: §2, §2.
  • S. Zheng, F. Wang, H. Bao, Y. Hao, P. Zhou, and B. Xu (2017) Joint extraction of entities and relations based on a novel tagging scheme. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pp. 1227–1236. External Links: Link, Document Cited by: §2, Table 1, §5.2.
  • G. Zhou, J. Su, J. Zhang, and M. Zhang (2005) Exploring various knowledge in relation extraction. In ACL 2005, 43rd Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, 25-30 June 2005, University of Michigan, USA, pp. 427–434. External Links: Link Cited by: §1.