Relation extraction (RE) aims to find the semantic relation between a pair of entity mentions in text. It plays a key role in many important NLP problems, such as automatic knowledge base completion Surdeanu et al. (2012); Riedel et al. (2013); Verga et al. (2016) and question answering Yih et al. (2015); Yu et al. (2017).
One particular type of relation extraction task is multi-relation extraction (MRE), which aims to recognize the relations of multiple pairs of entity mentions in an input paragraph. This task has important practical implications, since a paragraph commonly contains multiple pairs of entities. Existing approaches to MRE Qu et al. (2014); Gormley et al. (2015); Nguyen and Grishman (2015) adopt methods designed for single-relation extraction (SRE). These works treat each pair of entity mentions as an independent instance and rely on features describing each word's relation to the two entities as additional structural information. This type of method has a major drawback: it must encode the same paragraph multiple times to predict the relations between different entity pairs (multi-pass). These multiple passes over a single paragraph are computationally expensive and can become prohibitive, especially when the paragraph is encoded with deep models.
In this paper, we tackle the inefficient multi-pass issue of existing MRE solutions. We use a pre-trained general-purpose language encoder, specifically the Bidirectional Encoder Representations from Transformers (BERT) Devlin et al. (2018), to encode the input paragraph only once (one-pass) while also achieving a new state-of-the-art accuracy. The key idea of our approach is to enable the representation of a paragraph to distinguish relation mentions associated with different entity mentions, i.e., the paragraph representation should be entity-aware. However, pre-trained encoders are not entity-aware by themselves, since they take only raw text as input. We therefore propose and evaluate two novel designs to solve this problem: (1) we introduce a structured prediction layer for predicting multiple relations for different entity pairs; (2) we make the self-attention layers aware of the positions of all entities in the input paragraph.
With this model design, our approach can accurately predict multiple relations for all entity pairs in a single pass over the input paragraph. Our experiments confirm that our approach achieves state-of-the-art results on the ACE 2005 benchmark. To the best of our knowledge, this work is the first approach that solves the MRE task both accurately (achieving state-of-the-art results) and fast (in one pass).
2.1 Multi-Relation Extraction
Multi-relation extraction (MRE) is an important information extraction task and can be applied to several real-world problems. Several benchmarks have been constructed for evaluating MRE, such as ACE Walker et al. (2006) and ERE Linguistic Data Consortium (2013).
In an MRE task, the input is a text paragraph containing $N$ words $x_1, \dots, x_N$ and $M$ entity mentions $e_1, \dots, e_M$. Different mentions $e_i$ and $e_j$ can be the same type of entity or two different entities. The task is to predict a relation $r_{ij}$ between each pair $(e_i, e_j)$. (Footnote 1: Since the arguments in MRE are always entity mentions, we use the terms entity mention and entity interchangeably in this paper.) $r_{ij}$ is either from a list of pre-defined relations or is a special class NA, meaning that no relation exists between $e_i$ and $e_j$.
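As a concrete illustration of the task's input and output, the sketch below represents a paragraph as a token list and entity mentions as token spans, then enumerates the mention pairs that each need a label. The names `Mention` and `candidate_pairs`, the toy sentence, and the choice to enumerate unordered pairs are assumptions made for illustration; they are not taken from the ACE or ERE tooling.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Mention:
    """An entity mention as a token span [start, end) in the paragraph."""
    start: int
    end: int


def candidate_pairs(mentions):
    """Enumerate index pairs (i, j), i < j, of mentions to classify.

    Assumption: unordered pairs; a directed setup would enumerate both orders.
    """
    return list(combinations(range(len(mentions)), 2))


# Toy paragraph (hypothetical, not from any benchmark).
tokens = ["Alice", "works", "for", "Acme", "Corp", "in", "Paris"]
mentions = [Mention(0, 1), Mention(3, 5), Mention(6, 7)]

pairs = candidate_pairs(mentions)
# Every pair starts as NA (no relation) until a model predicts otherwise.
labels = {pair: "NA" for pair in pairs}

for i, j in pairs:
    text_i = " ".join(tokens[mentions[i].start:mentions[i].end])
    text_j = " ".join(tokens[mentions[j].start:mentions[j].end])
    print(f"({text_i}, {text_j}) -> {labels[(i, j)]}")
```

Under the same unordered-pair assumption, $M$ mentions yield $M(M-1)/2$ pairs, which is also the number of encoder passes a multi-pass system would need for that paragraph.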
Many approaches have been proposed to solve this challenging problem. They focus either on feature and model architecture selection Gormley et al. (2015); Nguyen and Grishman (2015), or on domain adaptation of MRE models Fu et al. (2017); Shi et al. (2018). However, these approaches require multiple passes over the paragraph, as they treat MRE as repeated applications of a single-relation extraction (SRE) model. To the best of our knowledge, our work is the first to investigate solving MRE in a single pass over the paragraph while also achieving state-of-the-art performance.
It is worth noting that our task is different from the one in Verga et al. (2018), who focus on one-pass extraction of a single relation between a pair of entities, each of which could have multiple mentions. By comparison, we focus on one-pass extraction of multiple relations between different pairs of entity mentions. Different mention pairs of the same entity pair could have different relations. Their method cannot distinguish relation mentions between different pairs of entity mentions and therefore does not fulfill the goal of MRE, as confirmed by our experiments.
2.2 Pre-Trained Language Encoders
Recently, pre-trained general-purpose language encoders, such as CoVe McCann et al. (2017), ELMo Peters et al. (2018), GPT Radford et al. (2018) and BERT Devlin et al. (2018), have generated a lot of interest in the NLP community. These approaches benefit from training on huge unlabeled text corpora to learn generalizable text embeddings.
Our approach builds on top of the representations from the pre-trained deep bidirectional transformers (BERT). We first briefly describe BERT and then (in Section 3) show how our approach uses it. Transformers in BERT consist of multiple layers, each of which implements a self-attention sub-layer with multiple attention heads. Each output of a self-attention sub-layer, $z_i$, is computed as the weighted sum of linearly transformed input elements $h_j$, which are the outputs of the previous layer:

$$z_i = \sum_{j=1}^{N} \alpha_{ij}\,(h_j W^V), \qquad \alpha_{ij} = \frac{\exp s_{ij}}{\sum_{k=1}^{N} \exp s_{ik}} \tag{1}$$

The self-attention scores $s_{ij}$ are computed by comparing each pair of the input elements:

$$s_{ij} = \frac{(h_i W^Q)(h_j W^K)^\top}{\sqrt{d_z}} \tag{2}$$

where $d_z$ is the dimension of the output from the self-attention sub-layer, and $W^Q$, $W^K$, $W^V$ are parameters of the model. BERT is pre-trained on both word-level and sentence-level language modeling tasks. As a result, it has achieved substantial improvements when fine-tuned on benchmarks with smaller labeled data, e.g., textual entailment, reading comprehension and many text classification tasks.
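A minimal NumPy sketch of the single-head computation described above follows. The variable names (`h` for the layer inputs, `w_q`, `w_k`, `w_v` for the projection parameters) and the random toy inputs are assumptions for illustration; multi-head splitting, biases, and the rest of the BERT layer are omitted.

```python
import numpy as np


def self_attention(h, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    h:   (N, d_in) outputs of the previous layer, one row per token.
    w_*: (d_in, d_z) projection matrices.
    Returns the outputs z (N, d_z) and the attention weights (N, N).
    """
    queries, keys, values = h @ w_q, h @ w_k, h @ w_v
    d_z = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d_z)          # pairwise comparison
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    alpha = np.exp(scores)
    alpha = alpha / alpha.sum(axis=-1, keepdims=True)  # softmax over j
    return alpha @ values, alpha                       # weighted sum of values


rng = np.random.default_rng(0)
N, d_in, d_z = 5, 8, 4
h = rng.normal(size=(N, d_in))
w_q, w_k, w_v = (rng.normal(size=(d_in, d_z)) for _ in range(3))

z, alpha = self_attention(h, w_q, w_k, w_v)
print(z.shape, alpha.shape)
```

Each row of `alpha` sums to one, so every output row is a convex combination of the projected inputs.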
Despite the above successes, there are limited studies on applying these encoders to tasks with structured inputs. This is mainly because the models are pre-trained to optimize the language modeling objectives with raw texts as inputs. However, in some tasks, such as the MRE task, the structural information is crucial. In MRE, the same paragraph may represent different relations for different entity pairs. A model will output the same results for a paragraph if the target entities are ignored during encoding. Therefore, in this work, we take the MRE task as an example to study how to better infuse structural information into the pre-trained encoders.
Another problem of the pre-trained encoders is that they are deep models with large hidden states and thus run much slower than previous networks. Their usage therefore gives rise to an obvious trade-off between accuracy and speed. The problem becomes more serious in the standard MRE pipeline, which re-encodes the same paragraph for every entity pair. As a result, it is crucial in practice for an MRE model to predict multiple relations for all entity pairs with only one pass of paragraph encoding.
3 Proposed Approach
In this section, we describe our MRE solution, which is built on top of the pre-trained language encoder BERT. As shown in the method descriptions and experiments, the transformer architecture used in BERT makes it easier to deeply infuse structural input information into the encoding stage (instead of the shallow infusion as additional input features in previous work) to better solve the MRE task. (Footnote 2: Other transformer-based pre-trained encoders, e.g., Radford et al. (2018), could also be used to solve this problem. A comparison among different pre-trained transformers is out of the scope of this paper.)
Departing from the standard BERT structure, we first add a structured prediction layer to enable MRE with only one-pass encoding of the input (Section 3.1). Second, we introduce an entity-aware self-attention mechanism (Section 3.2) for the infusion of relational information with regard to multiple entities at each hidden state. The overall proposed framework is illustrated in Figure 1.
3.1 Structured Prediction of Multi-Relations with BERT
The BERT model has been applied to classification, sequence labeling and text span extraction (for reading comprehension) Devlin et al. (2018). However, the types of final prediction layers used in these tasks do not fit the structured prediction nature of the MRE task. Note that in MRE, we essentially perform edge prediction over a graph with entities as nodes. Inspired by Dozat and Manning (2018); Ahmad et al. (2018), we propose a simple yet efficient approach for multi-relation prediction. After the input paragraph has been encoded by BERT, we take the representations of entity mentions from the final BERT layer. Since one entity mention may contain multiple words and BERT uses sub-word (WordPiece) tokenization, one mention usually corresponds to multiple hidden states. Therefore, we perform average-pooling over the hidden states of all tokens in a mention to get a single representation for each entity mention. Finally, for a pair of entity mentions $(e_i, e_j)$, we denote their representation vectors as $o_i$ and $o_j$.
We concatenate $o_i$ and $o_j$ and pass the result to a linear classification layer to predict the relation between $e_i$ and $e_j$:

$$P(r_{ij} \mid x, e_i, e_j) = \mathrm{softmax}\big([o_i ; o_j]\, W^L + b\big)$$

where $x$ denotes the input paragraph, $W^L \in \mathbb{R}^{2 d_z \times l}$, $d_z$ is the dimension of the BERT embedding at each token position, and $l$ is the number of relation labels. (Footnote 3: We also tried to use MLP and Biaff instead of the linear layer for the classification, which perform worse, as shown in the experiments. We assume that this is because the embeddings learned by BERT are powerful enough for linear classifiers.) Figure 1 illustrates the above structured prediction strategy. For different pairs of entities, e.g., (Iraqi, artillery) and (southern suburbs, Baghdad), different final-layer embeddings are used for relation prediction. For entities containing multiple words (southern suburbs) or tokenized into multiple sub-tokens (Baghdad), average-pooling is applied to get the mention embeddings.
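The pooling and classification steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: `hidden` is a random stand-in for the final-layer BERT output, the spans and sizes are toy values, and the function names `mention_embedding` and `relation_probs` are hypothetical.

```python
import numpy as np


def mention_embedding(hidden, span):
    """Average-pool the final-layer states over the sub-tokens of a mention."""
    start, end = span
    return hidden[start:end].mean(axis=0)


def relation_probs(hidden, span_i, span_j, weight, bias):
    """Softmax over relation labels for one mention pair."""
    pair = np.concatenate([mention_embedding(hidden, span_i),
                           mention_embedding(hidden, span_j)])
    logits = pair @ weight + bias
    logits = logits - logits.max()  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()


rng = np.random.default_rng(0)
n_tokens, d, n_labels = 10, 6, 4
hidden = rng.normal(size=(n_tokens, d))   # stand-in for one BERT encoding pass
weight = rng.normal(size=(2 * d, n_labels))
bias = np.zeros(n_labels)

spans = [(0, 1), (4, 6), (7, 10)]         # token spans [start, end) of mentions
pairs = [(0, 1), (0, 2), (1, 2)]          # mention pairs to classify

# All pairs reuse the same `hidden`, i.e. the paragraph is encoded only once.
pair_probs = {(i, j): relation_probs(hidden, spans[i], spans[j], weight, bias)
              for i, j in pairs}
for pair, probs in pair_probs.items():
    print(pair, int(probs.argmax()))
```

Only the cheap linear layer runs once per pair; the expensive encoder output is shared across all pairs.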
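The training objective for an input paragraph is defined as the sum of the log-likelihoods of all the entity pairs whose relation labels are required for prediction, denoted as $S$:

$$\mathcal{L} = \sum_{(e_i, e_j) \in S} \log P(r_{ij} \mid x, e_i, e_j)$$

where $P(r_{ij} \mid x, e_i, e_j)$ is the predicted probability of the gold relation $r_{ij}$ for the pair $(e_i, e_j)$ in paragraph $x$.
During prediction, all relations between the designated entity pairs in a single paragraph are extracted with one pass of the BERT encoding process.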
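As a small worked example of this objective, the sketch below sums the log-probabilities of the gold labels over the pairs that require prediction. The probabilities and gold labels are made-up toy values, and `paragraph_log_likelihood` is a hypothetical helper name.

```python
import math


def paragraph_log_likelihood(pair_probs, gold_labels):
    """Sum of log P(gold label) over the pairs that require prediction."""
    return sum(math.log(pair_probs[pair][label])
               for pair, label in gold_labels.items())


# Toy predicted distributions over three labels for two mention pairs.
pair_probs = {
    (0, 1): [0.7, 0.2, 0.1],
    (1, 2): [0.1, 0.1, 0.8],
}
gold_labels = {(0, 1): 0, (1, 2): 2}  # index of the gold label per pair

log_likelihood = paragraph_log_likelihood(pair_probs, gold_labels)
loss = -log_likelihood  # minimized during fine-tuning
print(round(loss, 4))
```

Here the log-likelihood equals $\log 0.7 + \log 0.8 = \log 0.56$, so both pairs contribute to a single paragraph-level loss.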
3.2 Entity-Aware Self-Attention based on Relative Distance
This section describes an approach to encode multi-relation information into the hidden states of the model. By highlighting the entity positions in the attention layer, the hidden state of each token focuses more on how the token interacts with all the entities, and the hidden state of each entity is encouraged to capture features indicating multiple relations. As a result, the encoder is aware of all the entity mentions in the paragraph and thus can capture relational information associated with all the mentions.
The key idea of our approach is to use the relative distances between a single word and all the entities to guide the attention computation. Following Shaw et al. (2018), we achieve this with the following formulation. For each pair of word tokens $(x_i, x_j)$ with the input representations from the previous layer $h_i$ and $h_j$, if there is a distance category $d(i,j)$ (defined later), we extend the computation of self-attention in Eq. 1 and 2 as:

$$z_i = \sum_{j=1}^{N} \alpha_{ij}\,(h_j W^V + a^V_{ij}) \tag{3}$$

$$s_{ij} = \frac{(h_i W^Q)(h_j W^K + a^K_{ij})^\top}{\sqrt{d_z}} \tag{4}$$

where $\alpha_{ij}$ is the softmax of the scores $s_{ij}$ over $j$. Here $a^V_{ij}$ and $a^K_{ij}$ can be viewed as (two different sets of) embeddings of the category $d(i,j)$. During fine-tuning of BERT models, these category embeddings are trained from scratch, together with the pre-trained BERT parameters. An additional benefit of our method is that the category embeddings can capture important task-specific information, so that the pre-trained BERT parameters do not change dramatically. (Footnote 4: This is suggested by our initial experiments: when sub-sampling 10% of the training data, the performance drop of the entity-aware self-attention approach is smaller than that of using entity indicators on the input layer.)
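Assuming the extension adds the category embeddings to the key and value terms in the style of Shaw et al. (2018), the entity-aware attention can be sketched as follows. The dense `(N, N, d_z)` arrays `a_k` and `a_v`, the function names, and the random toy inputs are illustrative assumptions; a real implementation would look the embeddings up from a small table indexed by category.

```python
import numpy as np


def softmax_rows(scores):
    """Row-wise softmax with the usual max-subtraction for stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)


def entity_aware_attention(h, w_q, w_k, w_v, a_k, a_v):
    """Self-attention with per-pair additive key/value embeddings.

    h: (N, d_in); w_*: (d_in, d_z); a_k, a_v: (N, N, d_z), zero wherever
    a token pair has no distance category.
    """
    queries = h @ w_q                        # (N, d_z)
    keys = (h @ w_k)[None, :, :] + a_k       # (N, N, d_z): key j as seen from i
    values = (h @ w_v)[None, :, :] + a_v     # (N, N, d_z)
    d_z = queries.shape[-1]
    scores = np.einsum("id,ijd->ij", queries, keys) / np.sqrt(d_z)
    alpha = softmax_rows(scores)
    z = np.einsum("ij,ijd->id", alpha, values)
    return z, alpha


rng = np.random.default_rng(0)
N, d_in, d_z = 5, 8, 4
h = rng.normal(size=(N, d_in))
w_q, w_k, w_v = (rng.normal(size=(d_in, d_z)) for _ in range(3))

# With all-zero category embeddings the layer reduces to standard attention.
zeros = np.zeros((N, N, d_z))
z_plain, alpha_plain = entity_aware_attention(h, w_q, w_k, w_v, zeros, zeros)

# Random embeddings stand in for trained category embeddings.
a_k = rng.normal(size=(N, N, d_z))
a_v = rng.normal(size=(N, N, d_z))
z_aware, alpha_aware = entity_aware_attention(h, w_q, w_k, w_v, a_k, a_v)
print(z_plain.shape, z_aware.shape)
```

The zero-embedding case is the reason token pairs without a category behave exactly as in the unmodified encoder.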
Specifically, we define the relative distance category $d(i,j)$ for a pair of tokens $x_i$ and $x_j$ as follows:
- If either $x_i$ or $x_j$ is inside an entity, we define the category as the relative distance between positions $i$ and $j$, clipped to the range $[-k, k]$. The clipping maps all distances larger than $k$ to $k$, and those smaller than $-k$ to $-k$; $k$ is a hyperparameter to be tuned on the development set.
- If neither $x_i$ nor $x_j$ is inside an entity mention, we explicitly assign a zero vector to the category embeddings $a^K_{ij}$ and $a^V_{ij}$. In this situation, the computation of self-attention scores is the same as in the standard BERT self-attention.
- When both $x_i$ and $x_j$ are inside entities, we define $d(i,j)$ as in Eq. 5 (the row-wise definition). This follows the intuition that the row-wise attention plays a more important role in encoding entity information, as the attention and states computed in Eq. 3 and Eq. 4 are summed column-wise.
Figure 2 illustrates the resulting category embeddings for all token pairs.
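The category assignment can be sketched as below. This is a simplified illustration, not the exact definition: it assumes the signed distance is `i - j`, it does not distinguish the row-wise and column-wise cases mentioned above, and the names `clip`, `distance_category` and `embedding_index` as well as the toy entity mask are hypothetical.

```python
def clip(value, k):
    """Map values above k to k and values below -k to -k."""
    return max(-k, min(k, value))


def distance_category(i, j, in_entity, k):
    """Clipped relative distance if token i or j is inside an entity.

    Returns None when neither token is inside an entity; that case uses a
    zero vector, i.e. standard self-attention.
    Assumption: the signed distance is taken as i - j.
    """
    if in_entity[i] or in_entity[j]:
        return clip(i - j, k)
    return None


def embedding_index(category, k):
    """Shift a category in [-k, k] to a table index in [0, 2k]."""
    return category + k


# Toy paragraph of six tokens with one two-token entity at positions 2-3.
in_entity = [False, False, True, True, False, False]
k = 2
n = len(in_entity)
categories = [[distance_category(i, j, in_entity, k) for j in range(n)]
              for i in range(n)]

for row in categories:
    print(row)
```

With clipping threshold `k`, each of the two embedding sets needs at most `2k + 1` vectors under this simplified scheme.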
Table 1: Results on ACE 2005.

| Method | dev | bc | cts | wl | avg |
| --- | --- | --- | --- | --- | --- |
| Baselines w/o Domain Adaptation (Single-Relation per Pass) | | | | | |
| Hybrid FCM Gormley et al. (2015) | - | 63.48 | 56.12 | 55.17 | 58.26 |
| Best published results w/o DA (from Fu et al.) | - | 64.44 | 54.58 | 57.02 | 58.68 |
| BERT fine-tuning out-of-box | 3.66 | 5.56 | 5.53 | 1.67 | 4.25 |
| Baselines w/ Domain Adaptation (Single-Relation per Pass) | | | | | |
| Domain Adversarial Network Fu et al. (2017) | - | 65.16 | 55.55 | 57.19 | 59.30 |
| Genre Separation Network Shi et al. (2018) | - | 66.38 | 57.92 | 56.84 | 60.38 |
| Multi-Relation per Pass | | | | | |
| BERT (our model in §3.1) | 64.42 | 67.09 | 53.20 | 52.73 | 57.67 |
| Entity-Aware BERT (our full model) | 67.46 | 69.25 | 61.70 | 58.48 | 63.14 |
| BERT w/ entity-indicator on input-layer | 65.32 | 66.86 | 57.65 | 53.56 | 59.36 |
| BERT w/ pos-emb on final att-layer | 67.23 | 69.13 | 58.68 | 55.04 | 60.95 |
| Single-Relation per Pass | | | | | |
| BERT (our model in §3.1) | 65.13 | 66.95 | 55.43 | 54.39 | 58.92 |
| Entity-Aware BERT (our full model) | 68.90 | 68.52 | 63.71 | 57.20 | 63.14 |
| BERT w/ entity-indicator on input-layer | 67.12 | 69.76 | 58.05 | 56.27 | 61.36 |
This section evaluates our proposed method on the popular MRE benchmark, ACE 2005. We also report results on the commonly used single-relation benchmark SemEval 2010 task 8 Hendrickx et al. (2009), where only one relation is required to be predicted from each paragraph.
Following previous work Gormley et al. (2015), we adopt the multi-domain setting and use the English portion of the ACE 2005 corpus Walker et al. (2006). We train on the union of news domain (nw and bn), tune hyperparameters on half of the broadcast conversation (bc) domain, and evaluate on the remainder of broadcast conversation (bc), the Telephone Speech (cts), Usenet Newsgroups (un), and Weblogs (wl) domains. For the SemEval data, we use the standard data split.
We compare with the published results from previous works that predict a single relation per pass Gormley et al. (2015); Nguyen and Grishman (2015); Fu et al. (2017); Shi et al. (2018), as well as the following modifications of BERT that could achieve MRE in one-pass.
- BERT with structured prediction only, i.e., our method in Section 3.1.
- BERT w/ position embedding on the final attention layer. This is a more straightforward way to achieve MRE in one pass, derived from previous works using position embeddings Nguyen and Grishman (2015); Fu et al. (2017); Shi et al. (2018). In this method, the BERT model encodes the paragraph up to the last attention layer. Then, for each entity pair, it takes the hidden states, adds the relative position embeddings corresponding to the target entities, and makes the relation prediction for this pair. For an input paragraph, most of the BERT layers run only once, so we consider it an MRE-in-one-pass method.
- BERT w/ entity indicators on input layer, which replaces our structured attention layer and directly adds indicators of entities (transformed to embeddings) to each token's word embedding. (Footnote 5: Note that relative position embeddings do not work for one-pass MRE, since each word corresponds to a varying number of position embedding vectors, and summing up the vectors confuses the information. They work for the single-relation per pass setting, but the performance lags behind using only indicators of the two target entities.) This method is an extension of Verga et al. (2018) to the MRE scenario.
4.2 Results on ACE 05
Table 1 gives the overall results on ACE 2005. The first observation is that our model architecture achieves much better results compared to the previous state-of-the-art methods, both without and with domain adaptation techniques. Note that although our method was not designed for domain adaptation, it outperforms such methods which further demonstrates its effectiveness.
Among all the BERT-based approaches, fine-tuning the out-of-box BERT does not give reasonable results, because the sentence embeddings cannot distinguish different entity pairs. In contrast, our BERT with structured prediction (Section 3.1) successfully adapts the pre-trained BERT to the MRE task and achieves results comparable to the state-of-the-art (w/o DA). Our structured fine-tuning of attention layers brings a further improvement of about 5.5% absolute in the MRE one-pass setting. It also yields the largest improvement over the structured-prediction BERT among the three MRE-in-one-pass variants.
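The size of this improvement can be checked directly from Table 1. The short sketch below recomputes the average over the three test columns for the two multi-relation rows and takes their difference; the dictionary keys are shorthand labels chosen here, and the numbers are copied from the table.

```python
# Test-domain scores (the three columns between dev and avg in Table 1),
# multi-relation per pass setting.
test_scores = {
    "BERT (Sec. 3.1)": [67.09, 53.20, 52.73],
    "Entity-Aware BERT": [69.25, 61.70, 58.48],
}

# The avg column is the mean over the three test domains.
averages = {name: round(sum(scores) / len(scores), 2)
            for name, scores in test_scores.items()}

gain = round(averages["Entity-Aware BERT"] - averages["BERT (Sec. 3.1)"], 2)
print(averages, gain)
```

The recomputed averages match the reported avg column (57.67 and 63.14), and their difference of 5.47 is the "about 5.5%" quoted in the text.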
Performance Gap between MRE in One-Pass and Multi-Pass
Usually, the MRE-in-one-pass models can also be trained and tested with one entity pair per pass (Single-Relation per Pass results in Table 1). Therefore, we compare the same methods when applied to the multi-relation and single-relation settings. BERT w/ entity indicators on inputs is expected to perform slightly better in the single-relation setting, because in the multi-relation setting the indicators mix information from multiple pairs. A two percent gap is observed, as expected. By comparison, our full model has a much smaller performance gap between the two settings (and no consistent performance drop across domains).
The BERT with structured prediction only is not expected to have a gap, yet one appears in the table. We hypothesize that it might be a random result caused by differences in training objectives (and epochs). For BERT w/ position embeddings on the final attention layer, we trained the model in the single-relation setting and tested it in the two different settings, so the results are the same.
Training and Inference Time Analysis
Table 2 shows the running time of different models in three aspects: the training time to achieve the best model in Table 1, the total testing time on the dev set, and the number of relations predicted per second on the dev set. Our approach is significantly faster than all the other methods. It is also much faster than the baseline MRE-in-one-pass approach, BERT with position embeddings on the final attention layer, because this baseline runs the last layer once for each entity pair.
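Table 2: Running time of different models.

| Method | Training time | Testing time (dev) | Relations predicted per second (dev) |
| --- | --- | --- | --- |
| Our multi-relation BERT | 134 | 63 | 126 |
| BERT + pos-emb on last-layer | 401 | 105 | 76 |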
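To put the two rows above in relative terms, the sketch below computes the ratios between them. It assumes the three values per row are, in order, training time, testing time and relations predicted per second, following the order given in the text; the units are not stated here, so only ratios are computed.

```python
# Values copied from Table 2: (training time, testing time, relations/second).
ours = {"train": 134, "test": 63, "rel_per_sec": 126}
pos_emb_baseline = {"train": 401, "test": 105, "rel_per_sec": 76}

# Time ratios: how many times longer the baseline takes.
train_ratio = pos_emb_baseline["train"] / ours["train"]
test_ratio = pos_emb_baseline["test"] / ours["test"]
# Throughput ratio: how many times more relations per second we predict.
throughput_ratio = ours["rel_per_sec"] / pos_emb_baseline["rel_per_sec"]

print(round(train_ratio, 2), round(test_ratio, 2), round(throughput_ratio, 2))
```

Under that reading, the baseline needs roughly three times the training time, and the one-pass model predicts about 1.66 times as many relations per second.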
Prediction Module Selection
Finally, Table 3 evaluates the usage of different prediction layers, including replacing our linear layer with MLP or Biaff. The results show that the linear predictor gives the best average result (63.14, versus 61.47 for MLP and 62.25 for Biaff). This is consistent with the motivation of pre-trained encoders: after unsupervised pre-training, the encoders are expected to be sufficiently powerful, so adding more complex layers on top does not improve the capacity but leads to more free parameters and a higher risk of over-fitting.
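Table 3: Results with different prediction layers.

| Method | dev | bc | cts | wl | avg |
| --- | --- | --- | --- | --- | --- |
| Our full model | 67.46 | 69.25 | 61.70 | 58.48 | 63.14 |
| replacing linear with MLP | 67.16 | 68.52 | 61.16 | 54.72 | 61.47 |
| replacing linear with biaff | 67.57 | 69.24 | 60.91 | 56.60 | 62.25 |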
4.3 Additional Results on SemEval
| Best published result Wang et al. (2016) | 88.0 |
We conduct additional experiments on the commonly used relation classification task, SemEval 2010 Task 8, in order to compare with models developed on this benchmark. From the results in Table 4, our proposed techniques also help to outperform the state-of-the-art on this single-relation benchmark.
Because this is a single-relation task, the out-of-box BERT itself can achieve a reasonable result after fine-tuning. Adding structured attention to BERT gives about 8% improvement, due to the availability of the entity information during encoding. Adding the structured prediction layer to BERT also leads to a similar amount of improvement. However, the gap between the structured-prediction BERT with and without the structured attention layer is small. This is likely because of a bias in the data distribution: the assumption that only two target entities exist makes the two techniques have similar effects.
We propose the first system that simultaneously extracts multiple relations with a single encoding pass over an input paragraph. With the proposed structured prediction and entity-aware self-attention layers, we make better use of the pre-trained BERT model and achieve significant improvement on the ACE 2005 data.
- Ahmad et al. (2018) Wasi Uddin Ahmad, Zhisong Zhang, Xuezhe Ma, Eduard Hovy, Kai-Wei Chang, and Nanyun Peng. 2018. Near or far, wide range zero-shot cross-lingual dependency parsing. arXiv preprint arXiv:1811.00570.
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Dozat and Manning (2018) Timothy Dozat and Christopher D Manning. 2018. Simpler but more accurate semantic dependency parsing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 484–490.
- Fu et al. (2017) Lisheng Fu, Thien Huu Nguyen, Bonan Min, and Ralph Grishman. 2017. Domain adaptation for relation extraction with domain adversarial neural network. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), volume 2, pages 425–429.
- Gormley et al. (2015) Matthew R Gormley, Mo Yu, and Mark Dredze. 2015. Improved relation extraction with feature-rich compositional embedding models. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1774–1784.
- Hendrickx et al. (2009) Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid Ó Séaghdha, Sebastian Padó, Marco Pennacchiotti, Lorenza Romano, and Stan Szpakowicz. 2009. SemEval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. In Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions, pages 94–99. Association for Computational Linguistics.
- Linguistic Data Consortium (2013) Linguistic Data Consortium. 2013. Deft ere annotation guidelines: Relations v1.1. 05.17.2013.
- McCann et al. (2017) Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems, pages 6294–6305.
- Nguyen and Grishman (2015) Thien Huu Nguyen and Ralph Grishman. 2015. Combining neural networks and log-linear models to improve relation extraction. arXiv preprint arXiv:1511.05926.
- Peters et al. (2018) Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), volume 1, pages 2227–2237.
- Qu et al. (2014) Lizhen Qu, Yi Zhang, Rui Wang, Lili Jiang, Rainer Gemulla, and Gerhard Weikum. 2014. Senti-lssvm: Sentiment-oriented multi-relation extraction with latent structural svm. Transactions of the Association for Computational Linguistics, 2:155–168.
- Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf.
- Riedel et al. (2013) Sebastian Riedel, Limin Yao, Andrew McCallum, and Benjamin M Marlin. 2013. Relation extraction with matrix factorization and universal schemas. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 74–84.
- Shaw et al. (2018) Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. In NAACL-HLT, pages 464–468.
- Shi et al. (2018) Ge Shi, Chong Feng, Lifu Huang, Boliang Zhang, Heng Ji, Lejian Liao, and Heyan Huang. 2018. Genre separation network with adversarial training for cross-genre relation extraction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1018–1023.
- Surdeanu et al. (2012) Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, and Christopher D Manning. 2012. Multi-instance multi-label learning for relation extraction. In Proceedings of the 2012 joint conference on empirical methods in natural language processing and computational natural language learning, pages 455–465.
- Verga et al. (2016) Patrick Verga, David Belanger, Emma Strubell, Benjamin Roth, and Andrew McCallum. 2016. Multilingual relation extraction using compositional universal schema. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 886–896.
- Verga et al. (2018) Patrick Verga, Emma Strubell, and Andrew McCallum. 2018. Simultaneously self-attending to all mentions for full-abstract biological relation extraction. In NAACL 2018, pages 872–884.
- Walker et al. (2006) Christopher Walker, Stephanie Strassel, Julie Medero, and Kazuaki Maeda. 2006. Ace 2005 multilingual training corpus. Linguistic Data Consortium, Philadelphia, 57.
- Wang et al. (2016) Linlin Wang, Zhu Cao, Gerard de Melo, and Zhiyuan Liu. 2016. Relation classification via multi-level attention cnns. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1298–1307.
- Yih et al. (2015) Wen-tau Yih, Ming-Wei Chang, Xiaodong He, and Jianfeng Gao. 2015. Semantic parsing via staged query graph generation: Question answering with knowledge base. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pages 1321–1331.
- Yu et al. (2017) Mo Yu, Wenpeng Yin, Kazi Saidul Hasan, Cicero dos Santos, Bing Xiang, and Bowen Zhou. 2017. Improved neural relation detection for knowledge base question answering. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 571–581.