How Knowledge Graph and Attention Help? A Quantitative Analysis into Bag-level Relation Extraction

07/26/2021 ∙ by Zikun Hu, et al. ∙ National University of Singapore ∙ Nanyang Technological University ∙ Virginia Polytechnic Institute and State University

Knowledge Graph (KG) and attention mechanisms have been demonstrated effective in introducing and selecting useful information for weakly supervised methods. However, only qualitative analysis and ablation studies have been provided as evidence. In this paper, we contribute a dataset and propose a paradigm to quantitatively evaluate the effect of attention and KG on bag-level relation extraction (RE). We find that (1) higher attention accuracy may lead to worse performance, as it may harm the model's ability to extract entity mention features; (2) the performance of attention is largely influenced by various noise distribution patterns, which is closely related to real-world datasets; (3) KG-enhanced attention indeed improves RE performance, not through enhanced attention but by incorporating entity priors; and (4) the attention mechanism may exacerbate the issue of insufficient training data. Based on these findings, we show that a straightforward variant of an RE model can achieve significant improvements (6% AUC on average) on two real-world datasets as compared with three state-of-the-art baselines. Our code and datasets are publicly available.




1 Introduction

Relation Extraction (RE) is crucial for Knowledge Graph (KG) construction and population. Most recent efforts rely on neural networks to learn expressive features from large-scale annotated data and thus correctly extract the relation between entities. To save manual annotation cost and alleviate data scarcity, distant supervision relation extraction (DSRE) Mintz et al. (2009) was proposed and has become increasingly popular, as it can automatically generate large-scale labeled data. DSRE is based on a simple yet effective principle: if there is a relation between two entities in a KG, then all sentences containing mentions of both entities are assumed to express this relation, and together they form a sentence bag as its annotations.

Figure 1: Examples of disturbing bags in NYT-FB60K.

Although effective, distant supervision may introduce noise into a sentence bag when this assumption fails, i.e., some sentences do not describe the target relation Zeng et al. (2015) (a.k.a. noisy annotation). To alleviate the negative impact of noise, recent studies Lin et al. (2016); Ji et al. (2017); Du et al. (2018); Li et al. (2020) leveraged attention to select informative instances from a bag. Furthermore, researchers introduced KG embeddings to enhance the attention mechanism Hu et al. (2019); Han et al. (2018a). The basic idea is to use entity embeddings as the query when computing attention scores, so that sentences with high attention weights are more likely to be valid annotations Zhang et al. (2019). Previous studies have shown performance gains on DSRE with attention modules and KG embeddings; however, it is still not clear how these mechanisms work, and whether there are limitations to applying them.

In this paper, we aim to provide a thorough, quantitative analysis of the impact of both the attention mechanism and KG on DSRE. By analyzing several public benchmarks, including NYT-FB60K Han et al. (2018a), we observe many disturbing bags, i.e., bags in which all sentences are valid or all are noisy annotations, which lead to the failure of attention. As shown in Figure 1, all annotations in the first disturbing bag are valid, yet the learned attention assigns the second annotation a very low weight, which suggests an inefficient utilization of annotations and exacerbates the data sparsity issue. In the second bag, all sentences are noisy; can attention and KG still improve performance? If so, how do they work, and to what extent can they tolerate these disturbing bags? Answering these questions is crucial since this type of noise is common in practice, and unveiling the working mechanism shall shed light on future research directions, not limited to DSRE.

To achieve this, we propose an analysis paradigm based on a newly curated DSRE benchmark, BagRel-Wiki73K, extracted from FewRel Han et al. (2018b) and Wikidata, for the quantitative analysis of attention and KG. With extensive experiments, we reach the following findings:

(1) The accuracy of attention is inversely proportional to the total noise ratio and the disturbing bag ratio of the training data; (2) attention effectively selects valid annotations by comparing their contexts with the semantics of relations, and thus tends to rely more on context to make predictions; however, this lowers the model's robustness to noisy sentences that do not express the relation; (3) KG-enhanced attention indeed improves RE performance, surprisingly not via enhanced attention accuracy, but by incorporating entity features that reduce the reliance on contexts when facing noise; (4) attention can hurt performance, especially when there is insufficient training data.

Based on the above observations, we propose a straightforward yet effective model based on pre-trained BERT Devlin et al. (2018) for RE with Concatenated KG Embedding, namely BRE+CE. Instead of in-bag attention, it breaks the bag and ensembles the results of all sentences belonging to the bag. For each sentence, we directly incorporate entity embeddings into BERT, rather than using them to enhance attention, to improve the robustness of extracting both context and mention features. BRE+CE significantly outperforms existing state-of-the-art methods on two publicly available datasets, NYT-FB60K Han et al. (2018a) and GIDS-FB8K Jat et al. (2018), by 6% AUC on average. We summarize our contributions as follows:

  • To the best of our knowledge, our proposed framework is the first work to quantitatively analyze the working mechanism of Knowledge Graph and attention for bag-level RE.

  • We conduct extensive experiments that support the above findings.

  • We demonstrate that a straightforward method based on the findings can achieve improvements on public datasets.

2 Related Work

To address the issue of insufficient annotations, Mintz et al. (2009) proposed distant supervision to generate training data automatically, which also introduces much noise. Since then, DSRE has become a standard solution that relies on multi-instance learning over a bag of sentences instead of a single sentence Riedel et al. (2010); Hoffmann et al. (2011). The attention mechanism Lin et al. (2016) accelerated this trend through its strong ability to handle noisy instances within a bag Liu et al. (2017); Du et al. (2018). Aside from intra-bag attention, Ye and Ling (2019) also designed inter-bag attention to simultaneously handle bags with the same relation. To deal with one-instance bags, Li et al. (2020) utilized a selective gate (SeG) framework to independently assign a weight to each sentence. External KGs have also been incorporated to enhance the attention module Han et al. (2018a); Hu et al. (2019). However, due to the lack of sentence-level ground truth, it is difficult to quantitatively evaluate the performance of the attention module; previous researchers tend to provide examples as case studies. (Shahbazi et al. (2020) claim to annotate each positive bag in NYT-FB60K, but have not published their code and dataset.) Therefore, we aim to fill this research gap by constructing a dataset and providing a framework for thorough analysis.

3 Preliminary

Knowledge Graph (KG) is a directed graph G = (E, R, T), where E denotes the set of entities, R denotes the set of relation types in G, and T = {(h, r, t)} ⊆ E × R × E denotes the set of triples. KG embedding models, e.g., RotatE Sun et al. (2019), can preserve the structure information of G in the learned entity and relation vectors h, r and t. We adopt TransE Bordes et al. (2013) in our experiments.

Bag-level relation extraction (RE) takes a bag of sentences B = {s_1, ..., s_n} as input. Each sentence in the bag contains the same entity pair (h, t), where h, t ∈ E. The goal is to predict the relation r ∈ R between h and t.

Attention-based Bag-level RE uses attention to assign a weight to each sentence within a bag. Given a bag B = {s_1, ..., s_n} from the dataset D, an encoder first encodes the sentences of B into vectors x_1, ..., x_n separately. Then, an attention module computes an attention weight α_i for each sentence and outputs the weighted sum of the x_i as b to represent B:

    α_i = exp(x_i · q_r) / Σ_j exp(x_j · q_r),    b = Σ_i α_i x_i,

where q_r is the label embedding of relation r in the classification layer. We denote this attention module as ATT in the rest of the paper.
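As an illustration, the ATT computation can be sketched with toy NumPy vectors (the function and variable names here are illustrative, not taken from the paper's code):

```python
import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

def att_bag_representation(sentence_vecs, relation_query):
    """Selective attention (ATT): score each encoded sentence against the
    relation label embedding q_r, normalize with softmax, and return the
    attention weights and the weighted-sum bag representation b."""
    X = np.stack(sentence_vecs)           # (n, d) sentence vectors x_i
    alpha = softmax(X @ relation_query)   # attention weights alpha_i
    return alpha, alpha @ X               # bag vector b = sum_i alpha_i x_i
```

A sentence whose vector aligns with the relation query receives a larger weight, so its features dominate the bag representation.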

KG-enhanced attention aims to improve α with the entity embeddings h and t Han et al. (2018a):

    r_ht = t − h,    α_i = exp(x_i^T W r_ht + b_a) / Σ_j exp(x_j^T W r_ht + b_a),

where r_ht is regarded as a latent relation embedding, and W and b_a are learnable parameters. We mark this way of computing α as KA.

Given a bag representation b, the classification layer further predicts a confidence for each relation:

    p = softmax(o),    o = W_c b + b_c,

where o is a logit vector, and W_c and b_c are learnable parameters. During training, the loss is computed by:

    L = − (1/N) Σ_{i=1}^{N} log p(r_i | B_i),

where N is the number of training bags in D. Since the classification layer is linear, we can rewrite the bag's logit vector o as a weighted sum of each sentence's logit vector o_i = W_c x_i + b_c:

    o = W_c (Σ_i α_i x_i) + b_c = Σ_i α_i (W_c x_i + b_c) = Σ_i α_i o_i,

using Σ_i α_i = 1. From this decomposition, we can see that the model's output on the whole bag depends on three aspects: (1) the model's output on valid sentences within the bag; (2) the model's output on noisy sentences within the bag; and (3) the attention weights assigned to valid and noisy sentences.
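The linearity argument above can be checked numerically; the following sketch (with arbitrary toy dimensions) verifies that classifying the attention-weighted bag vector is identical to attention-weighting the per-sentence logits:

```python
import numpy as np

rng = np.random.default_rng(0)
W_c = rng.normal(size=(4, 3))        # classification weights (4 relations, d = 3)
b_c = rng.normal(size=4)             # classification bias
X = rng.normal(size=(5, 3))          # 5 encoded sentences of one bag
alpha = rng.dirichlet(np.ones(5))    # attention weights, which sum to 1

bag_logits = W_c @ (alpha @ X) + b_c     # o = W_c b + b_c on the bag vector
sent_logits = X @ W_c.T + b_c            # o_i = W_c x_i + b_c per sentence
weighted = alpha @ sent_logits           # sum_i alpha_i o_i
assert np.allclose(bag_logits, weighted) # identical because sum_i alpha_i = 1
```

The bias term survives the decomposition only because the attention weights sum to one, which is why the three-aspect view holds exactly for softmax attention.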

4 Benchmark

To quantitatively evaluate the effect of attention and KG on Bag-level RE, we first define two metrics to measure the noise pattern (Section 4.1). Then, we construct a KG and a Bag-level RE dataset (Section 4.2). Finally, we introduce a general evaluation framework to assess attention, KG and the entire RE model (Section 4.3).

4.1 Metrics Describing Noise Pattern

To analyze how attention module functions on different noise patterns, we first design 2 metrics to describe the noise pattern: Noise Ratio (NR) and Disturbing Bag Ratio (DR).

Noise Ratio (NR)

represents the proportion of noisy sentences in the dataset. Given a bag B and its relation label r, a sentence s ∈ B is noisy if its context does not express r. Let I_n(s) be an indicator function telling whether s is noise. Then NR is defined as:

    NR = Σ_{i=1}^{N} Σ_{s ∈ B_i} I_n(s) / Σ_{i=1}^{N} |B_i|,

where |B_i| is the size of B_i and N is the total number of bags.

Disturbing Bag Ratio (DR)

means the proportion of disturbing bags in the dataset. A bag is disturbing if all of its sentences are valid or all of them are noisy. Formally, we use a function D(B) to indicate whether a bag is disturbing:

    D(B) = 1 if Σ_{s ∈ B} I_n(s) ∈ {0, |B|}, else 0.

Then we define DR as follows:

    DR = (1/N) Σ_{i=1}^{N} D(B_i).
Figure 2: Left: Process of synthesizing a valid sentence with correct context and a noisy sentence with wrong context. Right: Visualization of four training sets with different noise patterns; from left to right, they are S(2/3, 0), S(1/2, 0), S(1/2, 1/2) and S(1/2, 1) in the S(NR, DR) notation of Section 4.2.

4.2 Dataset Construction

Based on FewRel and Wikidata, we construct a Bag-level RE dataset containing multiple training sets with different noise patterns, a test set and a development set. For each sentence in the bags, there is a ground truth attention label indicating whether it is a valid sentence or noise. We also construct a KG containing all entities in the RE dataset by retrieving one-hop triples from Wikidata.

Synthesize Sentence

FewRel is a sentence-level RE dataset covering 80 relations, each with 700 valid sentences. Each sentence has a unique entity pair. Every sentence, along with its entities and relation label, forms a tuple (s, h, t, r). We synthesize valid and noisy sentences for the same entity pair for data augmentation.

The first step is to divide the sentences of each relation into 3 sets: D_train, D_test and D_dev, with 500, 100 and 100 sentences, respectively. Then, for each tuple (s, h, t, r) in a set, we augment it into a bag B, where all sentences in B contain (h, t). The sentences in B are either the original s, a synthesized valid sentence, or a synthesized noisy sentence. We synthesize sentences in the form (s', h, t, r, a), where a denotes the attention label (1 for valid, 0 for noisy). Specifically, to synthesize a sentence, we randomly replace a source pair of entity mentions with the target entity pair while keeping the context unchanged. Thus, depending on whether the context expresses the same relation type as the target entity pair, we can automatically assign an attention label.

We illustrate the synthesizing process in Figure 2. (s1, h1, t1, r1, 1) is a sentence from D_train. To generate a valid sentence, we randomly select another sentence (s2, h2, t2, r1, 1) labeled with the same relation r1 from D_train. We then replace its entity mentions h2 and t2 with h1 and t1, yielding (s2', h1, t1, r1, 1). Since its context correctly describes the relation crosses, we regard s2' as valid. For the noisy sentence, we randomly select a sentence under another relation. With a similar replacement process, we obtain a synthesized sentence (s3', h1, t1, r1, 0). Because the context of s3' does not express the target relation, we label it as noise.
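The replacement step can be sketched as follows; the function name and the plain string-based mention swapping are illustrative simplifications of the actual pipeline:

```python
def synthesize(donor_sentence, donor_head, donor_tail,
               target_head, target_tail, same_relation):
    """Create a synthesized sentence for the target entity pair by swapping
    the donor sentence's entity mentions for the target mentions. The
    attention label is 1 (valid) iff the donor context expresses the same
    relation as the target tuple, and 0 (noisy) otherwise."""
    context = donor_sentence.replace(donor_head, target_head)
    context = context.replace(donor_tail, target_tail)
    return context, 1 if same_relation else 0
```

Because only the mentions change, the label follows mechanically from whether the donor and target tuples share a relation, which is what makes the attention labels free.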

Training Sets with Different Noise Patterns

As defined in Section 4.1, we use NR and DR to measure the noise pattern of a bag-level RE dataset. By controlling the number of synthesized noisy sentences in each bag and the total ratio of noise among all sentences, we construct several training sets with different patterns. In the following sections, we denote a training set whose NR is x and DR is y as S(x, y). Higher x and y indicate that noisy sentences and disturbing bags account for a larger proportion.

For example, in Figure 2, assume there are 4 sentences in D_train. For the first set, for each sentence we synthesize two noisy sentences that form a bag together with the original sentence; each bag thus contains 3 sentences (1 valid, 2 noisy), so its NR is 2/3 and its DR is 0. For the other 3 sets, the number of synthesized noisy sentences equals the sum of original and synthesized valid sentences, so they all have an NR of 1/2. Since we define bags containing no valid sentences or no noisy sentences as disturbing bags, the third and fourth sets have 2 and 4 disturbing bags, with a DR of 1/2 and 1, respectively.

Test Set and Development Set

We also construct a test set and a development set. Similar to the second set in Figure 2, each bag in the test/dev sets contains two sentences, one valid and one noisy; hence the NR of both sets is 1/2 and their DR is 0. Instead of multiple test sets with different noise patterns, we use a single test set so that the evaluation of different models is consistent. To avoid information leakage, when constructing the training, test and development sets, the contexts of synthesized sentences come only from D_train, D_test and D_dev, respectively.

The final BagRel contains 9 training sets, 1 test set and 1 development set, as listed in Table 1. The NR of a training set is 1/3, 1/2 or 2/3, and its DR is 0, 1/2 or 1. The NR of both the test and development sets is 1/2, and their DR is 0. All sets contain 80 relations. For training sets with NR 1/3, 1/2 and 2/3, every bag contains 3, 2 and 3 sentences, respectively.

Dataset | # Noisy Sentence | # Sentence | # Bag
S(1/3, ·) | 40K | 120K | 40K
S(1/2, ·) | 40K | 80K | 40K
S(2/3, ·) | 80K | 120K | 40K
Test | 8K | 16K | 8K
Dev | 8K | 16K | 8K
Table 1: Statistics of the 11 sets of BagRel-Wiki73K, where S(x, ·) denotes the three sets S(x, 0), S(x, 1/2) and S(x, 1).

KG Construction

To evaluate the impact of KG on the attention mechanism, we construct a KG based on Wikidata. Denoting the set of entities appearing in FewRel as E_F, we link each entity in E_F to Wikidata by its Freebase ID, and then extract all triples (h, r, t) in Wikidata where h, t ∈ E_F, yielding KG-73K. To evaluate the effect of structural information from the KG, we also construct a random KG, KG-73K-random: for each triple (h, r, t) in KG-73K, we corrupt it into (h, r', t) by replacing r with a random relation r'. Thus the prior knowledge within the KG is destroyed. KG-73K and KG-73K-random have the same scale: 72,954 entities, 552 relations and 407,821 triples.
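The corruption step can be sketched as follows; the function name is illustrative, and whether the random relation may coincide with the original is a detail the paper leaves unstated:

```python
import random

def corrupt_kg(triples, relation_set, seed=0):
    """Build a KG-73K-random-style triple set: keep every (head, tail)
    pair but replace the relation with one drawn uniformly at random,
    destroying the relational prior while preserving the KG's scale."""
    rng = random.Random(seed)
    return [(h, rng.choice(relation_set), t) for h, _, t in triples]
```

The corrupted KG has exactly the same entities, relation vocabulary and triple count, so any performance gap between the two KGs isolates the value of correct relational structure.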

Finally, we obtain BagRel-Wiki73K, including the Bag-level RE sets and KG-73K.

4.3 Evaluation Framework

We first define several measurements to evaluate the effect of the attention mechanism and KG: Attention Accuracy (AAcc), Area Under precision-recall Curve (AUC), AUC on Valid sentences (AUCV) and AUC on Noisy sentences (AUCN).


Attention Accuracy (AAcc) measures the attention module's ability to assign higher weights to valid sentences than to noisy ones. Given a non-disturbing bag B (a bag containing both valid and noisy sentences) and its attention weights α, the AAcc of this bag is calculated as:

    AAcc_B = Σ_{i: a_i = 1} Σ_{j: a_j = 0} I(α_i > α_j) / (n_v · n_n),

where n_v and n_n are the numbers of valid and noisy sentences in B, a_i is the attention label of sentence s_i, and I(·) is an indicator function which returns 1 or 0 if the input is true or false. The product n_v · n_n counts how many valid-noisy sentence pairs are contained in B; the double sum counts how many of those pairs put the higher weight on the valid sentence. The AAcc of the whole data set is then computed as the average (1/n) Σ_B AAcc_B, where n is the number of bags in the data set.

AAcc is designed specifically for non-disturbing bags; on disturbing bags, where all sentences are noisy or all are valid, it is meaningless to evaluate the attention module's performance. Hence, in the test/dev sets of BagRel-Wiki73K, all bags are non-disturbing. Without this distraction, the evaluation results better reflect how the attention module works.


AUC is a standard metric to evaluate a DSRE model's performance on a bag-level test set. As mentioned in Section 3, an attention-based model's performance on non-disturbing bags relies on three aspects: (1) AAcc, (2) the model's performance on valid sentences, and (3) the model's performance on noisy sentences. We use AUCV and AUCN to measure the second and third aspects, respectively. The difference between AUC and AUCV is that AUC is computed on the original test set, while AUCV is the AUC computed on a valid-only test set, which has the same labels but removes all noisy sentences. Since the valid-only set contains no noisy context features, models can utilize both entity mentions and contexts to achieve a high AUCV. Conversely, AUCN is the AUC computed on a noise-only test set, which removes all valid sentences. Since all context features there are noisy, to achieve a high AUCN, models have to ignore context and rely more on mention features to make predictions.

AUC, AUCV and AUCN range from 0 to 1; a higher value indicates that a model makes better predictions on whole bags, valid sentences and noisy sentences, respectively.

5 Method

To evaluate the effects of attention and KG, we design two straightforward bag-level RE models without the attention module, BRE and BRE+CE. By comparing their performance with BRE+ATT (BRE with the attention module) and BRE+KA (BRE with the KG-enhanced attention module), we can better understand the roles of ATT and KA.

BRE uses BERT Devlin et al. (2018) as the encoder. Specifically, we follow Peng et al. (2020); Soares et al. (2019): entity mentions in sentences are highlighted with special markers before and after the mentions, and the concatenation of the head and tail entity representations is used as the sentence representation x. Since BRE has no attention mechanism, it breaks the bags and computes the loss on each sentence:

    L = − (1/M) Σ_{i=1}^{M} log p(r_i | s_i),

where M is the total number of training sentences. BRE can be viewed as a special case of BRE+ATT whose attention module assigns every sentence in every bag the same weight 1. During inference, given a bag B, BRE uses the mean of the sentence-level predictions as the whole bag's prediction:

    p(B) = (1/|B|) Σ_{s ∈ B} p(s).
BRE+CE concatenates an additional feature vector c with the BERT output x, where c is built from the KG embeddings of h and t. The concatenated vector [x; c] is used as the representation of the sentence and fed into the classification layer.
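Both pieces are simple to state in code; the following sketch (illustrative names, toy vectors) shows BRE's mean-ensemble bag prediction and the BRE+CE concatenation:

```python
import numpy as np

def bre_bag_predict(sentence_probs):
    """BRE inference: the bag's relation distribution is the mean of the
    per-sentence predicted distributions (every sentence weighted 1)."""
    return np.mean(np.stack(sentence_probs), axis=0)

def ce_representation(bert_vec, head_emb, tail_emb):
    """BRE+CE: concatenate the KG embeddings of the head and tail entities
    with the BERT sentence representation before classification."""
    return np.concatenate([bert_vec, head_emb, tail_emb])
```

Averaging treats every sentence equally, which is exactly what makes BRE immune to the attention failure modes studied below, while the concatenated entity features supply the mention-side prior.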

6 Experiment

We apply our proposed framework on BagRel-Wiki73K and two real-world datasets to explore the following questions:

  • How does the noise pattern affect the attention module?

  • Does the attention mechanism improve RE performance?

  • How does KG affect the attention mechanism?

  • Does attention aggravate data sparsity?

6.1 Experimental Setup

For a fair comparison, all baselines share the same encoding structure as BRE. The attention-based models include BRE+ATT, BRE+KA and BRE+SeG, where SeG Li et al. (2020) is an advanced attention mechanism that achieves state-of-the-art performance on NYT-FB60K; briefly, SeG uses sigmoid instead of softmax to compute the attention weight of each instance in a bag. The models without attention are BRE and BRE+CE. To check the effect of the noise pattern, we train models on the different training sets. As a reminder, S(x, y) is a training set whose NR and DR are x and y, respectively.

6.2 Noise Pattern v.s. Attention Accuracy

We train BRE+ATT on the 9 training sets with different noise patterns. As shown in Figure 3, we can see that: (1) a higher noise ratio (NR) makes it harder for the model to highlight valid sentences, leading to a lower attention accuracy (AAcc); (2) a higher disturbing bag ratio (DR) also results in lower AAcc, indicating that disturbing bags challenge the attention module. Based on these results, we claim that the noise pattern of the training set largely affects the attention module's effectiveness.

Figure 3: Attention accuracy (AAcc) on the test set of BagRel-Wiki73K. The results are collected with BRE+ATT trained on training sets with various noise patterns. The x axis denotes training sets with different disturbing bag ratios (DR); the colors indicate different noise ratios (NR).

6.3 Attention v.s. RE Performance

Model | AUC | AAcc | AUCV | AUCN
BRE-S(1/2, 0) | .910 | NA | .932 | .850
BRE+ATT-S(1/2, 0) | .878 | .881 | .941 | .434
BRE+ATT-S(1/2, 1/2) | .897 | .751 | .932 | .711
BRE+ATT-S(1/2, 1) | .896 | .713 | .925 | .759
Table 2: Test results of models trained on different training sets. In the Model column, X-Y means model X trained on training set Y. Among the 3 training sets, S(1/2, 1) has the most disturbing bags, while S(1/2, 0) has none.

To quantitatively analyze the effect of the attention mechanism, we compare the performance of BRE and BRE+ATT in Table 2, keeping all other parts of the model unchanged. A higher AUCV indicates a stronger ability of the model itself in an ideal setting without any noise, while a higher AUCN indicates higher robustness to noise. Surprisingly, when using the same training set S(1/2, 0), the AUC of the attention-enhanced model is lower than that of the model without attention (.878 vs. .910). In addition, BRE+ATT has its lowest AUC on S(1/2, 0), which has no disturbing bags, even though its highest AAcc there (.881) shows that the attention module does effectively select valid sentences. Why does the most effective attention module lead to the worst performance? The reason is that BRE+ATT-S(1/2, 0) has a much lower AUCN (.434), indicating that it is less robust to noisy sentences.

Does an effective attention module necessarily hurt the model's robustness to noise? This runs against intuition. To answer this, we draw Figure 4 by assigning fixed attention weights to sentences during training. Specifically, each bag in S(1/2, 0) has one valid sentence and one noisy sentence; we assign a fixed attention weight w to the valid one and 1 − w to the noisy one, instead of computing weights with the attention module, and then test the resulting model's AUCN and AUCV. When the valid sentences receive higher attention weights, the AUCV curve rises slightly, indicating the model itself indeed gets stronger. Meanwhile, the AUCN curve drops sharply. This demonstrates that effective attention weakens the model's robustness to noise: a model with a high-performance attention module prefers to utilize context information instead of entity mention features, and thus usually fails when most contexts are noisy.

Figure 4: AUCV and AUCN results of BRE+ATT-S(1/2, 0) trained with fixed attention weights.

This explains the results in Table 2: BRE+ATT-S(1/2, 0) has the highest AAcc, meaning it assigns very low weights to noisy sentences, so the gain in AUCV cannot make up for the loss in AUCN, resulting in a worse overall AUC.

In conclusion, the attention module can effectively select valid sentences during training and testing, but it has an underlying drawback: it may hurt the model's ability to predict based on entity mention features, which are important in RE tasks Li et al. (2020); Peng et al. (2020), leading to worse overall performance.

6.4 KG v.s. Attention

Model | AUC | AAcc | AUCV | AUCN
BRE+ATT-S(1/2, 0) | .878 | .881 | .941 | .434
BRE+KA(rand)-S(1/2, 0) | .915 | .762 | .936 | .659
BRE+KA-S(1/2, 0) | .932 | .857 | .936 | .560
BRE+KA-S(1/2, 1/2) | .924 | .720 | .928 | .723
BRE+KA-S(1/2, 1) | .913 | .617 | .916 | .761
BRE+CE-S(1/2, 0) | .915 | NA | .935 | .856
BRE+CE-S(1/2, 1/2) | .919 | NA | .939 | .849
BRE+CE-S(1/2, 1) | .918 | NA | .941 | .845
Table 3: Results of models trained on different training sets. In the Model column, X-Y means model X trained on training set Y. BRE+KA(rand) uses entity embeddings learned on KG-73K-random for the attention module.

To measure KG’s effect on the combined with attention mechanism, we compare the results of KA with ATT, while keeping other parts of the model unchanged. As shown in Table 3. When trained on , the KG-enhanced model (KA-) has lower AAcc than the model without KG (ATT-) ( v.s. ), while the AUC is higher ( v.s. ). This is because the KA version has a higher AUCN () and comparable AUCV and AAcc. Thus, the KG-enhanced model achieves better performance on noisy bags, leading to a better RE performance.

In addition, comparing Table 2 and Table 3, KA shows lower AAcc and higher AUCN than ATT on all three training sets. This also demonstrates that the KG does not improve the model by sharpening the attention module's accuracy, but by enhancing the robustness of the encoder and classification layer to noisy sentences. This makes sense because the information from the KG focuses on entities instead of contexts: by incorporating the KG, the model relies more on entity mention features instead of noisy context features, and thus becomes better at classifying noisy sentences.

Moreover, comparing the performance of BRE+KA using embeddings from the random KG with that of standard BRE+KA on S(1/2, 0), we observe that incorporating entity embeddings learned from a random KG yields a much lower attention accuracy (.762 vs. .857). This indicates that misleading knowledge hurts the attention mechanism.

6.5 Attention v.s. Data Sparsity

The attention module assigns low weights to part of the training sentences. When training data is insufficient, not making full use of all training examples can aggravate the data sparsity issue. We therefore compare the performance of models trained on subsets of S(1/2, 0). From Figure 5, we can see that as the size of the training data decreases, the performance gap between BRE+ATT and BRE+CE grows, because the latter fully utilizes every example by assigning the same weight 1 to all sentences. We also inspect each model's attention weights: BRE+SeG assigns almost all sentences weights close to 1, so its performance drop is similar to that of the model without attention. Thus, we claim that the traditional attention mechanism can exacerbate the model's sensitivity to insufficient data. This motivates a better attention mechanism for few-shot settings, which we leave for future work.

Figure 5: AUC test results of models trained on 4 subsets of BagRel-Wiki73K's S(1/2, 0) set, containing 2%, 10%, 20% and 100% of its bags.

6.6 Stability of Attention v.s. Noise Pattern

From the results in Tables 2 and 3, we can see that the performance of BRE+CE is stable as the ratio of disturbing bags changes, whereas BRE+ATT and BRE+KA vary across training sets. On S(1/2, 1), which has the most disturbing bags, BRE+CE outperforms BRE+ATT and BRE+KA, demonstrating that BRE+CE is a competitive method for bag-level DSRE.

6.7 Results on Real-world Datasets

Model | NYT-FB60K | GIDS-FB8K
JointE | .408 | .912
RELE | .497 | .905
SeG | .451 | .913
BRE+ATT | .457 | .917
BRE+KA | .480 | .917
BRE | .625 | .910
BRE+CE | .630 | .917
Table 4: AUC on NYT-FB60K and GIDS-FB8K.

Figure 6: Precision/recall curves on NYT-FB60K

Based on the previous observations, BRE and BRE+CE avoid the latent drawbacks of the attention mechanism and perform stably on datasets with different noise patterns, making them competitive methods compared with prior baselines. To examine whether they work on real-world bag-level DSRE datasets, we compare them with three previous baselines on NYT-FB60K Han et al. (2018a) and GIDS-FB8K Jat et al. (2018). We select JointE Han et al. (2018a), RELE Hu et al. (2019) and SeG Li et al. (2020) as baselines because they achieve state-of-the-art performance on bag-level RE. To collect AUC results, we carefully re-run their published code using the hyperparameters suggested in the original papers, and we draw precision-recall curves following prior work. As shown in Table 4 and Figure 6, our method BRE+CE largely outperforms existing methods on NYT-FB60K and performs comparably on GIDS-FB8K, demonstrating that it avoids the attention mechanism's latent drawback of hurting the model's robustness. Furthermore, the improvement on NYT-FB60K is substantial (around 13% AUC), for two reasons: (1) NYT-FB60K is a noisy dataset containing prevalent disturbing bags, similar to our synthesized datasets; and (2) NYT-FB60K is highly imbalanced and most relation types have only limited training data, while all relation types in our balanced datasets have the same number of training examples; thus BRE+CE and BRE achieve a much larger improvement on NYT-FB60K than on the synthesized datasets. In conclusion, the high performance not only validates our claim that the attention module may not perform well on noisy and insufficient training data, but also verifies that our analysis of attention and KG has practical significance.

6.8 Effect of KG

Model | BagRel | NYT | GIDS
BRE+ATT | .878 | .457 | .917
BRE+KA | .932 | .480 | .917
BRE | .910 | .625 | .910
BRE+CE | .915 | .630 | .917
Table 5: AUC test results of models on BagRel-Wiki73K, NYT-FB60K and GIDS-FB8K. In the BagRel column, all models are trained on S(1/2, 0).

The results in Table 5 provide a direct comparison between models with KG (BRE+KA, BRE+CE) and models without KG (BRE+ATT, BRE). Both ways of utilizing the KG (combining it with attention, or concatenating it as additional features) outperform the methods that do not use the KG, demonstrating that prior knowledge from the KG is beneficial for the relation extraction task. Beyond our naive BRE+CE, we expect that a carefully designed mechanism for incorporating the KG could lead to further improvement; we leave this for future work.

7 Conclusion

In conclusion, we construct a set of datasets and propose a framework to quantitatively evaluate how the attention module and KG work in bag-level RE. Based on the findings, we demonstrate the effectiveness of a straightforward solution to this task. The experimental results support our claims: the accuracy of the attention mechanism depends on the noise pattern of the training set, and although it effectively selects valid sentences, the attention mechanism can harm the model's robustness to noisy sentences and aggravate the data sparsity issue. As for the KG's effect on attention, we observe that it improves the model by enhancing its robustness with external entity information rather than by improving attention accuracy.

In the future, we are interested in developing a more general evaluation framework for other tasks such as question answering, improving the attention mechanism to be robust to noise and insufficient data, and designing an effective approach to incorporate KG knowledge to guide model training.


This research/project is supported by NExT Research Centre. This research was also conducted in collaboration with SenseTime. This work is partially supported by A*STAR through the Industry Alignment Fund - Industry Collaboration Projects Grant, by NTU (NTU–ACE2020-01) and Ministry of Education (RG96/20), and by the National Research Foundation, Prime Minister’s Office, Singapore under its Energy Programme (EP Award No. NRF2017EWT-EP003-023) administrated by the Energy Market Authority of Singapore.


  • A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko (2013) Translating embeddings for modeling multi-relational data. In Neural Information Processing Systems (NIPS), pp. 1–9. Cited by: §3.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1, §5.
  • J. Du, J. Han, A. Way, and D. Wan (2018) Multi-level structured self-attentions for distantly supervised relation extraction. arXiv preprint arXiv:1809.00699. Cited by: §1, §2.
  • X. Han, Z. Liu, and M. Sun (2018a) Neural knowledge acquisition via mutual attention between knowledge graph and text. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32. Cited by: §1, §2, §3, §6.7.
  • X. Han, H. Zhu, P. Yu, Z. Wang, Y. Yao, Z. Liu, and M. Sun (2018b) FewRel: a large-scale supervised few-shot relation classification dataset with state-of-the-art evaluation. arXiv preprint arXiv:1810.10147. Cited by: §1.
  • R. Hoffmann, C. Zhang, X. Ling, L. Zettlemoyer, and D. S. Weld (2011) Knowledge-based weak supervision for information extraction of overlapping relations. In Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, pp. 541–550. Cited by: §2.
  • L. Hu, L. Zhang, C. Shi, L. Nie, W. Guan, and C. Yang (2019) Improving distantly-supervised relation extraction with joint label embedding. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3812–3820. Cited by: §1, §2, §6.7.
  • S. Jat, S. Khandelwal, and P. Talukdar (2018) Improving distantly supervised relation extraction using word and entity based attention. arXiv preprint arXiv:1804.06987. Cited by: §1, §6.7.
  • G. Ji, K. Liu, S. He, and J. Zhao (2017) Distant supervision for relation extraction with sentence-level attention and entity descriptions. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31. Cited by: §1.
  • Y. Li, G. Long, T. Shen, T. Zhou, L. Yao, H. Huo, and J. Jiang (2020) Self-attention enhanced selective gate with entity-aware embedding for distantly supervised relation extraction. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 8269–8276. Cited by: §1, §6.1, §6.3, §6.7.
  • Y. Lin, S. Shen, Z. Liu, H. Luan, and M. Sun (2016) Neural relation extraction with selective attention over instances. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2124–2133. Cited by: §1, §2.
  • T. Liu, K. Wang, B. Chang, and Z. Sui (2017) A soft-label method for noise-tolerant distantly supervised relation extraction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 1790–1795. Cited by: §2.
  • M. Mintz, S. Bills, R. Snow, and D. Jurafsky (2009) Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pp. 1003–1011. Cited by: §1, §2.
  • H. Peng, T. Gao, X. Han, Y. Lin, P. Li, Z. Liu, M. Sun, and J. Zhou (2020) Learning from context or names? an empirical study on neural relation extraction. arXiv preprint arXiv:2010.01923. Cited by: §5, §6.3.
  • S. Riedel, L. Yao, and A. McCallum (2010) Modeling relations and their mentions without labeled text. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 148–163. Cited by: §2.
  • H. Shahbazi, X. Z. Fern, R. Ghaeini, and P. Tadepalli (2020) Relation extraction with explanation. arXiv preprint arXiv:2005.14271. Cited by: footnote 2.
  • L. B. Soares, N. FitzGerald, J. Ling, and T. Kwiatkowski (2019) Matching the blanks: distributional similarity for relation learning. arXiv preprint arXiv:1906.03158. Cited by: §5.
  • Z. Sun, Z. Deng, J. Nie, and J. Tang (2019) RotatE: knowledge graph embedding by relational rotation in complex space. arXiv preprint arXiv:1902.10197. Cited by: §3.
  • D. Zeng, K. Liu, Y. Chen, and J. Zhao (2015) Distant supervision for relation extraction via piecewise convolutional neural networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1753–1762. Cited by: §1.
  • N. Zhang, S. Deng, Z. Sun, G. Wang, X. Chen, W. Zhang, and H. Chen (2019) Long-tail relation extraction via knowledge graph embeddings and graph convolution networks. arXiv preprint arXiv:1903.01306. Cited by: §1.