H-FND: Hierarchical False-Negative Denoising for Distant Supervision Relation Extraction

12/07/2020 ∙ by Jhih-Wei Chen, et al. ∙ The Regents of the University of California Academia Sinica 0

Although distant supervision automatically generates training data for relation extraction, it also introduces false-positive (FP) and false-negative (FN) training instances to the generated datasets. Whereas both types of errors degrade the final model performance, previous work on distant supervision denoising focuses more on suppressing FP noise and less on resolving the FN problem. We here propose H-FND, a hierarchical false-negative denoising framework for robust distant supervision relation extraction, as an FN denoising solution. H-FND uses a hierarchical policy which first determines whether non-relation (NA) instances should be kept, discarded, or revised during the training process. For those learning instances which are to be revised, the policy further reassigns them appropriate relations, making them better training inputs. Experiments on SemEval-2010 and TACRED were conducted with controlled FN ratios that randomly turn the relations of training and validation instances into negatives to generate FN instances. In this setting, H-FND can revise FN instances correctly and maintains high F1 scores even when 50 further conducted to shows that H-FND is applicable in a realistic setting.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.


Relation extraction Zelenko et al. (2003); Mooney and Bunescu (2006); Zhou et al. (2005)

is a core task in information extraction. Its goal is to determine the relation between two entities in a given sentence. For instance, given the sentence “Jobs was born in San Francisco”, with head and tail entities “Jobs” and “San Francisco”, the relation to be extracted is “Place of Birth”. Relation extraction can be applied for many applications, such as question answering and knowledge graph completion.

A major difficulty with supervising relation extraction models is the cost of collecting training data, against which distant supervision (DS) Hoffmann et al. (2011); Surdeanu et al. (2012) is proposed. DS obtains the relational facts from a knowledge base and aligns these facts to all sentences in the corpus to generate learning instances. In specific, if a relation triple exists in a knowledge base, then for a sentence which mentions both the head entity and the tail entity , it is tagged with relation to form a learning instance .

Knowledge base Relation
Steve Jobs, San Francisco PoB
Corpus Relation Type
Jobs was born in San Francisco PoB (✓) TP
Jobs moved back to San Francisco PoB (✗) FP
Manuela was born in New York NA (✗) FN
Table 1: Distant supervision and different types of incorrectly labeled relations. The head and tail entities are shown in boldface, and “PoB” stands for the relation “Place of Birth”.

Although datasets for relation extraction can be generated using distant supervision, they contain considerable noise Roth et al. (2013). Under distant supervision, there are two types of noisy instances: false positives (FP) and false negatives (FN). Table 1 shows an example. The FP “Jobs moved back to San Francisco” should not reflect the relation ‘Place of Birth’. Also, an FN: as there is no relation between “Manuela” and “New York” in the knowledge base, “Manuela was born in New York” is wrongly labeled as a non-relation (NA) under the closed world assumption. Both FP and FN degrade model performance if they are treated as correct labels at training time. FPs harm prediction precision, while excessive FNs lead to low recall rates.

In addition to denoising methods for learning robustly with noisy data Han et al. (2018); Northcutt et al. (2019), many works focus on alleviating the FP problem in DS datasets, including those on pattern-based extraction Alfonseca et al. (2012); Jia et al. (2019), multiple-instance learning Surdeanu et al. (2012); Lin et al. (2016); Zeng et al. (2018)

, and sentence-level denoising with adversarial training or reinforcement learning 

Qin et al. (2018a, b); Feng et al. (2018). However, few investigate the FN problem for distant supervision Xu et al. (2013); Roller et al. (2015)

. To the best of our knowledge, there is no previous study on this problem for deep neural networks.

In this paper, we investigate the impact of FNs on neural-based models and propose H-FND, a hierarchical false-negative denoising framework for robust distant supervision. Specifically, this framework integrates a deep reinforcement learning agent which keeps, discards, or revises probable FN instances with a relation classifier to generate revised relations. In addition, to constrain the study to the FN problem and to construct ground-truth relations to further analyze model behavior, we conduct our research on the following two human-annotated datasets: SemEval-2010 

Hendrickx et al. (2010) and TACRED Zhang et al. (2017), with controlled FN ratios that randomly flip relations of training/validation instances into negatives to generate FN instances. Then, we further conduct our experiment on a distantly supervised dataset NYT10 Riedel et al. (2010) and fix its positive set, to demonstrate that our framework is applicable for resolving FN problem in a realistic setting. In summary, our contributions are three-fold:

  • We propose a denoising framework focused on false negatives in relation extraction.

  • We present a special transfer learning scheme for pretraining denoising agent as training data is not available for this pretraining task.

  • We show that our method revises correctly and maintains high F1 scores even under a high percentage of false negatives, and is applicable in a realistic setting.

We organize the rest of this paper as follows: Section 2 discusses the related work, Section 3 describes our H-FND framework, Section 4 shows the experimental results, and Section 5 concludes.

Related Work

Mintz et al. (2009) propose distant supervision (DS) to automatically generate labeled data for relation classification, a new paradigm that synthesizes positive training data by aligning a knowledge base to an unlabeled corpus, and produces negatives with a closed-world assumption. Although this method requires no human effort for sentence labeling, it introduces FPs and FNs into the generated data and degrades the performance of relation extraction models.

Many previous works have attempted to solve the FP problem. Among these works, denoising methods that utilize reinforcement learning (RL) are the most relevant to ours. Feng et al. (2018) propose a sentence-level denoising mechanism that trains a positive instance selector using RL, and set the RL reward to the prediction probability of the relation classifier. Qin et al. (2018b) also utilizes RL, but in a different way. It learns a denoising agent to redistribute FPs to NA via prediction accuracy of the classifier as the RL reward.

To solve the FN problem, one method is to align the KB to the corpus after performing KB completion using inference Roller et al. (2015). Although this does reduce the number of FNs in DS datasets, it helps little when the FN relations cannot be inferred from the KB, e.g., the entities mentioned in the FN are not in the KB. IRMIE Xu et al. (2013)

, another method, constructs a negative set in a more conservative sense, in which the head or tail entities have already participated in other relation triples in the KB. Other sentences outside the positive and negative sets are left unlabeled (labeled as RAW in original paper) to prevent FNs. After training on the positive and negative sets, positive relation triples are retrieved from the unlabeled set to expand the KB, after which the original DS is performed to improve the quality of relation extraction. The final performance of this method depends heavily on the heuristic for constructing the negative set, which may not be applicable for all possible relation types.

To address the FN problem in DS datasets more generally, we propose a hierarchical denoising method to mitigate the negative effect of FNs, ensuring a more robust relation extraction model when the presence of FN instances is unavoidable.

Figure 1:

H-FND framework. The process in this diagram is executed per epoch.

H-FND Framework

We propose H-FND, a hierarchical false-negative denoising framework that determines whether to keep, discard, or revise negative instances. As illustrated in Fig. 1, H-FND is composed of the denoising agent and relation classifier modules. The denoising agent makes a ternary decision on the action to take on each negative instance, and after discarding, the relation classifier predicts a new relation for each to-be-revised instance to produce a cleaned dataset.

Convolutional Neural Network

Convolutional neural networks (CNN) are commonly adopted for sentence-level feature extraction Kim (2014) in language understanding tasks, such as relation extraction  Zeng et al. (2014); Nguyen and Grishman (2015). PCNNs Zeng et al. (2015)

, a variation of CNN that applies piecewise max pooling, are also widely used for extracting sentence features 

Lin et al. (2016); Qin et al. (2018a). We included both as the base model in our experiments to show that our framework is base model agnostic. In our implementation, the extracted features of a learning instance

are fed into a fully connected softmax classifier to compute the final logits:

For detailed mathematical descriptions of CNN and PCNN, please refer to the Appendix.

Hierarchical Denoising Policy

The proposed hierarchical denoising policy is a framework using policy-based reinforcement learning (RL). Previous work utilizing RL to suppress noise from FPs Feng et al. (2018); Qin et al. (2018b) can be categorized in two types of strategies: the first decides whether to remove the input instance, and the second decides whether to revise the input instance to be negative. Both policies make a binary decision on each input instance, and successfully reduce FP instances in DS datasets.

While applicable on the FP problem, it is risky to directly apply these strategies on the FN problem. First, discarding a negative instance even when it is most likely positive can result in a loss of useful learning instances. Second, changing a negative instance to positive is not enough for the training process: we must also know which type of positive relation to revise to.

Therefore, we propose a hierarchical denoising policy to perform the FN denoising in two steps. The first step, a soft policy that combines the two above-mentioned denoising methods, is an agent that takes an action from the action set {Keep, Discard, Revise} for a negative instance :

  • Keep: maintain as a negative instance for training/validation;

  • Discard: remove to prevent it from misleading the model;

  • Revise: predict a new relation type for and treat it as a positive for the following training/validation.

The policy of this ternary decision is calculated based on the sentence feature extracted from with the base model CNN encoder:

each action has the possibility of of being taken by the denoising agent.

Then, if the negative instance is to be revised, the hierarchical policy goes on to the second step and gives the revised relation by selecting the most likely relation (excluding NA) predicted by the relation classifier:

Figure 2: A special transfer learning scheme for H-FND pretraining. Symbols “P” and “N” represent positive and negative instances for relation classifier pretraining. Symbols “O” and “X” indicate two sets of training instances which are correctly predicted and wrongly predicted by pretrained relation classifier correspondingly.


Supervised pretraining Qin et al. (2018b), commonly used to accelerate RL agent training, is easily performed for the relation classifier on the original DS dataset Han et al. (2018). For the denoising agent, however, there is no available training data. Therefore, we propose a special transfer learning scheme that utilizes the learnt knowledge in the relation classifier (source domain) to help generate action labels for pretraining denoising agent (target domain). (See Fig. 2).

First, we select the positives for which the pretrained relation classifier correctly predicts the relation, and tag these with Revise. This prepares the denoising agent to identify positive instances in the negative set in future training, and then pass these kinds of instances to the relation classifier to predict the correct positive relations for them. Similarly, we tag with Keep those negatives correctly predicted by the relation classifier. Lastly, for instances in which the relation classifier wrongly predicts their relation, we tag them with Discard, encouraging the denoising agent to discard such instances to avoid incorrect revisions.

In summary, our pretraining strategy is thus:

  1. Relation classifier pretraining

    : pretrain the relation classifier (RC) directly on the original training set with the categorical loss function:

    where represents the distantly supervised relation in the training set. Then, fix the parameters of the relation classifier for the next step.

  2. Label generation: generate labels with the predictions of the relation classifier.

  3. Denoising agent pretraining: Supervise the denoising agent (DA) with categorical loss:


To combine the training of the relation classifier and the denoising agent, we propose the following co-training framework during each epoch (see Fig. 1):

  1. Denoising agent decision: At the beginning of each epoch, the denoising agent first executes the denoising policy on the dataset. For both training and validation sets, the policy keeps, discards, or revises NA instances.

  2. Relation classifier revision: For instances to be revised, the relation classifier generates revision relations for them. Denoising yields the cleaned training and validation sets.

  3. Relation classifier training: Given the cleaned training set, we train the relation classifier in a supervised fashion based on categorical loss:

    where represents the modified training set, which contains all the positives and the kept or revised negatives. Note that discarded negatives are not included in .

  4. Reward determination: We evaluate the trained relation classifier on the cleaned validation set to obtain the F1 score, which we use as reward for denoising. As the validation set is cleaned by the denoising policy, reflects the efficacy of the policy.

  5. Denoising policy update: To maximize the reward , we adopt policy gradient Sutton et al. (2000) to optimize the denoising agent by maximizing the objective function :

    where is the parameter of the denoising policy, represents the softmax probability of the sampled determination or revision step, and

    is the baseline which mitigates the high variance of the REINFORCE algorithm 

    Williams (1992). We set to the average reward of the previous five epochs.

For each epoch, we obtain the revised set from the original training/validation set via the denoising policy, and H-FND finds the best denoising policy adaptively between supervised training and reward maximization.

Figure 3:

CNN and PCNN results on SemEval and TACRED, where the errorbars represent the standard deviations. The denoising method cleanlab and our method H-FND perform the best, but cleanlab requires a given noise rate of data, while H-FND does not requires such information.

Figure 4: CNN and PCNN ablation analysis on SemEval and TACRED, where the errorbars represent the standard deviations.


In order to quantify our model’s performance on denoising false negatives. We first evaluated the proposed H-FND on human-annotated datasets SemEval-2010 Hendrickx et al. (2010) and TACRED Zhang et al. (2017) with controlled FN ratios. Then, we evaluate H-FND on a DS dataset NYT10 Riedel et al. (2010) to evaluate its performance in a more realistic setting.

Table 2 shows the statistics of each dataset used in the experiments. For more information of the datasets and the preprocessing procedure, please refer to Appendix.

Datasets #training #validation #testing
SemEval 6,599 1,154 2,717
TACRED 63,782 20,088 15,509
NYT10 477,454 120,318 194,328
Table 2: Number of instances in each dataset

Baselines and Experiment Settings

A simple H-FND baseline was the original CNN and PCNN relation classifier. To demonstrate the impact of FNs, we also included SelATT Lin et al. (2016), an FP noise resistant model.

We further compared our H-FND framework with the following strong baselines: the FN denoising method IRMIE Xu et al. (2013) and two other general-purpose denoising methods: co-teaching Han et al. (2018) and cleanlab Northcutt et al. (2019)

. Co-teaching is a general training method for deep neural networks to combat extremely noisy labels. It simultaneously maintains two networks (each with the same structure), each of which samples its small-loss instances with a given overall noise rate as clean batches to its peer networks for further training. Cleanlab is a state-of-the-art robust learning method which directly estimates the joint distribution of noisy observed labels and latent uncorrupted labels with a consistent estimator, filters out noisy instances based on this joint distribution, and trains the relation classifier on the cleaned dataset with co-teaching mentioned above. We use these denoising methods to train the base CNN and PCNN models on our simulated FN datasets.

111The IRMIE KB was reconstructed from the positives of the simulated FN dataset.

As the focus of this paper is on the FN problem, and therefore all the positives of the simulated FN datasets are kept error-free, the H-FND framework assumes that no positives need be changed. Hence, for a fair comparison, we kept the positive sets of the FN datasets unchanged for the two general-purpose denoising methods, preventing them from discarding error-free positives. Also, we fix the positive set of NYT10 to evaluate the applicability of H-FND of resolving FN problem in a realistic setting.

In the experiments on SemEval and Tacred, every data point is the average of five independent runs. In the experiment of NYT10, some RL training is not stable, which might resulte from the excessive amount of FPs in NYT10. For a fair comparison, the included data points are the average of three best results out of five independent runs for H-FND and the baselines. See Appendix for more detailed information on experiment and model implementation.

Quantitative Results

The quantitative SemEval results are shown in the upper part of Fig. 3, including both CNN and PCNN. Under the 50% FN ratio, for both the base CNN and PCNN models, with or without SelATT, the F1 scores are heavily influenced by FN sentences: the performance drops by nearly 20%. ERMIE and co-teaching enhance the performance by more than 5% and 8% correspondingly. Except for cleanlab, H-FND denoising remains competitive to the baselines for FN ratios from 10% to 30%, and significantly wins after 30%. Among all baselines, cleanlab’s performance is the strongest and is competitive with our approach, but as cleanlab relies on a co-teaching model to train the relation classifier, a given noise rate is required. In our experiments, these are directly provided to the model. However, in practice, the noise rate (the FN ratios in our experiment) is unknown and must be estimated correctly, entailing extra effort. In contrast, H-FND has no such requirement.

The quantitative results on TACRED are shown in the lower part of Fig. 3. CNN, PCNN, and the two models with SelATT are all vulnerable to FN instances. As IRMIE fails to exclude enough FNs from the negative set on TACRED,222The size of the RAW set is less than 10% of the original negative set under all FN ratios. its performance is also strongly influenced by FN instances. Although the F1 scores of H-FND are 2% behind co-teaching and cleanlab for FN ratios from 0% to 20%, it successfully maintains its performance when the FN ratio exceeds 30% and becomes competitive with these two baselines. This is similar to the experimental results on SemEval for FN ratios less than 30%. Together with the fact that TACRED has many more positives than SemEval, we increased the FN ratio to 90%. The result of this extended experiment shows that when the FN ratio exceeds 60%, the F1 scores for co-teaching drop significantly, whereas H-FND maintains a relatively high F1 score. Here, again, although cleanlab performs similar to ours with the pre-defined FN ratios,333We have measured the performance of cleanlab when it was provided with a wrong FN ratio - 40% FN ratio. Under the actual FN ratio of 80% , its F1 scores dropped by 0.5% for CNN and 1.8% for PCNN. the proposed approach needs no such information, which better fits real-world circumstances of distant-supervised relation classification.

Ablation Study

Fig. 4 shows the result of the ablation study to justify the effectiveness of the Revise action and pretraining strategy. On Semeval, pretraining boosts the F1 score for the PCNN architecture for FN ratios from 10% to 40%, but yields no significant difference for the other ratios. On TACRED, however, the Revise action and the pretraining strategy clearly yield improved results. This improvement is substantial in particular for pretraining. As TACRED has more positive relation types and a much larger negative set, the FN denoising problem is more severe than on SemEval; thus the pretraining strategy is crucial to provide a better initial point for the denoising agent and to ensure more stable performance.

Figure 5: Denoising policy distribution on true negatives and false negatives.
Figure 6: Precision-recall curve on the NYT dataset. The shaded areas indicate one standard deviation. The precision rate of each algorithm run drops to zero at certain recall rate, hence the steep drops in the curves.

Detailed Analysis

We first analyzed the distribution of the denoising policy for TN and FN instances in the training set. Figure 5 shows the percentage of kept, discarded, or revised training instances. The left histogram under each filter ratio is for TN; the right is for FN.

On SemEval, we observe that for TN instances, H-FND mainly keeps them as NA and revises only a small portion to the wrong relation, even under the 50% filter ratio. For FN instances, H-FND prefers to discard or revise them. This difference shows that H-FND distinguishes FN instances from TN instances, and does not take arbitrary actions on them.

On TACRED, the policy distribution also shares the same tendency, but the portion of kept instances is generally larger. This is due to a higher ratio of negative instances in TACRED. As more negative instances result in more Keep labels in the generated pretraining data, after pretraining, the probability of the model taking the Keep action is generally higher. It also explains that the portion of kept instances grows when the filter ratio is raised. Note that this prevents H-FND from revising too many instances at the beginning of co-training, making co-training more stable.

Table 3 show the correctness of revisions on FN instances which are determined to be revised. The accuracy is around 90% for both CNN and PCNN architectures and for both SemEval and TACRED. This shows that H-FND accurately corrects FN instances once they are identified and determined to be revised in the first stage.

SemEval 10% 20% 30% 40% 50%
CNN 88.81 88.07 88.72 86.25 85.94
3.55 7.80 1.04 1.53 1.46
PCNN 89.54 88.04 86.83 90.31 84.17
1.98 3.12 2.11 0.54 3.56
TACRED 10% 20% 30% 40% 50%
CNN 91.43 90.62 90.89 91.65 93.39
0.72 0.99 0.63 0.99 1.79
PCNN 90.99 89.64 87.15 86.75 86.15
0.82 0.39 0.49 0.60 1.16
Table 3: Revision accuracy (%)

Results on Realistic Dataset

Lastly, we evaluated H-FND on NYT10 to gain an understanding of our framework’s performance on real DS datasets. For baselines, apart from the base model, we included cleanlab, as it is the best performing baseline in the controlled FN experiments. In the training set of NYT10, we conducted human evaluation on 200 randomly sampled instances and came to an estimate of 14% of FN in the negative instances.

We followed Zeng et al. (2015) and plotted the precision-recall curve to demonstrate the result on NYT10 (see Fig. 6). At recall rate lower than 40% cleanlab performs slightly worse than the base model, while H-FND remains competitive in terms of precision. This could be a result of inaccuracies in the estimation of FN rate in the dataset. Since H-FND does not require a given FN rate, it is not encumbered by such estimation error. At higher recall rates ( 50%), H-FND retains significantly higher precision. This result shows that H-FND is applicable for real DS datasets, especially when the recall rate matters.

Conclusion and Future Work

In this work, to increase the robustness of distant supervision, we present H-FND, a hierarchical false-negative denoising framework, which keeps, discards, or revises non-relation (NA) inputs during training and validation phases to suppress noise from FN instances and yield a clean dataset for relation classifiers to learn from. We also present a special transfer learning scheme for pretraining the denoising agent.

To investigate the effects of FN instances addressed by our approach, we generate FN instances from SemEval-2010 and TACRED by replacing relations of instances with NA under controlled ratios. The experimental results show that H-FND revises FN instances to the appropriate relations and facilitates robust relation extraction. Further experiment on NYT10 demonstrates that our framework is applicable to real world DS denoising. This framework can be applied on tasks such as knowledge base enrichment task, where a large corpus is aligned to an incomplete knowledge base.

For large distant supervised corpora, both FP and FN instances may emerge simultaneously. Both of which should be addressed for a optimal results. This is a challenging but very practical setting. We leave this as future work. Also, we plan to attempt other advanced relation classification approach like R-BERT Wu and He (2019) to replace CNN or PCNN in our architecture.

The source code will be released on https://github.com


  • P. Adam, G. Sam, C. Soumith, C. Gregory, Y. Edward, D. Zachary, L. Zeming, D. Alban, A. Luca, and L. Adam (2017)

    Automatic differentiation in PyTorch


    Neural Information Processing Systems workshop: The future of gradient-based machine learning software & techniques

    Cited by: Appendix A.
  • E. Alfonseca, K. Filippova, J. Delort, and G. Garrido (2012) Pattern learning for relation extraction with a hierarchical topic model. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, Cited by: Introduction.
  • J. Feng, M. Huang, L. Zhao, Y. Yang, and X. Zhu (2018) Reinforcement learning for relation classification from noisy data. In

    Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence

    Cited by: item 2, Introduction, Related Work, Hierarchical Denoising Policy.
  • B. Han, Q. Yao, X. Yu, G. Niu, M. Xu, W. Hu, I. Tsang, and M. Sugiyama (2018) Co-teaching: robust training of deep neural networks with extremely noisy labels. In Advances in neural information processing systems, pp. 8527–8537. Cited by: Appendix A, Introduction, Pretraining, Baselines and Experiment Settings.
  • I. Hendrickx, S. N. Kim, Z. Kozareva, P. Nakov, D. Ó Séaghdha, S. Padó, M. Pennacchiotti, L. Romano, and S. Szpakowicz (2010)

    SemEval-2010 task 8: multi-way classification of semantic relations between pairs of nominals. In Proceedings of the 5th International Workshop on Semantic Evaluation, pp. 33–38. Cited by: Introduction, Experiment.
  • R. Hoffmann, C. Zhang, X. Ling, L. Zettlemoyer, and D. S. Weld (2011) Knowledge-based weak supervision for information extraction of overlapping relations. In Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, pp. 541–550. Cited by: Introduction.
  • M. Honnibal and M. Johnson (2015) An improved non-monotonic transition system for dependency parsing. In

    Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

    pp. 1373–1378. Cited by: Appendix A.
  • W. Jia, D. Dai, X. Xiao, and H. Wu (2019) ARNOR: attention regularization based noise reduction for distant supervision relation classification. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 1399–1408. Cited by: Introduction.
  • Y. Kim (2014) Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, Cited by: Convolutional Neural Network.
  • D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations, Cited by: Appendix A.
  • Y. Lin, S. Shen, Z. Liu, H. Luan, and M. Sun (2016) Neural relation extraction with selective attention over instances. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pp. 2124–2133. Cited by: item 1, item 2, Introduction, Convolutional Neural Network, Baselines and Experiment Settings.
  • M. Mintz, S. Bills, R. Snow, and D. Jurafsky (2009) Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pp. 1003–1011. Cited by: item 1, Related Work.
  • R. J. Mooney and R. C. Bunescu (2006) Subsequence kernels for relation extraction. In Advances in neural information processing systems, pp. 171–178. Cited by: Introduction.
  • T. H. Nguyen and R. Grishman (2015) Relation extraction: perspective from convolutional neural networks. In

    Proceedings of the NAACL Workshop on Vector Space Modeling for Natural Language Processing

    Cited by: Appendix A, Convolutional Neural Network.
  • C. G. Northcutt, L. Jiang, and I. L. Chuang (2019) Confident learning: estimating uncertainty in dataset labels. arXiv preprint arXiv:1911.00068. Cited by: Introduction, Baselines and Experiment Settings.
  • P. Qin, W. Xu, and W. Y. Wang (2018a) DSGAN: Generative adversarial training for distant supervision relation extraction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Cited by: Introduction, Convolutional Neural Network.
  • P. Qin, W. Xu, and W. Y. Wang (2018b) Robust distant supervision relation extraction via deep reinforcement learning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Cited by: item 2, Introduction, Related Work, Hierarchical Denoising Policy, Pretraining.
  • S. Riedel, L. Yao, and A. McCallum (2010) Modeling relations and their mentions without labeled text. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 148–163. Cited by: Introduction, Experiment.
  • R. Roller, E. Agirre, A. Soroa, and M. Stevenson (2015) Improving distant supervision using inference learning. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pp. 273–278. Cited by: Introduction, Related Work.
  • B. Roth, T. Barth, M. Wiegand, and D. Klakow (2013) A survey of noise reduction methods for distant supervision. In Proceedings of the 2013 workshop on Automated knowledge base construction, pp. 73–78. Cited by: Introduction.
  • M. Surdeanu, J. Tibshirani, R. Nallapati, and C. D. Manning (2012) Multi-instance multi-label learning for relation extraction. In Proceedings of the 2012 joint conference on empirical methods in natural language processing and computational natural language learning, pp. 455–465. Cited by: Introduction, Introduction.
  • R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour (2000) Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pp. 1057–1063. Cited by: item 5.
  • R. J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8 (3-4), pp. 229–256. Cited by: item 5.
  • S. Wu and Y. He (2019) Enriching pre-trained language model with entity information for relation classification. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 2361–2364. Cited by: Conclusion and Future Work.
  • W. Xu, R. Hoffmann, L. Zhao, and R. Grishman (2013) Filling knowledge base gaps for distant supervision of relation extraction. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 665–670. Cited by: Introduction, Related Work, Baselines and Experiment Settings.
  • D. Zelenko, C. Aone, and A. Richardella (2003) Kernel methods for relation extraction. Journal of machine learning research 3, pp. 1083–1106. Cited by: Introduction.
  • D. Zeng, K. Liu, Y. Chen, and J. Zhao (2015) Distant supervision for relation extraction via piecewise convolutional neural networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Cited by: item 3, Convolutional Neural Network, Results on Realistic Dataset.
  • D. Zeng, K. Liu, S. Lai, G. Zhou, and J. Zhao (2014) Relation classification via convolutional deep neural network. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pp. 2335–2344. Cited by: Convolutional Neural Network.
  • X. Zeng, S. He, K. Liu, and J. Zhao (2018) Large scaled relation extraction with reinforcement learning.. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, pp. 5658–5665. Cited by: Introduction.
  • Y. Zhang, V. Zhong, D. Chen, G. Angeli, and C. D. Manning (2017) Position-aware attention and supervised data improve slot filling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 35–45. Cited by: Introduction, Experiment.
  • G. Zhou, J. Su, J. Zhang, and M. Zhang (2005) Exploring various knowledge in relation extraction. In Proceedings of the 43rd annual meeting of the association for computational linguistics, pp. 427–434. Cited by: Introduction.

Appendix A Appendices

Convolutional Neural Network

We use a convolutional neural network (CNN) Nguyen and Grishman (2015) as our base model for both the denoising agent and the relation classifier. This architecture consists of four main layers (the first three layers compose the CNN encoder):

  1. Embedding: The embedding layer transforms a word into a vector representation, which is a concatenation of a word embedding and a pair of positional embedding vectors ,  Lin et al. (2016). Word embedding is a vector that represents the semantics of a word, and positional embedding pair is two vectors representing the relative distance from the current word to two entities in the sentence.

    The final embedding vector of dimension for each word is the concatenation of , , and :

  2. Convolution: The convolutional layer transforms the embedding vectors of words into local features by applying sliding filters over them. Each filter consists of a weight matrix and a bias term , to extract specific patterns in the embedding vectors. With filters of length , the entry in the feature map for the -th filter at position is

    where is the length of the input sentence. To capture information expressed in phrases of all lengths, we further use different lengths of filters, and concatenate all under filter size as the jointed feature map :

  3. Max pooling: The max pooling layer captures the most significant feature into the pooling feature by selecting the highest value in the feature map extracted by the -th filter over all positions:

    PCNN Zeng et al. (2015) involves piecewise max pooling, which better suits the relation extraction task. It divides an input sentence into three segments based on the two selected entities, and then extracts features from all the three segments to capture fine-grained features for relation extraction. For PCNN, the extracted feature map

    where , , and are the three feature map segments separated by the two selected entities. We also view as the sentence feature, as it represents the essential features of the whole sentence.

  4. Fully connected: The fully connected layer (FC) performs relation classification based on sentence feature with softmax activation over each relation. The computed logits is written as

#(Params) SemEval TACRED NYT
RC CNN 1,318,130 1,347,602 1,388,933
PCNN 1,336,530 1,424,882 1,486,453
RC+SelATT CNN 1,327,330 1,386,242
PCNN 1,364,130 1,540,802
DA CNN 1,311,683 1,311,683 1,342,883
PCNN 1,317,203 1,317,203 1,348,403
Table 4: Number of trainable parameters in each model.


  1. Human-Annotated Datasets: SemEval-2010444http://www.kozareva.com/downloads.html contains nine relations with an additional NA as a non-relation, and the number of instances for each relation is roughly equal. TACRED555https://catalog.ldc.upenn.edu/LDC2018T24 is about 10 times larger than SemEval, and it has 42 relations including NA, and the number of negative instances accounts for 80% of the entire corpus. For SemEval, we used 10% of the training set for validation, and for TACRED we simply used the dev set as the validation set (see Table 2).

    We filtered out the training and validation instances which had relation triples that appeared in the testing set to eliminate any overlap between relation triples in the training, validation, and testing sets, to simulate the held-out evaluation settings in distant supervision Mintz et al. (2009).

    To simulate FN conditions, we randomly filtered a ratio (10%–50%) of training/validation positives into negatives. Note that the filtering process was only for training/validation: the testing sets were well-labeled under all FN ratios. Also note that the models were not aware in advance which sentences were TN and which were FN.

  2. Distantly Supervised Dataset: The NYT10 dataset666http://iesl.cs.umass.edu/riedel/ecml/ uses Freebase as knowledge base for distant supervision. The relations are extracted from a December 2009 snapshot of Freebase. Four categories of Freebase relations are used: “people”, “business”, “person”, and “location”. These types of relations are chosen because they appear frequently in the newswire corpus. All pairs of Freebase entities that are at least once mentioned in the same sentence are chosen as candidate relation instances. For consistency with previous research Lin et al. (2016); Feng et al. (2018); Qin et al. (2018b), we excluded five relations:

    This results in a total of 53 relations (including none-relation, ’NA’).

    The corpus is chosen from a external source articles published by The New York Times between January 1, 1987 and June 19, 2007. The Freebase relations were divided into two parts, one for training and one for testing. The former is aligned to the years 2005-2006 of the NYT corpus, the latter to the year 2007.

    Note that the FN instance in Table 1 is a real example in NYT10. The original sentence is: “Born in Astoria, New York on July 19, 1924, Manuela was a long term resident of East Rockaway, New York, a graduate of City College, founder and partner in BFW Management, and dedicated long term volunteer and employee of the Helen Keller Services for the Blind.” The head and tail entities are Manuela and New York. Their relation should be ‘/people/person/place_of_birth’, but is labeled as NA in NYT10.


H-FND was implemented with PyTorch 1.6.0 Adam et al. (2017) in python 3.6.9. In our implementation, we used pretrained word embeddings provided by SpaCy Honnibal and Johnson (2015) as the fixed word embeddings (). The positional embedding () was randomly initialized and then trained with the following network; therefore the overall dimension of embedding vector . In the convolutional layer, we applied four different sizes of filters ( [2, 3, 4, 5]) and set all of their feature sizes to . Both CNN and PCNN architectures were implemented. The total trainable parameters of each models are listed in table 4. To prevent overfitting, we inserted dropout layers with a dropout rate of 0.5 before the convolutional layer and after the max pooling layer.

We trained H-FND using the Adam optimizer Kingma and Ba (2015). In addition, we used mini-batches (batch size ) only when training the relation classifier; the prediction of the relation classifier and both the decision and policy gradient of the denoising agent were executed per epoch. Last, the revised result of H-FND in each epoch was used by the classifier only in the same epoch and did not accumulate over epochs, which means that at the beginning of each epoch, H-FND applied the denoising policy on the original dataset but not on the revised dataset of the last epoch.

We list in Table 5 the learning rates for base CNN and PCNN relation classifiers (RC), for RC with SelATT, and for RC with denoising agent (DA) under pretraining and co-training phrases. The learning rate of RC is selected from {1e-4, 3e-4, 1e-3, 3e-3, 1e-2}, with the F1 score on the noise-free version of SemEval and TACRED as the selection criteria. Except SelATT and DA cotraining, the learning rates for the other models are the same to the learning rate of base RC. For SelATT, the learning rate is selected from {1e-6, 3e-6, 1e-5, 3e-5, 1e-4}, also with the F1 score on the noise-free version of the two datasets as the selection criteria. For DA cotraining, the learning rate is selected from {1e-6, 3e-6, 1e-5, 3e-5, 1e-4}, with the F1 score on the SemEval and TACRED under a 50% FN ratio as the selection criteria.

All the RC of each method are trained to converge with validation-based early stopping. In specific, we train all the model for 150 epochs on SemEval and for 200 epochs on TACRED. For NYT, we trained all the odels for 30 epochs.

The pretraining of H-FND trains the RC and DA for 5 and 20 epochs respectively. We select these pretraining periods by the criteria that the two models can achieve about 80% performance comparing to the converged ones. By this means, we can prevent H-FND from overfitting the noisy labels Han et al. (2018) and initialize H-FND with good parameters for co-training.

All the implemented models are trained on NVIDIA GTX 1080 Ti and Intel(R) Xeon(R) Silver 4110 CPU, with 12GN GPU memory, 128GB RAM, clock rate 2.10 GHz, and Linux as the operating system. The expected running time for each model on each dataset is listed in Table 6.

Learning rate SemEval TACRED NYT
3e-3 3e-4 3e-4
1e-5 3e-6
3e-3 3e-4 3e-4
3e-3 3e-4 3e-4
3e-3 3e-4 3e-4
1e-4 3e-6 3e-6
Table 5: Learning rates.
Runtime SemEval TACRED NYT
Base 0.05 0.63 3.25
SelAtt 1.95 22.70
IRMIE 0.05 0.67
Co-teaching 0.10 1.10
Cleanlab 0.25 6.47 16.25
H-FND 0.55 15.28 44.44
Table 6: Runtimes for models training (hrs).

Performance on Validation Set

The F1 scores of each model running on validation sets of SemEval and TACRED are provided in Figure 7 and 8. Notice that the validation sets are noisy in our experiment, so the performance on validation sets do not fully reflect the robustness of each models. Also, in IRMIE and H-FND, the validation sets are modified, so their validation F1 scores can only be compared with their own across different FN ratios. For more accurate performance measurement, please refer to Figure 3 and 4, whose F1 scores are measured on noise-free testing sets.

Figure 7: Validation F1 scores of quantitative result, where the errorbars represent the standard deviations.
Figure 8: Validation F1 scores of ablation analysis, where the errorbars represent the standard deviations.

Denoising policy with Standard Deviations

On SemEval and TACRED, the Denoising policy distributions with standard deviation are provided in Table 7, 8, 9, and 10.

CNN on SemEval 0% 10% 20% 30% 40% 50%
TN/Keep 64.89 6.55 74.04 8.37 87.68 9.74 79.17 10.65 83.03 6.65 76.83 11.19
TN/Discard 24.88 9.59 17.82 5.55 8.81 7.28 12.56 6.53 12.34 4.84 11.17 3.15
TN/Revise 10.23 5.00 8.14 5.90 3.51 2.57 8.27 4.47 4.63 1.93 12.00 9.25
FN/Keep 0.00 0.00 19.04 5.65 55.03 26.98 45.97 24.85 50.91 13.15 45.24 11.06
FN/Discard 0.00 0.00 61.58 10.16 35.29 22.58 38.33 17.48 38.81 10.05 34.70 5.73
FN/Revise 0.00 0.00 19.38 9.77 9.68 5.52 15.70 7.49 10.28 3.75 20.05 10.99
Table 7: Denoising policy distribution for CNN on SemEval (%).
PCNN on SemEval 0% 10% 20% 30% 40% 50%
TN/Keep 73.57 6.19 77.79 4.11 77.74 3.93 73.61 5.11 80.72 6.08 82.95 4.45
TN/Discard 21.63 4.49 16.55 4.71 16.24 3.78 18.76 5.51 13.25 4.96 12.15 4.21
TN/Revise 4.79 2.08 5.67 1.89 6.01 2.16 7.62 2.14 6.04 2.17 4.91 0.68
FN/Keep 0.00 0.00 25.37 5.33 36.46 7.12 38.25 5.81 52.36 9.94 60.38 8.54
FN/Discard 0.00 0.00 62.62 8.65 51.12 8.63 48.45 8.44 36.59 8.87 31.31 8.13
FN/Revise 0.00 0.00 12.00 3.51 12.42 3.42 13.30 3.36 11.05 3.10 8.31 1.42
Table 8: Denoising policy distribution for PCNN on SemEval (%).
CNN on TACRED 0% 10% 20% 30% 40% 50%
TN/Keep 80.48 3.49 85.18 1.36 84.07 4.66 89.14 1.38 90.81 1.95 94.11 2.23
TN/Discard 13.99 2.76 10.99 1.16 11.78 2.36 8.50 1.18 7.60 1.61 5.00 1.92
TN/Revise 5.54 0.84 3.82 0.32 4.15 2.44 2.35 0.37 1.59 0.37 0.90 0.35
FN/Keep 0.00 0.00 36.53 2.21 40.36 1.88 47.42 3.88 53.60 4.16 66.31 7.85
FN/Discard 0.00 0.00 32.40 3.08 34.23 3.38 32.12 3.42 31.86 2.45 24.60 5.73
FN/Revise 0.00 0.00 31.07 1.70 25.42 1.79 20.46 1.86 14.54 1.83 9.09 2.31
Table 9: Denoising policy distribution for CNN on TACRED (%).
PCNN on TACRED 0% 10% 20% 30% 40% 50%
TN/Keep 85.31 0.45 85.91 3.20 88.05 2.73 88.60 2.54 90.63 2.23 92.46 1.57
TN/Discard 10.30 0.53 10.55 2.97 9.10 2.39 9.01 2.07 7.41 2.00 6.27 1.40
TN/Revise 4.39 0.34 3.54 0.32 2.85 0.53 2.39 0.70 1.96 0.31 1.26 0.25
FN/Keep 0.00 0.00 39.10 4.23 45.62 4.58 48.53 6.01 57.50 3.90 64.01 4.44
FN/Discard 0.00 0.00 33.03 5.09 31.68 3.64 33.32 4.97 26.67 3.58 25.55 3.42
FN/Revise 0.00 0.00 27.87 1.88 22.70 1.86 18.15 2.82 15.83 1.44 10.45 1.59
Table 10: Denoising policy distribution for PCNN on TACRED (%).