Looking Beyond Label Noise: Shifted Label Distribution Matters in Distantly Supervised Relation Extraction

04/19/2019 ∙ by Maosen Zhang, et al. ∙ University of Southern California and University of Illinois at Urbana-Champaign

In recent years there has been a surge of interest in applying distant supervision (DS) to automatically generate training data for relation extraction. However, despite extensive efforts on constructing advanced neural models, our experiments reveal that these neural models demonstrate only similar (or even worse) performance compared with simple, feature-based methods. In this paper, we conduct a thorough analysis to answer the question: what other factors limit the performance of DS-trained neural models? Our results show that shifted label distribution commonly exists in real-world DS datasets, and its impact is further validated on synthetic datasets for all models. Building upon this new insight, we develop a simple yet effective adaptation method, called bias adjustment, to update models learned over the source domain (i.e., the DS training set) with label distribution statistics estimated on the target domain (i.e., the evaluation set). Experiments demonstrate that bias adjustment achieves consistent performance gains on all methods, especially on neural models, with up to a 22% relative F1 improvement.




1 Introduction

Figure 1: Left: Label distributions of KBP (distantly supervised dataset) are shifted, while those of TACRED (human-annotated dataset, 6 classes shown for illustration) are consistent. Right: each label is categorized into intervals along the x-axis according to the degree of its train/test distribution shift; the height of each bar indicates the proportion of instances falling into that interval, revealing that the shift is severe in distantly supervised datasets such as KBP and NYT.

Relation extraction (RE) is an important task in natural language processing which aims to transform massive text corpora into structured knowledge, i.e., identifying pairs of entities and their relations in sentences. To reduce the reliance on human-annotated data, especially for data-hungry neural models Zeng et al. (2014); Zhang et al. (2017), there have been extensive studies on leveraging distant supervision (DS) in conjunction with external knowledge bases to automatically generate large-scale training data Mintz et al. (2009); Zeng et al. (2015). While recent DS-based relation extraction methods focus on constructing models that can deal with label noise Riedel et al. (2010); Hoffmann et al. (2011); Lin et al. (2016), i.e., false-positive labels introduced by the error-prone DS process, it has been shown that simple, feature-based methods can achieve comparable or even stronger performance on current DS datasets Ren et al. (2017); Liu et al. (2017). Therefore, it is desirable to understand why these complex, more expressive neural models cannot achieve superior results on DS datasets.

In this paper, we conduct a thorough analysis over both real-world and synthetic datasets to answer the question posed above, and develop a simple and general model adaptation technique, based on insights from the analysis, to augment all models on DS datasets. Our analysis starts with a performance comparison of recent relation extraction methods on both DS datasets (i.e., KBP (Ellis et al., 2012) and NYT (Riedel et al., 2010)) and a human-annotated dataset (i.e., TACRED (Zhang et al., 2017)), with the goal of seeking models that consistently yield strong results. However, we observe that, on the human-annotated dataset, neural relation extraction models outperform feature-based models by notable gaps, but such gaps diminish when the same models are applied to DS datasets, where neural models merely achieve performance comparable with feature-based models. We endeavor to analyze the underlying problem that leads to this unexpected "diminishing" phenomenon.

Supported by the analysis results on both real-world and synthetic DS datasets, we reveal an important issue of DS datasets, shifted label distribution, which refers to the problem that the label distribution of the training set does not align with that of the test set. In practice, there often exists a large gap between the label distribution of a distantly supervised training set and that of a human-annotated evaluation set, as shown in Figure 1. Such label distribution shift is mainly caused by the imbalanced data distribution of external knowledge bases, and by the false-positive and false-negative labels generated by the error-prone DS process. To some extent, such distortion is a special case of the general domain shift issue, i.e., training the model on a source domain and applying the learned model to a different target domain.

To address the shifted label distribution issue, we develop a simple and general domain adaptation method, called bias adaptation, which modifies the bias term in the classifier and explicitly fits the model to the label distribution shift. Specifically, the proposed method estimates the label distribution of the target domain and derives the adapted prediction under reasonable assumptions. In our experiments, the performance improvement observed on all models after such adjustment further validates that model performance may be severely hindered by label distribution shift. The proposed method is general and can be easily integrated into any model that uses a softmax classifier.

In the rest of the paper, we first introduce the problem setting and report the inconsistency of model performance between human annotations and distant supervision. We then present two threshold techniques that we found to be effective under distant supervision, which lead us to the problem of shifted label distribution. We further explore its impact on synthetic datasets in Section 4, and introduce the bias adjustment method in Section 5. In addition, we compare the denoising method, heuristic thresholds, and bias adjustment in Section 6.


2 Experiment Setup

For a fair and meaningful comparison, we ensure the same experimental setup in all experiments. All implementations will be available online.

Dataset          | KBP   | NYT    | TACRED
#Relation Types  | 7     | 25     | 42
#Train Sentences | 23784 | 235982 | 37311
#Test Sentences  | 289   | 395    | 6277
Table 1: Statistics of the datasets used in our study. KBP and NYT are distantly supervised; TACRED is human-annotated.

2.1 Problem Setting

Following previous work Liu et al. (2017); Ren et al. (2017), we conduct relation extraction at the sentence level. Formally, the basic unit is the relation mention, which is composed of one sentence and one ordered entity pair within the sentence. The relation extraction task is to categorize each relation mention into a given set of relation types, or into a Not-Target-Type (None).

2.2 Datasets

We select three popular relation extraction datasets as benchmarks. Two of them are distantly supervised and one is human-annotated. Statistics of the datasets are summarized in Table 1.

KBP (Ling and Weld, 2012) uses Wikipedia articles annotated with Freebase entries as train set, and manually-annotated sentences from 2013 KBP slot filling assessment results (Ellis et al., 2012) as test set.

NYT (Riedel et al., 2010) contains New York Times news articles whose training annotations were generated heuristically; its test set was constructed manually by Hoffmann et al. (2011).

TACRED (Zhang et al., 2017) is a large-scale crowd-sourced dataset, substantially larger than other manually annotated datasets.

2.2.1 Pre-Processing

We leverage pre-trained GloVe embeddings (Pennington et al., 2014) (http://nlp.stanford.edu/data/glove.840B.300d.zip), and use the StanfordNLP toolkit (Manning et al., 2014) to obtain part-of-speech (POS) tags, named-entity recognition (NER) tags, and dependency parse trees.

As for the development set, we use the provided dev set on TACRED, and randomly split 10% of training set as dev set on KBP and NYT.

2.3 Models

We consider two popular classes of relation extraction models, i.e., feature-based models and neural models. For each relation mention, these models first construct a representation z and then make predictions based on z. All models, except CoType-RM, adopt a softmax layer to make predictions:

p(l_i | z) = exp(w_i^T z + b_i) / Σ_j exp(w_j^T z + b_j)    (1)

where w_i and b_i are the parameters corresponding to the i-th relation type.
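As a concrete sketch of this softmax layer (the dimensions and values below are illustrative, not taken from the paper):

```python
import numpy as np

def softmax_predict(z, W, b):
    """Score each relation type as w_i^T z + b_i and normalize (Equation 1).

    z : (d,)   relation mention representation
    W : (K, d) per-type weight vectors w_i
    b : (K,)   per-type bias terms b_i
    Returns a (K,) vector of probabilities p(l_i | z).
    """
    logits = W @ z + b
    logits = logits - logits.max()   # subtract max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

# toy example with d = 4 features and K = 3 relation types
rng = np.random.default_rng(0)
p = softmax_predict(rng.normal(size=4), rng.normal(size=(3, 4)), np.zeros(3))
assert abs(p.sum() - 1.0) < 1e-9 and (p > 0).all()
```

The bias terms b_i are the quantities that the bias adjustment method in Section 5 later modifies.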

Method / Dataset | KBP | NYT | TACRED
Feature-based:
CoType-RM (Ren et al., 2017)  | 28.98 ± 0.76 | 40.26 ± 0.51 | 45.97 ± 0.34
ReHession (Liu et al., 2017)  | 36.07 ± 1.06 | 46.79 ± 0.75 | 58.31 ± 0.67
Neural:
CNN (Zeng et al., 2014)       | 30.53 ± 2.26 | 46.75 ± 2.79 | 56.96 ± 0.43
PCNN (Zeng et al., 2015)      | 33.15 ± 0.93 | 44.63 ± 2.70 | 58.39 ± 0.71
SDP-LSTM (Xu et al., 2015)    | 32.70 ± 1.84 | 43.11 ± 0.59 | 61.57 ± 0.60
Bi-GRU                        | 37.77 ± 0.18 | 47.88 ± 0.85 | 66.31 ± 0.33
Bi-LSTM                       | 34.51 ± 0.99 | 48.15 ± 0.87 | 62.58 ± 0.47
PA-LSTM (Zhang et al., 2017)  | 37.28 ± 0.81 | 46.33 ± 0.64 | 65.69 ± 0.48
Table 2: Performance comparison of RE models (F1 score ± std, in percentage). KBP and NYT are distantly supervised; TACRED is human-annotated.

2.3.1 Feature-based models

We include two recent feature-based models, i.e., ReHession Liu et al. (2017) and CoType Ren et al. (2017). For each relation mention, these two methods first extract a list of features; detailed descriptions of these features are given in Appendix A Table 4.

CoType-RM is a variant of CoType (Ren et al., 2017), a unified learning framework that learns both feature embeddings and label embeddings. It leverages a partial-label loss to handle label noise and uses cosine similarity to conduct inference. Here, we only use its relation extraction component.

ReHession (Liu et al., 2017) directly maps each feature to an embedding vector, treats their average as the relation mention representation z, and uses a softmax to make predictions. This method was initially proposed for heterogeneous supervision, and is modified to fit our distantly supervised RE task. Specifically, for a relation mention annotated with a set of relation labels Y, it first calculates a cross entropy as the loss function:

L = - Σ_i q_i log p(l_i | z)    (2)

where p is defined in Equation 1, and q is used to encode the supervision information in a self-adapted manner: q_i ∝ p(l_i | z) if l_i ∈ Y, and q_i = 0 otherwise. We can see that when |Y| = 1 (i.e., only one label is assigned to the relation mention), q is one-hot and Equation 2 becomes the standard cross-entropy loss.
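A minimal sketch of this loss; the exact form of q below (renormalizing the model's own probabilities over the annotated label set) is one plausible reading of the self-adapted target, consistent with the single-label one-hot case:

```python
import numpy as np

def soft_label_cross_entropy(p, labeled):
    """Cross-entropy against a self-adapted target q (Equation 2).

    p       : (K,) predicted distribution from Equation 1
    labeled : indices of relation types annotated for this mention (Y)
    q renormalizes the model's own probabilities over the annotated set
    (an assumed form of the self-adapted target); with a single annotated
    label, q is one-hot and this reduces to standard cross-entropy.
    """
    q = np.zeros_like(p)
    q[labeled] = p[labeled] / p[labeled].sum()
    return -(q * np.log(p + 1e-12)).sum()

p = np.array([0.7, 0.2, 0.1])
# single annotated label -> standard cross-entropy -log p[0]
assert np.isclose(soft_label_cross_entropy(p, [0]), -np.log(0.7))
```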

2.3.2 Neural Models

We employ several popular neural architectures to compute the sentence representation z. As for the loss function, the cross entropy shown in Equation 2 is used for all of the following neural models.

Bi-LSTMs and Bi-GRUs use bidirectional RNNs to encode sentences and concatenate the final states of the last layer as the representation. Following previous work Zhang et al. (2017), we use two types of RNNs, Bi-LSTMs and Bi-GRUs. Both have 2 layers with 200d hidden states in each layer.

Position-Aware LSTM computes sentence representation with an attention over the outputs of LSTMs. It treats the last hidden state as the query and integrates a position encoding to calculate the attention (Zhang et al., 2017).


CNN and PCNN use convolutional neural networks as the encoder. In particular, CNN directly appends max-pooling after the convolution layer Zeng et al. (2014); PCNN uses the entities to split each sentence into three pieces, and applies max-pooling to each piece after the convolution layer; the three outputs are concatenated as the final representation Zeng et al. (2015).

SDP-LSTM splits the shortest dependency path between the entities into two sub-paths at their common ancestor node. The two sub-paths are fed into LSTMs respectively, and the two resulting representations are concatenated as the final representation (Xu et al., 2015).

2.4 Training Details

We run each model five times and report the average F1 score and standard deviation.


We use Stochastic Gradient Descent (SGD) for all models. The learning rate is initially set to 1.0 and is dynamically updated during training: once the loss on the dev set shows no improvement for 3 consecutive epochs, the learning rate is multiplied by 0.1.

Hyper-parameters. For ReHession, dropout is applied on the input features and after average pooling; the two dropout rates are tuned on the dev set.

For Position-Aware LSTM, Bi-LSTM and Bi-GRU, dropout (Srivastava et al., 2014) is applied at the input of the RNN, between RNN layers, and after the RNN before the linear layer. Following Melis et al. (2017), we treat the input/output dropout probabilities and the intra-layer dropout probability as three separate hyper-parameters and tune them greedily. Following previous work Zhang et al. (2017), the dimension of the hidden states is set to 200.

Following previous work Lin et al. (2016), the number of kernels is set to 230 for CNN and PCNN, and the window size is set to 3. Dropout is applied after pooling and activation, with the rate tuned on the dev set.

3 The Diminishing Phenomenon

Figure 2: F1 score comparison among three popular models (ReHession, Bi-GRU and Bi-LSTM). Bi-GRU outperforms ReHession by a significant gap on TACRED, but achieves only comparable performance on KBP and NYT. A similar gap-diminishing phenomenon occurs with Bi-LSTM.

Neural models alleviate the reliance on hand-crafted features and have further pushed the state of the art on datasets with human annotations. However, we observe that this performance boost starts to diminish under distant supervision. Specifically, we list the performance of all models in Table 2 and summarize three popular models in Figure 2.

On TACRED, a human annotated dataset, complex neural models like PA-LSTM and Bi-GRU significantly outperform feature-based methods, with an up to 14% relative F1 improvement. On the other hand, on distantly supervised datasets (KBP and NYT), the performance gap between these two kinds of methods diminishes to a roughly 5% relative F1 improvement.

We refer to this observation as the "diminishing" phenomenon. It implies a failure to handle some underlying difference between human annotations and distant supervision. In a broad exploration, we found two heuristic techniques that substantially boost performance on distantly supervised datasets but fail to do so on the human-annotated dataset. We believe these techniques capture problems exclusive to distantly supervised RE, and may be related to the "diminishing" phenomenon. To gain deeper insights, we further analyze the diminishing phenomenon and the two heuristic methods empirically.

Figure 3: F1 scores on three datasets when (a) no threshold (original) (b) max threshold (c) entropy threshold is applied. A clear boost is observed on distantly supervised datasets (i.e. KBP and NYT) after threshold is applied; performance remains the same on human-annotated TACRED. Heuristic threshold techniques may capture some underlying problems in distantly supervised relation extraction.

3.1 Heuristic Threshold Techniques

Max threshold and entropy threshold are designed to identify "ambiguous" relation mentions (i.e., those predicted with low confidence) and label them as the None type Liu et al. (2017); Ren et al. (2017). In particular, referring to the original prediction as p(l | z) from Equation 1, we now formally introduce these two threshold techniques:


  • Max Threshold introduces an additional hyper-parameter T and adjusts the prediction as (Ren et al., 2017): predict argmax_i p(l_i | z) if max_i p(l_i | z) ≥ T, and None otherwise.

  • Entropy Threshold introduces an additional hyper-parameter T_H and calculates the entropy of the prediction, H = - Σ_i p(l_i | z) log p(l_i | z). It then adjusts the prediction as (Liu et al., 2017): predict argmax_i p(l_i | z) if H ≤ T_H, and None otherwise.

To estimate T or T_H, we sample 10% of the instances from the test set as an additional development set and tune the values of T and T_H with grid search. After that, we evaluate model performance on the remaining 90% of the test set. We refer to this new dev set as the clean dev and to the original dev set as the noisy dev.

We would like to highlight that tuning the threshold on the clean dev is necessary, as it acts as a bridge between the distantly supervised train set and the human-annotated test set. Meanwhile, no learned model parameters are updated by this threshold method; the only difference is one additional parameter introduced during evaluation.
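The two heuristics can be sketched as follows (the None index and the example numbers are illustrative, not from the paper):

```python
import numpy as np

NONE = 0  # index of the None type (an assumed convention)

def max_threshold(p, T):
    """Reject the prediction as None when the top probability is below T."""
    return int(np.argmax(p)) if p.max() >= T else NONE

def entropy_threshold(p, T_H):
    """Reject the prediction as None when the prediction entropy exceeds T_H."""
    H = -(p * np.log(p + 1e-12)).sum()
    return int(np.argmax(p)) if H <= T_H else NONE

p = np.array([0.05, 0.50, 0.45])      # an ambiguous prediction
assert max_threshold(p, 0.6) == NONE  # rejected: max probability 0.5 < 0.6
assert max_threshold(p, 0.4) == 1     # confident enough: argmax kept
```

Both heuristics only post-process p at evaluation time; no model parameters change, matching the description above.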

3.2 Results and Discussion

Results of three representative models (ReHession, Bi-GRU and Bi-LSTM) are summarized in Figure 3. Full results are listed in Table 3.

We observe significant improvements on the distantly supervised datasets (i.e., KBP and NYT), with up to a 17% relative F1 improvement (Bi-GRU, from 37.77% to 44.22%). However, on the human-annotated corpus, the performance boost can hardly be noticed. Such inconsistency implies that these heuristics may capture some important but overlooked factors in distantly supervised relation extraction, though their underlying mechanisms remain unclear.

Intuitively, distant supervision differs from human annotation in two ways: (1) false positives: unrelated entity pairs in a sentence are falsely annotated as a certain relation type; (2) false negatives: related entity pairs are neglected. This label noise distorts the true label distribution of the training corpus, creating a gap between the label distributions of the train and test sets (i.e., shifted label distribution). Existing denoising methods may reduce the effect of noisy training instances; still, recovering the original labels is infeasible, and thus label distribution shift remains an unsolved problem.

In our experiments, we notice that in distantly supervised datasets, instances labeled as None occupy a larger portion of the test set than of the train set. The strategy of rejecting "ambiguous" predictions evidently guides the model to predict the None type more often, pushing the predicted label distribution in a favorable direction. Specifically, in KBP, 74.25% of instances are annotated as None in the train set, versus 85.67% in the test set. The original Bi-GRU predictions label 75.72% of instances as None, close to 74.25%; after applying the max threshold and entropy threshold, this proportion becomes 86.18% and 88.30% respectively, close to 85.67%.

Accordingly, we believe part of the underlying mechanism of the heuristic thresholds is better handling of label distribution shift, and we seek to verify this hypothesis with experiments in the next section.

4 Shifted Label Distribution Matters

In this section, we will first describe the shifted label distribution problem in detail, and then empirically study its influence on model performance.

4.1 Shifted Label Distribution

Shifted label distribution refers to the problem that the label distribution of the train set does not align with that of the test set. This problem is related to, but different from, "learning from imbalanced data", where the data have severe label distribution skews. Admittedly, one relation may appear more or less often than another in natural language, creating distribution skews. However, that problem occurs widely in both supervised and distantly supervised settings, and is not our focus in this paper.

Our focus in this paper is the label distribution difference between the train and test sets. This problem is critical to distantly supervised relation extraction, where the train set is annotated with distant supervision while the test set is manually annotated. As mentioned in Section 3.2, distant supervision differs from human annotation by introducing false-positive and false-negative labels. The label distribution of the train set is subject to the existing entries in KBs, and thus a gap arises between the label distributions of the train and test sets.

We visualize the label distributions of KBP and NYT (both distantly supervised datasets) and a truncated 6-class version of TACRED (human-annotated dataset) in Figure 1. KBP and NYT both exhibit shifted label distributions, while TACRED's label distribution is relatively consistent.

4.2 Impact of Shifted Label Distribution

Figure 4: F1 scores on the synthesized shifted datasets D0-D4. We observe that (a) performance consistently drops from D0 to D4, indicating the importance of shifted distributions; (b) ReHession is more robust to label distribution shift, outperforming Bi-LSTM and Bi-GRU on D3 and D4; (c) applying a threshold is an effective defense against such shift for all three models.

In order to study the impact of label distribution shift quantitatively, we construct synthetic datasets by sub-sampling instances from the human-annotated TACRED dataset. In this way, the only variable is the label distribution of synthetic datasets, and thus the impact of other factors such as label noise is excluded.

Specifically, we keep the number of instances in the sampled training set roughly constant and create five sub-sampled datasets with label distributions D0-D4. D4 is a randomly generated label distribution (plotted in Appendix A Figure 7). D1-D3 are calculated as linear interpolations between D4 and the label distribution of TACRED's original training set (referred to as D0), i.e., D_i = (i/4) D4 + (1 - i/4) D0.
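The construction above can be sketched as follows (the D_i naming, the example class distribution, and the i/4 interpolation weights are illustrative assumptions consistent with the linear scheme):

```python
import numpy as np

def interpolate(d_rand, d_orig, alpha):
    """Linear interpolation between a random label distribution and the
    original one: alpha=0 recovers the original, alpha=1 the random one."""
    return alpha * np.asarray(d_rand) + (1 - alpha) * np.asarray(d_orig)

def per_label_counts(dist, n_total):
    """How many instances to sub-sample per label for a fixed-size set."""
    return np.floor(np.asarray(dist) * n_total).astype(int)

d_orig = np.array([0.6, 0.3, 0.1])   # illustrative original train distribution
rng = np.random.default_rng(1)
d_rand = rng.dirichlet(np.ones(3))   # a randomly generated distribution
for i in range(5):                   # the five synthetic datasets D0-D4
    d = interpolate(d_rand, d_orig, i / 4)
    assert np.isclose(d.sum(), 1.0)  # each D_i is still a valid distribution
```

Because only the label proportions change while the instances themselves come from the same human-annotated pool, label noise is held out as a confound.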

We conduct experiments with three typical models (i.e., Bi-GRU, Bi-LSTM and ReHession) and summarize the results in Fig 4. We observe that, from D0 to D4, the performance of all models consistently drops. This phenomenon verifies that shifted label distribution has a negative influence on model performance, and the effect increases as the train-set label distribution becomes more twisted. At the same time, we observe that the feature-based ReHession is more robust to such shift: the gap between ReHession and Bi-GRU steadily decreases, and ReHession eventually starts to outperform the other two models at D3. This could account for the "diminishing" phenomenon: neural models such as Bi-GRU are supposed to outperform ReHession by a large gap (as with D0); however, on distantly supervised datasets, shifted label distribution seriously interferes with performance (as with D3 and D4), and thus the performance gap diminishes.

We also applied the two heuristic threshold techniques on the five synthesized datasets and summarize their performance in Fig 4. After applying threshold techniques, the three models become more robust to the label distribution shift. This observation verifies that the underlying mechanism of threshold techniques can help the model better handle label distribution shift.

5 Bias Adjustment: An Adaptation Method in Evaluation

Investigating the probabilistic nature of the softmax classifier, we present a principled approach from a domain adaptation perspective, which adjusts the bias term in the softmax classifier during evaluation. This approach explicitly fits the model to the label distribution shift.

5.1 Bias Adaptation

We view corpora with different label distributions as different domains. Denoting the distantly supervised corpus (train set) as S and the human-annotated corpus (test set) as T, our task becomes to calculate p_T(l | z) based on p_S(l | z).

We assume that the semantic meaning of each label is unchanged across S and T:

p_S(z | l_i) = p_T(z | l_i)    (3)
As distantly supervised relation extraction models are trained on S, the prediction in Equation 1 can be viewed as p_S(l | z), i.e.,

p_S(l_i | z) = exp(w_i^T z + b_i) / Σ_j exp(w_j^T z + b_j)    (4)
Based on the Bayes Theorem, we have:

p_T(l_i | z) = p_T(z | l_i) p_T(l_i) / p_T(z)    (5)
Based on the definition of conditional probability and Equation 3, we have:

p_T(z | l_i) = p_S(z | l_i) = p_S(l_i | z) p_S(z) / p_S(l_i)    (6)
With Equations 4, 5 and 6, we can derive that

p_T(l_i | z) ∝ p_S(l_i | z) p_T(l_i) / p_S(l_i) ∝ exp(w_i^T z + b_i + log p_T(l_i) - log p_S(l_i))    (7)

With this derivation we now know that, under certain assumptions (Equation 3), we can adjust the prediction to fit a target label distribution using p_S(l) and p_T(l).

The train-set label distribution p_S(l) can be easily estimated on the train set. In our experiment, p_T(l) is directly estimated on the full test set, for the sole purpose of evaluation. We want to highlight that this additional information is only used for evaluating how models are influenced by distribution shift, not for improving reported performance; an accurate estimate of p_T(l) is reasonable and necessary for this purpose.

Figure 5: F1 improvement using BA-Fix. BA-Fix consistently improves performance across the compared models.

Accordingly, we use Equation 7 to calculate the adjusted prediction with an adjusted bias term:

p_T(l_i | z) = exp(w_i^T z + b'_i) / Σ_j exp(w_j^T z + b'_j), where b'_i = b_i + log p_T(l_i) - log p_S(l_i)    (8)
We implement bias adjustment in two ways:


  • BA-Set directly replaces the bias term b_i in Equation 1 with b'_i from Equation 8. It does not require any modification to model training and can be applied directly during evaluation.

  • BA-Fix fixes the bias term in Equation 1 as b_i = log p_S(l_i) during training and replaces it with log p_T(l_i) during evaluation. Intuitively, BA-Fix encourages the model to fit our assumption (Equation 3) better; still, it needs special handling during model training, which is a minor disadvantage of BA-Fix.
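Both variants share the same adjusted bias; a minimal sketch (the priors and dimensions below are illustrative):

```python
import numpy as np

def adjusted_bias(b, p_train, p_test):
    """Shift each bias term by the log-ratio of label priors:
    b'_i = b_i + log p_test(l_i) - log p_train(l_i)."""
    return b + np.log(p_test) - np.log(p_train)

def predict(z, W, b):
    """Softmax prediction over relation types (Equation 1)."""
    logits = W @ z + b
    e = np.exp(logits - logits.max())
    return e / e.sum()

# toy check: with uninformative features (W = 0), a model whose bias has
# fit the train prior is steered exactly to the target prior
p_train = np.array([0.74, 0.16, 0.10])   # e.g. a None-heavy train prior
p_test = np.array([0.86, 0.09, 0.05])    # None even more frequent at test
b = np.log(p_train)                      # bias that encodes the train prior
W = np.zeros((3, 4))
p = predict(np.ones(4), W, adjusted_bias(b, p_train, p_test))
assert np.allclose(p, p_test)            # adjustment recovers the target prior
```

In practice W is nonzero, so the adjustment reweights, rather than overrides, the evidence from the representation z.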

Dataset | Model     | Original     | Max-thres    | Ent-thres    | BA-Set       | BA-Fix
KBP     | ReHession | 36.07 ± 1.06 | 39.66 ± 0.98 | 39.14 ± 0.94 | 37.39 ± 0.68 | 38.38 ± 0.26
        | Bi-GRU    | 37.77 ± 0.18 | 42.25 ± 1.29 | 44.22 ± 1.76 | 41.35 ± 1.08 | 42.84 ± 0.82
        | Bi-LSTM   | 34.51 ± 0.99 | 38.68 ± 1.95 | 39.28 ± 0.67 | 38.08 ± 1.10 | 39.18 ± 0.75
        | PA-LSTM   | 37.28 ± 0.81 | 42.63 ± 1.12 | 43.26 ± 1.32 | 41.88 ± 1.37 | 42.44 ± 1.32
        | CNN       | 30.53 ± 2.26 | 34.89 ± 2.22 | 36.18 ± 2.48 | 31.39 ± 3.00 | 35.45 ± 1.96
        | PCNN      | 33.15 ± 0.93 | 33.69 ± 0.93 | 36.14 ± 0.96 | 33.42 ± 0.67 | 40.67 ± 1.95
NYT     | ReHession | 46.79 ± 0.75 | 48.88 ± 0.60 | 48.64 ± 0.54 | 48.69 ± 0.65 | 48.19 ± 0.20
        | Bi-GRU    | 47.88 ± 0.85 | 48.48 ± 1.50 | 48.04 ± 1.61 | 49.26 ± 1.04 | 49.62 ± 1.74
        | Bi-LSTM   | 48.15 ± 0.87 | 48.93 ± 0.94 | 48.81 ± 1.16 | 50.15 ± 0.62 | 50.29 ± 0.84
        | PA-LSTM   | 46.33 ± 0.64 | 46.63 ± 0.75 | 46.11 ± 0.83 | 48.91 ± 1.01 | 49.76 ± 0.86
        | CNN       | 46.75 ± 2.79 | 46.45 ± 2.91 | 47.45 ± 2.33 | 49.87 ± 2.06 | 48.47 ± 1.85
        | PCNN      | 44.63 ± 2.70 | 46.72 ± 2.81 | 47.69 ± 0.59 | 49.53 ± 1.01 | 50.19 ± 1.10
Table 3: F1 scores of RE models with threshold and bias adaptation. Five-run averages and standard deviations are reported.

5.2 Results and Discussion

Experiment results are listed in Table 3. With BA-Set and BA-Fix applied, the F1 scores of all models on the distantly supervised datasets are consistently improved. In the case of PCNN on KBP, up to a 22% relative F1 improvement is observed (from 33.15% to 40.67%). This supports our argument that shifted label distribution is an important issue in distantly supervised RE, and that existing models do not handle it properly.

Also, we observe that the performance improvement from bias adjustment is significant on neural models such as PA-LSTM and PCNN, but comparatively smaller on the feature-based ReHession. This mirrors the observation with the heuristic thresholds.

Noting that only the bias terms in the classifier are modified and only a small piece of extra information is used, we are again convinced that shifted label distribution severely hinders model performance and may be limiting the power of neural models. Hidden representations learned by neural models indeed capture semantic meaning more accurately than feature-based models, yet the bias in the classifier becomes the major obstacle to better performance.

6 Comparison with Denoising Methods

To demonstrate the impact of the shifted label distribution, we compare the performance gain obtained from handling label distribution shift with those obtained from reducing label noise. These are two separate factors influencing model performance. Extensive efforts have been made to solve the latter while the former is often overlooked. The purpose of the comparison is to emphasize the importance of label distribution shift and encourage future research in this direction.

We apply a popular label noise reduction method, selective attention Lin et al. (2016), which groups all sentences with the same entity pair into one bag, conducts multi-instance training, and tries to place more weight on high-quality sentences within the bag. Selective attention, along with the threshold techniques and bias adjustment introduced in previous sections, is applied to two representative models (i.e., PCNN and Bi-GRU). We summarize their improvements over the original models in Figure 6.

Selective attention can indeed improve performance; meanwhile, the heuristic thresholds and bias adaptation approaches also boost performance, in some cases even more significantly than selective attention. This observation is reasonable since both the heuristics and the bias adaptation approaches are able to access additional information. Still, it is surprising that such a small piece of information makes such a large difference, demonstrating the importance of handling shifted label distribution. It also shows that there remains much room for improving distantly supervised relation extraction models from the perspective of shifted label distribution.

Figure 6: Comparison among selective attention, threshold heuristics and bias adaptation approaches. Threshold heuristics and bias adaptation approaches bring more significant improvements.

7 Related Work

Neural Relation Extraction Relation extraction aims to identify the relationship between entity mentions in a sentence. Recent approaches rely on considerable labeled instances to train effective models. Zeng et al. (2014) proposed using CNNs for relation extraction, which automatically capture features from text. Zeng et al. (2015) further extended this with piecewise max-pooling. Lin et al. (2016) applied sentence-level selective attention for learning from multiple instances. Zhang et al. (2017) combined an attention mechanism with position information.

Distant Supervision In the supervised relation extraction paradigm, this task suffers from a lack of large-scale labeled training data. To alleviate the dependency on human supervision, Mintz et al. (2009) proposed distant supervision, namely constructing large datasets by heuristically aligning text to an existing knowledge base. Though it lightens the annotation burden, distant supervision inevitably introduces label noise: text can be annotated with labels that are not expressed in the sentence, since relation labels are assigned merely according to the entity mentions appearing in it. Several methods exist for dealing with label noise: Riedel et al. (2010) use the multi-instance single-label learning paradigm; Hoffmann et al. (2011) and Surdeanu et al. (2012) propose multi-instance multi-label learning paradigms. Recently, with the advance of neural network techniques, deep learning methods Zeng et al. (2015); Lin et al. (2016); Feng et al. (2018) have been introduced to reduce the impact of label noise.

8 Conclusion

In this paper, we first present the observation of inconsistent model performance between training with human annotations and with distant supervision in relation extraction. This leads us to explore the underlying challenges of distantly supervised relation extraction. Relating two effective threshold techniques to label distribution, we reveal an important yet long-overlooked issue: shifted label distribution. The impact of this issue is further demonstrated with experiments on synthetic datasets. We also approach this issue from a domain adaptation perspective, introducing a bias adjustment method in evaluation to recognize and highlight label distribution shift.

Based on these observations, we suggest that, in addition to label noise, more attention be paid to the shifted label distribution issue in distantly supervised relation extraction research. We hope the analysis presented here provides new insights into this long-overlooked issue and encourages future research on creating models robust to label distribution shift. We also hope that methods such as the threshold techniques and bias adjustment become useful evaluation tools in future research.



Appendix A Additional Figures and Tables

This section contains a table of the manually designed features used in the feature-based models, as well as figures showing the shifted label distributions of the datasets used in our experiments (NYT, TACRED, and the synthesized TACRED variants).

Table 4 shows the features used in the two feature-based models.

Figure 7 shows the full label distribution of the TACRED dataset, and of the simulated TACRED dataset with a shifted train distribution.

Figure 8 shows the full label distribution of the NYT dataset. NYT is constructed with distant supervision, and its label distribution is shifted.
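One simple way to quantify the per-label shift visualized in these figures is the absolute log-ratio between a label's frequency in the train set and in the test set. The sketch below is our own illustration of this measure; the function names, the smoothing constant `eps`, and the toy label lists are assumptions, not the paper's exact procedure.

```python
from collections import Counter
import math

def label_distribution(labels):
    """Empirical label distribution of a list of instance labels."""
    counts = Counter(labels)
    total = len(labels)
    return {lbl: c / total for lbl, c in counts.items()}

def shift_per_label(train_labels, test_labels, eps=1e-12):
    """Absolute log-ratio of train vs. test frequency for each label.

    A value near 0 means the label's proportion is consistent across the
    two sets; larger values indicate a more severe shift.
    """
    p_train = label_distribution(train_labels)
    p_test = label_distribution(test_labels)
    labels = set(p_train) | set(p_test)
    return {lbl: abs(math.log(p_train.get(lbl, eps) / p_test.get(lbl, eps)))
            for lbl in labels}

# Toy example: "per:title" is over-represented in train (0.8 vs. 0.5).
train = ["per:title"] * 8 + ["org:founded"] * 2
test = ["per:title"] * 5 + ["org:founded"] * 5
shift = shift_per_label(train, test)
```

Binning labels by this quantity yields histograms like the ones in Figure 1, where distantly supervised datasets place much more mass in the high-shift bins than human-annotated ones.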

| Feature name | Description | Example |
| --- | --- | --- |
| Brown cluster | Brown cluster ID for each token | BROWN_010011001 |
| Part-of-speech (POS) tag | POS tags of tokens between the two EMs | "VBD", "VBN", "IN" |
| Entity mention token | Tokens in each entity mention | TKN_EM1_Hussein |
| Entity mention (EM) head | Syntactic head token of each entity mention | HEAD_EM1_HUSSEIN |
| Entity mention order | Whether EM1 appears before EM2 | EM1_BEFORE_EM2 |
| Entity mention distance | Number of tokens between the two EMs | EM_DISTANCE_3 |
| Entity mention context | Unigrams before and after each EM | EM1_AFTER_was |
| Tokens between two EMs | Each token between the two EMs | "was", "born", "in" |
| Collocations | Bigrams in the left/right 3-word window of each EM | "Hussein was", "in Amman" |

Table 4: Text features used in the feature-based models, illustrated on the sentence "Hussein was born in Amman" with entity mentions "Hussein" (EM1) and "Amman" (EM2).
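A few of the Table 4 features can be reproduced with a short sketch on the running example. This is purely illustrative: the function name, the `(start, end)` span representation, and the `BTWN_` prefix for tokens between mentions are our own choices, not the exact feature templates of the original implementation.

```python
def extract_features(tokens, em1_span, em2_span):
    """Extract a few Table 4-style features for one sentence.

    em1_span / em2_span are (start, end) token index pairs, end exclusive,
    and EM1 is assumed to precede EM2 or vice versa without overlap.
    """
    feats = []
    s1, e1 = em1_span
    s2, e2 = em2_span
    # Entity mention tokens.
    feats += [f"TKN_EM1_{tokens[i]}" for i in range(s1, e1)]
    # Entity mention order.
    feats.append("EM1_BEFORE_EM2" if e1 <= s2 else "EM2_BEFORE_EM1")
    # Entity mention distance: number of tokens between the two mentions.
    feats.append(f"EM_DISTANCE_{s2 - e1}")
    # One-token right context of EM1.
    if e1 < len(tokens):
        feats.append(f"EM1_AFTER_{tokens[e1]}")
    # Tokens between the two mentions.
    feats += [f"BTWN_{tokens[i]}" for i in range(e1, s2)]
    return feats

tokens = ["Hussein", "was", "born", "in", "Amman"]
feats = extract_features(tokens, (0, 1), (4, 5))
```

On this sentence the sketch yields, among others, TKN_EM1_Hussein, EM1_BEFORE_EM2, EM_DISTANCE_3, and EM1_AFTER_was, matching the example column of Table 4.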
Figure 7: Top: label distribution of the original TACRED; Bottom: a randomly generated label distribution for the train set, with the original test set kept unchanged. Label distributions of the other synthesized datasets are generated by linear interpolation between these two train-set distributions.
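The linear interpolation used to generate the intermediate synthesized train distributions can be sketched as follows. The interpolation coefficient values and variable names here are illustrative assumptions; the paper's exact coefficients are not restated.

```python
import numpy as np

def interpolate(p_original, p_random, alpha):
    """Mix the original and randomly generated train-set label
    distributions with coefficient alpha in [0, 1]."""
    p = (1.0 - alpha) * np.asarray(p_original) + alpha * np.asarray(p_random)
    return p / p.sum()  # renormalize to guard against rounding error

# Toy 3-class example: alpha=0 recovers the original distribution,
# alpha=1 the fully random one, and values in between interpolate.
p_original = np.array([0.6, 0.3, 0.1])
p_random = np.array([0.2, 0.3, 0.5])
p_mid = interpolate(p_original, p_random, 0.5)
```

Sweeping `alpha` over a grid of values produces a family of train sets whose label distributions shift gradually away from the test set, which is what makes the synthetic experiments possible.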
Figure 8: Label distribution of the original NYT. Similar to KBP, its label distributions are shifted as well.