
Separating Retention from Extraction in the Evaluation of End-to-end Relation Extraction

by   Bruno Taillé, et al.

State-of-the-art NLP models can adopt shallow heuristics that limit their generalization capability (McCoy et al., 2019). Such heuristics include lexical overlap with the training set in Named-Entity Recognition (Taillé et al., 2020) and Event or Type heuristics in Relation Extraction (Rosenman et al., 2020). In the more realistic end-to-end RE setting, we can expect yet another heuristic: the mere retention of training relation triples. In this paper, we propose several experiments confirming that retention of known facts is a key factor of performance on standard benchmarks. Furthermore, one experiment suggests that a pipeline model able to use intermediate type representations is less prone to over-rely on retention.





1 Introduction

Code for reproducing our evaluation settings is available at

Information Extraction (IE) aims at converting the information expressed in a text into a predefined structured format of knowledge. This global goal has been divided into subtasks that are easier to perform automatically and to evaluate. Hence, Named Entity Recognition (NER) and Relation Extraction (RE) are two key IE tasks, among others such as Coreference Resolution (CR), Entity Linking or Event Extraction. Traditionally performed as a pipeline Bach2007AExtraction, these two tasks can be tackled jointly in order to model their interdependency, alleviate error propagation and obtain a more realistic evaluation setting Roth2002ProbabilisticRecognition; Li2014IncrementalRelations.

Following the general trend in Natural Language Processing (NLP), the recent quantitative improvements reported on Entity and Relation Extraction benchmarks are at least partly explained by the use of larger and larger pretrained Language Models (LMs) such as BERT to obtain contextual word representations. Concurrently, there is a realization that new evaluation protocols are necessary to better understand the strengths and shortcomings of the obtained neural network models, beyond a single holistic metric on a held-out test set ribeiro-etal-2020-beyond.

In particular, generalisation to unseen data is a key factor in the evaluation of deep neural networks. It is all the more important in IE tasks that revolve around the extraction of mentions: small spans of words that are likely to occur in both the evaluation and training datasets. This lexical overlap has been shown to be correlated with neural network performance in NER Augenstein2017GeneralisationAnalysis; Taille2020ContextualizedGeneralization. For pipeline RE, rosenman-etal-2020-exposing and peng-etal-2020-learning expose shallow heuristics in neural models: relying too much on the type of the candidate arguments or on the presence of specific triggers in their contexts. In end-to-end Relation Extraction, we can expect these NER and RE heuristics to be combined.

In this work, we argue that current evaluation benchmarks measure not only the desired ability to extract information contained in a text but also the capacity of the model to simply retain labeled (head, predicate, tail) triples during training. When the model is evaluated on a sentence expressing a relation seen during training, it is hard to disentangle which of these two behaviours is predominant. However, we can hypothesize that the model can simply retrieve previously seen information, acting like a mere compressed form of knowledge base probed with a relevant query. Thus, testing on too many examples with seen triples can lead to overestimating the generalizability of a model. Even without labeled data, LMs are able to learn some relations between words that can be probed with cloze sentences where an argument is masked petroni-etal-2019-language. This raises the additional question of lexical overlap with the orders of magnitude larger unlabeled LM pretraining corpora, which remains out of the scope of this paper.

2 Datasets and Models

We study three recent end-to-end RE models on CoNLL04 roth-yih-2004-linear, ACE05 Walker2005 and SciERC luan-etal-2018-multi. They rely on various pretrained LMs and, for a fairer comparison, we use BERT devlin-etal-2019-bert on ACE05 and CoNLL04 and SciBERT beltagy-etal-2019-scibert on SciERC (more implementation details in Appendix A).


PURE

Zhong2021AExtraction follows the pipeline approach. The NER model is a classical span-based model sohrab-miwa-2018-deep. Special tokens corresponding to each predicted entity span are added and used as representations for Relation Classification. For a fairer comparison with other models, we study the approximation model, which only requires one pass in each encoder and is limited to sentence-level prediction. However, it still requires finetuning and storing two pretrained LMs, whereas the following models need a single one.



SpERT

Eberts2020Span-basedPre-training uses a similar span-based NER module. RE is performed based on the filtered representations of candidate arguments as well as a max-pooled representation of their middle context. While Entity Filtering is close to the pipeline approach, the NER and RE modules share a common entity representation and are trained jointly. We also study the ablation of the max-pooled context representation, which we denote Ent-SpERT.


Two are better than one (TABTO)

wang-lu-2020-two intertwines a sequence encoder and a table encoder in a Table Filling approach miwa-sasaki-2014-modeling. Contrary to the previous models, the pretrained LM is frozen and both its final hidden states and attention weights are used by the encoders. The prediction is finally performed by a Multi-Dimensional RNN (MD-RNN). Because it is not based on span-level predictions, this model cannot detect nested entities, such as those found in SciERC.

3 Partitioning by Lexical Overlap

Following Augenstein2017GeneralisationAnalysis; Taille2020ContextualizedGeneralization, we partition the entity mentions in the test set based on lexical overlap with the training set. We distinguish Seen and Unseen mentions and also extend this partition to relations. We denote a relation as an Exact Match if the same (head, predicate, tail) triple appears in the train set; as a Partial Match if one of its arguments appears in the same position in a training relation of the same type; and as New otherwise. We implement a naive Retention Heuristic that tags an entity mention or a relation exactly present in the training set with its majority label. We report micro-averaged Precision, Recall and F1 scores for both NER and RE in LABEL:table:overlap. An entity mention is considered correct if both its boundaries and type have been correctly predicted. For RE, we report scores in the Boundaries and Strict settings bekoulis-etal-2018-adversarial; Taille2020LetsExtraction. In the Boundaries setting, a relation is correct if its type is correct and the boundaries of its arguments are correct, without considering the detection of their types. The Strict setting adds the requirement that the entity types of both arguments are correct.
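The partition and the Retention Heuristic described above can be sketched in a few lines of Python. This is a minimal illustration and not the released evaluation code: the function names, the representation of a relation as a (head, predicate, tail) tuple of surface strings, and the choice to count only labeled training occurrences when computing the majority label are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def partition_relation(triple, train_triples):
    """Return 'Exact Match', 'Partial Match' or 'New' for a test triple.

    `triple` is a (head, predicate, tail) tuple of surface strings and
    `train_triples` is a set of such tuples (hypothetical representation).
    """
    head, predicate, tail = triple
    if triple in train_triples:
        return "Exact Match"
    for train_head, train_predicate, train_tail in train_triples:
        # Same relation type and one argument seen in the same position.
        if train_predicate == predicate and (train_head == head or train_tail == tail):
            return "Partial Match"
    return "New"


def build_retention_tagger(train_mentions):
    """Build a tagger that returns the majority training label of a mention.

    `train_mentions` is an iterable of (surface_form, label) pairs.
    Unseen mentions are tagged None, i.e. they are left unlabeled.
    """
    counts = defaultdict(Counter)
    for text, label in train_mentions:
        counts[text][label] += 1
    majority = {text: c.most_common(1)[0][0] for text, c in counts.items()}
    return lambda text: majority.get(text)


train_triples = {("Oswald", "Kill", "Kennedy"), ("Paris", "Located in", "France")}
print(partition_relation(("Oswald", "Kill", "Tippit"), train_triples))

tag = build_retention_tagger([("Kennedy", "Peop"), ("Kennedy", "Peop"), ("Kennedy", "Loc")])
print(tag("Kennedy"), tag("Berlin"))
```

The same tagger can be applied to relations by keying the counts on (head, tail) pairs instead of single mentions.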

3.1 Dataset Specificities

We first observe very different statistics of Mention and Relation Lexical Overlap in the three datasets, which can be explained by the singularities of their entities and relations. In CoNLL04, mentions are mainly Named Entities denoted with proper nouns, while in ACE05 the surface forms are very often common nouns or even pronouns, which explains the occurrence of training entity mentions such as "it", "which" or "people" in test examples. This also leads to a weaker entity label consistency fu-etal-2020-interpretable: in ACE05, "it" is labeled with every possible entity type and appears mostly unlabeled, whereas a mention such as "President Kennedy" is always labeled as a person in CoNLL04. Similarly, mentions in SciERC are common nouns that can be tagged with different labels, and they can also be nested. Both the poor label consistency and the nested nature of entities hurt the performance of the retention heuristic. For RE, while SciERC has almost no exact overlap between test and train relations, ACE05 and CoNLL04 have similar levels of exact match. The larger proportion of partial matches in ACE05 is explained by pronouns, which are more likely to co-occur in several instances. The difference in performance of the heuristic is also explained by a poor relation label consistency.

3.2 Lexical Overlap Bias

As expected, this first evaluation setting exposes an important lexical overlap bias in end-to-end Relation Extraction, similar to the one already discussed in NER. On every dataset and for every model, micro F1 scores are the highest for Exact Match relations, followed by Partial Match relations and finally totally unseen relations. This is a first confirmation that retention plays an important role in the measured overall performance of end-to-end RE models.
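To make the scores compared here concrete, the following sketch computes micro-averaged Precision, Recall and F1 under the Boundaries and Strict criteria. It is only an illustration under stated assumptions: the encoding of a relation as (sentence_id, (head_span, head_type), predicate, (tail_span, tail_type)) and the function name `micro_prf` are hypothetical, and per-partition scores would be obtained by filtering the relations by partition before calling it.

```python
def micro_prf(gold, predicted, strict=True):
    """Micro Precision, Recall and F1 over relation sets.

    Each relation is (sentence_id, (head_span, head_type), predicate,
    (tail_span, tail_type)). In the Boundaries setting (strict=False)
    the argument types are ignored when matching.
    """
    def key(relation):
        sentence_id, (head_span, head_type), predicate, (tail_span, tail_type) = relation
        if strict:
            return (sentence_id, head_span, head_type, predicate, tail_span, tail_type)
        return (sentence_id, head_span, predicate, tail_span)

    gold_keys = {key(r) for r in gold}
    predicted_keys = {key(r) for r in predicted}
    true_positives = len(gold_keys & predicted_keys)
    precision = true_positives / len(predicted_keys) if predicted_keys else 0.0
    recall = true_positives / len(gold_keys) if gold_keys else 0.0
    f1 = 2 * precision * recall / (precision + recall) if true_positives else 0.0
    return precision, recall, f1


gold = [(0, ((0, 1), "Peop"), "Kill", ((3, 4), "Peop"))]
# Correct boundaries and relation type, but a wrong head entity type.
predicted = [(0, ((0, 1), "Org"), "Kill", ((3, 4), "Peop"))]
print(micro_prf(gold, predicted, strict=False))
print(micro_prf(gold, predicted, strict=True))
```

In this toy case the prediction counts as correct in the Boundaries setting and as an error in the Strict setting, which is the kind of gap caused by entity type confusions.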

3.3 Model Comparisons

While we cannot evaluate TABTO on SciERC because it is unfit for the extraction of nested entities, we can notice different hierarchies of models on every dataset, suggesting that there is no one-size-fits-all best model, at least in current evaluation settings. The most obvious comparison is between SpERT and Ent-SpERT, where the explicit representation of context is ablated. This results in a loss of performance on the RE part, especially on partially matching or new relations for which the pairs of entity representations have not been seen. Ent-SpERT is particularly effective on Exact Matches on CoNLL04, suggesting its retention capability. Other comparisons are more difficult, given the numerous variations in both the structure of each model and the training procedures. However, the PURE pipeline setting seems to be more effective only on ACE05, where its NER performance is significantly better, probably because learning separate NER and RE encoders makes it possible to capture more specific information for each task. Even then, TABTO yields better Boundaries performance and is only penalized in the Strict setting by entity type confusions. In contrast, on CoNLL04, TABTO significantly outperforms its counterparts, especially on unseen relations. This indicates that it incorporates contextual information more effectively in this case, where relation and argument types are mapped bijectively. On SciERC, the performance of all models is already compromised at the NER level before the RE step, which makes further distinction between model performance even more difficult.

4 Swapping Relation Heads and Tails

A second experiment to validate that retention is used as a heuristic in models' predictions is to modify their input sentences in a controlled manner, similarly to what is proposed in ribeiro-etal-2020-beyond. We propose a very focused experiment that consists in selecting asymmetric relations that occur between entities of the same type and swapping the head with the tail in the input. If the model predicts the original triple, then it over-relies on the retention heuristic, whereas finding the swapped triple is evidence of broader context incorporation. We show an example in LABEL:table:ex_swap. Because of the requirements of this experiment, we have to limit ourselves to two relations in CoNLL04: "Kill" between people and "Located in" between locations. Indeed, CoNLL04 is the only dataset with a bijective mapping between the type of a relation and the types of its arguments, and its consistent proper noun mentions make the swaps mostly grammatically correct. For each relation type, we only consider sentences with exactly one instance of the corresponding relation and swap its arguments. We only consider this relation in the RE scores reported in LABEL:table:swap_rev. We use the strict RE score as well as revRE, which measures the extraction of the reverse relation, not expressed in the sentence.

For each relation, the hierarchy of models corresponds to the one observed on CoNLL04 overall. Swapping arguments has a limited effect on NER, mostly for the "Located in" relation. However, it leads to a drop in RE for every model, and the revRE score indicates that SpERT and TABTO predict the reverse relation more often than the newly expressed one. This is further evidence of the retention heuristic of end-to-end models, although it might also be attributed to the language model. In particular for the "Located in" relation, swapped heads and tails are not exactly equivalent since the former are mainly cities and the latter countries. On the contrary, the PURE model is less prone to information retention, as shown by its revRE scores, which are significantly smaller than the standard RE scores on swapped sentences. Hence, it outperforms SpERT and TABTO on swapped sentences despite being the least effective on the original dataset.

The important discrepancy in results can be explained by the different types of representations used by these models. The pipeline approach allows the use of argument type representations in the Relation Classifier, whereas most end-to-end models use lexical features in a shared entity representation used for both NER and RE. These conclusions from quantitative results are validated qualitatively. We can observe four predominant and intuitive patterns on sentences with swapped relations: retention of the incorrect original triple, prediction of the correct swapped triple, and prediction of neither or both triples. We report some examples in LABEL:table:qualitative_kill and LABEL:table:qualitative_loc in the Appendix.
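The swapping procedure and the four outcome patterns can be illustrated with a short sketch. It assumes tokenized sentences, argument spans given as (start, end) token offsets with an exclusive end, and triples compared on surface strings; the function names `swap_arguments` and `swap_outcome` are hypothetical and do not come from the released code.

```python
def swap_arguments(tokens, head_span, tail_span):
    """Return a copy of `tokens` where the head and tail mentions trade places.

    Spans are (start, end) token offsets with an exclusive end and must
    not overlap.
    """
    first, second = sorted([head_span, tail_span])
    if first[1] > second[0]:
        raise ValueError("argument spans must not overlap")
    return (
        tokens[: first[0]]
        + tokens[second[0] : second[1]]   # later mention moves to the front slot
        + tokens[first[1] : second[0]]    # context between the arguments is kept
        + tokens[first[0] : first[1]]     # earlier mention moves to the back slot
        + tokens[second[1] :]
    )


def swap_outcome(predicted_triples, original_triple):
    """Classify predictions on a swapped sentence into one of four patterns."""
    head, predicate, tail = original_triple
    swapped_triple = (tail, predicate, head)  # relation now expressed in the text
    found_original = original_triple in predicted_triples  # counted by revRE
    found_swapped = swapped_triple in predicted_triples    # counted by RE
    if found_original and found_swapped:
        return "both"
    if found_original:
        return "retained original"
    if found_swapped:
        return "swapped"
    return "none"


tokens = "Lee Oswald shot Kennedy in Dallas".split()
print(swap_arguments(tokens, (0, 2), (3, 4)))
print(swap_outcome({("Lee Oswald", "Kill", "Kennedy")}, ("Lee Oswald", "Kill", "Kennedy")))
```

Under these assumptions, the RE score on swapped sentences counts "swapped" and "both" outcomes as hits, while revRE counts "retained original" and "both".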

5 Related Work

Several works on the generalization of NER models mention lexical overlap with the training set as a key indicator of performance. Augenstein2017GeneralisationAnalysis separate mentions in the test set as seen and unseen during training and measure out-of-domain generalization in an extensive study of two CRF-based models and SENNA, which combines a Convolutional Neural Network with a CRF (Collobert2011NaturalScratch). Taille2020ContextualizedGeneralization compare the effect of introducing contextual embeddings in the classical BiLSTM-CRF architecture in a similar setting and show that they help close the performance gap on unseen mentions and domains. arora-etal-2020-contextual; Fu2020RethinkingStudy; fu-etal-2020-interpretable study the influence of several properties such as lexical overlap, label consistency and entity length on the performance of state-of-the-art models. They model these properties as continuous scores associated with each mention and bucketized for evaluation. Lexical overlap has also been mentioned in Coreference Resolution moosavi-strube-2017-lexical, where coreferent mentions tend to co-occur in the test and train sets. In this line of work, the impact of lexical overlap is measured either by separating performance depending on the property of mentions (seen or unseen) or with out-of-domain evaluation on a test set from a different dataset with lower lexical overlap with the train set.

Another recently proposed method for fine-grained evaluation of NLP models beyond a single benchmark score is to modify the test sentences in a controlled manner. mccoy-etal-2019-right expose lexical overlap as a shallow heuristic adopted by state-of-the-art Natural Language Inference models, especially by swapping the subject and object of verbs in the hypothesis of some examples where the premise entails the hypothesis. While such a modification changes the label of these examples to non-entailment, all models tested show a spectacular drop in accuracy on these examples. ribeiro-etal-2020-beyond propose a broader set of test set modifications to individually test the robustness of NLP models to several patterns such as the introduction of negation, swapping words with synonyms, changing tense and more.

In pipeline RE, where ground truth candidate arguments are given, models often use intermediate representations based on entity types that reduce lexical overlap issues. However, rosenman-etal-2020-exposing show that they still tend to adopt shallow heuristics based on the type of the arguments and the presence of triggers indicative of the presence of a relation. They propose hard cases with several mentions of the same types for which Relation Classifiers struggle to connect the correct pair. Concurrently, peng-etal-2020-learning confirm that RE benchmarks present shallow cues, such as the type of the candidate arguments, that can be used alone to infer the relation. We propose to extend previous work on NER and RE to the more realistic end-to-end RE setting with two of the previously described approaches: 1) separating performance by lexical overlap of mentions or argument pairs and 2) modifying some CoNLL04 test examples by swapping relation heads and tails.

6 Conclusion

In this paper, we study three state-of-the-art end-to-end Relation Extraction models in order to highlight their tendency to retain seen relations. We confirm that retention of seen mentions and relations plays an important role in overall RE performance and can explain the relatively higher scores on CoNLL04 and ACE05 compared to SciERC. Furthermore, our experiment on swapping relation heads and tails tends to show that the intermediate manipulation of type representations instead of lexical features, enabled in the pipeline PURE model, makes it less prone to over-rely on retention. While the limited extent of our swapping experiment is an obvious limitation of this work, it shows limitations of both current benchmarks and models. It is an encouragement to propose new benchmarks that might be easily modified by design to probe such lexical overlap heuristics. Contextual information could for example be contained in templates that would be filled with different (head, tail) pairs either seen or unseen during training. Furthermore, pretrained Language Models can already capture relational information between phrases (petroni-etal-2019-language), and further experiments could help distinguish their role in the retention behaviour of RE models.


We thank the anonymous reviewers. This work was performed while Bruno Taillé was employed by BNP Paribas and supported by the French Ministry for Research (CIFRE convention 2018/0327). We also thank the H2020 project AI4EU (825619) which partially supports Patrick Gallinari.


Appendix A Implementation Details

For every model, we use the original code associated with the papers with the default best performing hyperparameters unless stated otherwise. We perform 5 runs on a single NVIDIA 2080Ti GPU for each of them on each dataset. For CoNLL04 and ACE05, we train each model with both the cased and uncased versions of BERT and only keep the best performing setting.


PURE

Zhong2021AExtraction We use the approximation model with a context window of 0 so that only the current sentence is used for prediction, which allows a comparison with the other models. For ACE05, we use the standard bert-base-uncased LM, but we use the bert-base-cased version on CoNLL04, which results in a significant +2.4 absolute improvement in RE Strict micro F1 score.


SpERT

Eberts2020Span-basedPre-training We use the original implementation as is with bert-base-cased for both ACE05 and CoNLL04 since the uncased version is not beneficial, even on ACE05 where there are fewer proper nouns. For the Ent-SpERT ablation, we simply remove the max-pooled context representation from the final concatenation in the RE module. This changes the RE classifier's input dimension from the original 2354 to 1586.
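The two dimensions quoted above differ by exactly 768, the hidden size of a base BERT encoder, which is consistent with removing a single max-pooled context vector. The short check below spells out one decomposition that reproduces both numbers. The width of 25 for each span-size embedding is an assumption made for illustration and should be verified against the configuration actually used.

```python
HIDDEN_SIZE = 768          # hidden size of bert-base models
SIZE_EMBEDDING = 25        # assumed width of each span-size embedding

# Full RE input: head span, tail span and context vectors, plus one
# size embedding per argument.
full_input = 3 * HIDDEN_SIZE + 2 * SIZE_EMBEDDING

# Ablated RE input: the max-pooled context vector is dropped.
ablated_input = full_input - HIDDEN_SIZE

print(full_input, ablated_input)
```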

Two are better than one (TABTO)

wang-lu-2020-two We use the original implementation with bert-base-uncased for both ACE05 and CoNLL04 since the cased version is not beneficial on CoNLL04.

Appendix B Datasets Statistics

We present general dataset statistics in LABEL:table:data_stats. We also compute average values of some entity and relation attributes inspired by fu-etal-2020-interpretable and reported in LABEL:table:data_consistency. We report two of their entity attributes: entity length in number of tokens (eLen) and entity label consistency (eCon). Given a test entity mention, its label consistency is the number of its occurrences in the training set with the same type divided by its total number of occurrences. It is zero for unseen mentions. Because eCon reflects both the ambiguity of labels for seen entities and the proportion of unseen entities, we introduce the eCon* score, which only averages the label consistency of seen mentions, and eLex, the proportion of entities with lexical overlap with the train set. We introduce similar scores for relations. Relation label consistency (rCon) extends label consistency to triples. Argument types label consistency (aCon) considers the labels of every pair of mentions of corresponding types in the training set. Because pairs of types are all seen during training, we do not decompose aCon into aCon* and aLex. Argument length (aLen) is the sum of the lengths of the head and tail mentions. Argument distance (aDist) is the number of tokens between the head and the tail of a relation. We present a more complete report of overall Precision, Recall and F1 scores that can be interpreted in light of these statistics in LABEL:table:overlap_prf.
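The entity scores defined above (eCon, eCon* and eLex) can be computed as in the following sketch. It follows the definitions given in the text, with mentions represented as (surface_form, label) pairs; the function name and the decision to count only labeled training occurrences in the denominator are assumptions of this example.

```python
from collections import Counter


def entity_consistency_stats(train_mentions, test_mentions):
    """Return average eCon, eCon* and eLex over test mentions.

    Both arguments are lists of (surface_form, label) tuples.
    """
    total_counts = Counter(text for text, _ in train_mentions)
    typed_counts = Counter(train_mentions)

    all_scores, seen_scores = [], []
    for text, label in test_mentions:
        if total_counts[text]:
            # Share of training occurrences carrying the same label.
            score = typed_counts[(text, label)] / total_counts[text]
            seen_scores.append(score)
        else:
            score = 0.0  # unseen mention
        all_scores.append(score)

    n_test = len(all_scores)
    return {
        "eCon": sum(all_scores) / n_test if n_test else 0.0,
        "eCon*": sum(seen_scores) / len(seen_scores) if seen_scores else 0.0,
        "eLex": len(seen_scores) / n_test if n_test else 0.0,
    }


train = [("it", "Org"), ("it", "Peop"), ("Kennedy", "Peop"), ("Kennedy", "Peop")]
test = [("it", "Org"), ("Kennedy", "Peop"), ("Berlin", "Loc"), ("it", "Loc")]
print(entity_consistency_stats(train, test))
```

The relation-level rCon score follows the same pattern with (head, tail) pairs as keys and predicates as labels.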