Label Verbalization and Entailment for Effective Zero- and Few-Shot Relation Extraction

09/08/2021 · Oscar Sainz et al. · UPV/EHU

Relation extraction systems require large amounts of labeled examples which are costly to annotate. In this work we reformulate relation extraction as an entailment task, with simple, hand-made verbalizations of relations produced in less than 15 minutes per relation. The system relies on a pretrained textual entailment engine which is run as-is (no training examples, zero-shot) or further fine-tuned on labeled examples (few-shot or fully trained). In our experiments on TACRED we attain 63% F1 zero-shot, 69% with 16 examples per relation (17 points better than the best supervised system under the same conditions), and only 4 points short of the state-of-the-art (which uses 20 times more training data). We also show that the performance can be improved significantly with larger entailment models, up to 12 points in zero-shot, allowing us to report the best results to date on TACRED when fully trained. The analysis shows that our few-shot systems are especially effective when discriminating between relations, and that the performance difference in low data regimes comes mainly from identifying no-relation cases.




1 Introduction

Given a context where two entities appear, the Relation Extraction (RE) task aims to predict the semantic relation (if any) holding between the two entities. Methods that fine-tune large pretrained language models (LM) with large amounts of labelled data have established the state of the art yamada-etal-2020-luke. Nevertheless, due to differing languages, domains and the cost of human annotation, there is typically a very small number of labelled examples in real-world applications, and such models perform poorly schick-schutze-2021-exploiting.

As an alternative, methods that only need a few examples (few-shot) or no examples (zero-shot) have emerged. For instance, prompt based learning proposes hand-made or automatically learned task and label verbalizations puri2019zeroShot; schick-schutze-2021-exploiting; schick2020small as an alternative to standard fine-tuning gao2020making; scao2021data. In these methods, the prompts are input to the LM together with the example, and the language modelling objective is used in learning and inference. In a different direction, some authors reformulate the target task (e.g. document classification) as a pivot task (typically question answering or textual entailment), which allows the use of readily available question answering (or entailment) training data yin-etal-2019-benchmarking; levy-etal-2017-zero. In all cases, the underlying idea is to cast the target task into a formulation which allows us to exploit the knowledge implicit in pre-trained LM (prompt-based) or general-purpose question answering or entailment engines (pivot tasks).

Prompt-based approaches are very effective when the label verbalization is given by one or two words (e.g. text classification), as these can be easily predicted by language models, but they struggle in cases where the label requires a more elaborate description, as in RE. We thus propose to reformulate RE as an entailment problem, where the verbalizations of the relation label are used to produce a hypothesis to be confirmed by an off-the-shelf entailment engine.

In our work (code and splits available) we have manually constructed verbalization templates for a given set of relations. Given that some verbalizations might be ambiguous (between city of birth and country of birth, for instance) we complemented them with entity type constraints. In order to ensure that the manual work involved is limited and practical in real-world applications, we allowed at most 15 minutes of manual labor per relation. The verbalizations are used as-is for zero-shot RE, but we also recast labelled RE examples as entailment pairs and fine-tune the entailment engine for few-shot RE.

The results on the widely used TACRED zhang2017tacred RE dataset in zero- and few-shot scenarios are excellent, well above state-of-the-art systems using the same amount of data. In addition, our method scales well with large pre-trained LMs and large amounts of training data, reporting the best results on TACRED to date.

2 Related Work

Textual Entailment.

It was first presented by 10.1007/11736790_9 and further developed by bowman-etal-2015-large, who called it Natural Language Inference (NLI). Given a textual premise and a hypothesis, the task is to decide whether the premise entails, contradicts or is neutral to the hypothesis. The current state of the art uses large pre-trained LMs fine-tuned on NLI datasets Lan2020ALBERT:; roberta; conneau-etal-2020-unsupervised; lewis-etal-2020-bart; deberta.

Relation Extraction.

The best results to date on RE are obtained by fine-tuning large pre-trained language models equipped with a classification head. joshi-etal-2020-spanbert pretrain a masked language model on random contiguous spans to learn span boundaries and predict the entire masked span. LUKE yamada-etal-2020-luke further pretrains a LM predicting entities from Wikipedia, and uses entity information as an additional input embedding layer. K-Adapter wang2020kadapter fixes the parameters of the pretrained LM and uses adapters to infuse factual and linguistic knowledge from Wikipedia and dependency parsing.

TACRED zhang2017tacred is the largest and most widely used dataset for RE in English. It is derived from the TAC-KBP relation set, with labels obtained via crowdsourcing. Although alternative versions of TACRED have been published recently alt-etal-2020-tacred; retacred, the state of the art is mainly tested on the original version.

Zero-Shot and Few-Shot learning.

brown2020language showed that task descriptions (prompts) can be fed into LMs for task-agnostic and few-shot performance. In addition, schick2020small; schick-schutze-2021-exploiting; tam2021AdaPET extend the method and allow fine-tuning of LMs on a variety of tasks. Prompt-based prediction treats the downstream task as a (masked) language modeling problem, where the model directly generates a textual response to a given prompt. The manual generation of effective prompts is costly and requires domain expertise. gao2020making provide an effective way to generate prompts for text classification tasks that surpasses the performance of hand-picked ones. The approach uses few-shot training with a generative T5 model JMLR:v21:20-074 to learn to decode effective prompts. Similarly, liu2021gpt automatically search for prompts in an embedding space which can be fine-tuned simultaneously with the pre-trained language model. Note that previous prompt-based models run their zero-shot models in a semi-supervised setting in which some amount of labeled data is given in training. Prompts can be easily generated for text classification. Other tasks require more elaborate templates goswami-etal-2020-unsupervised; li2021documentlevel, and currently no effective prompt-based methods for RE exist.

Besides prompt-based methods, the use of pivot tasks has been widely explored for few/zero-shot learning. For instance, relation and event extraction have been cast as a question answering problem levy-etal-2017-zero; du-cardie-2020-event, associating each slot label with at least one natural language question. Closer to our work, NLI too has been shown to be a successful pivot task for text classification yin-etal-2019-benchmarking; yin-etal-2020-universal; facebook_entailment; sainz-rigau-2021-ask2transformers. These works verbalize the labels, and apply an entailment engine to check whether the input text entails the label description.

In work similar to ours, the relation between entailment and RE was explored by obamuyide-vlachos-2018-zero. They present preliminary experiments casting RE as entailment, but only evaluate performance as binary entailment, not as a RE task. As a consequence they have no competing positive labels, avoiding RE inference and the issue of detecting no-relation.

Partially vs. fully unseen labels in RE.

Existing zero/few-shot RE models usually see some labels during training (labels partially unseen), which helps them generalize to the unseen labels levy-etal-2017-zero; obamuyide-vlachos-2018-zero; han-etal-2018-fewrel; chen2021zsbert. These approaches do not fully address the data scarcity problem. In this work we address the more challenging label fully unseen scenario.

Figure 1: General workflow of our entailment-based RE approach.

3 Entailment for RE

In this section we describe our models for zero- and few-shot RE.

3.1 Zero-shot relation extraction

We reformulate RE as an entailment task: given the input text containing the two entity mentions as the premise and the verbalized description of a relation as the hypothesis, the task is to infer whether the premise entails the hypothesis according to the NLI model. Figure 1 illustrates the 3 main steps of our system. The first step is focused on relation verbalization, generating the set of hypotheses. In the second we run the NLI model (the NLI models are described in Section 4.3) and obtain the entailment probability for each hypothesis. Finally, based on the probabilities and the entity types, we return the relation label that maximizes the probability of the hypothesis, including the no-relation label.

Verbalizing relations as hypotheses.

The hypotheses are automatically generated using a set of templates. Each template verbalizes the relation holding between two entity mentions. For instance, the relation per:date_of_birth can be verbalized with the following template: {subj}'s birthday is on {obj}. More formally, given the text x that contains the mentions of two entities (e1, e2) and a template t, the hypothesis is generated by verb(t, e1, e2), which substitutes the {subj} and {obj} placeholders in t with the entities e1 and e2, respectively. (Note that the entities are given in a fixed order, that is, the relation needs to hold between e1 and e2 in that order; the reverse, e2 and e1, would be a different example.) Figure 1 shows four verbalizations for the given entity pair.
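Template instantiation is plain string substitution; a minimal sketch in Python (the function name and the example entity strings are illustrative, not the paper's code):

```python
def verbalize(template: str, subj: str, obj: str) -> str:
    """Fill the {subj}/{obj} placeholders of a relation template."""
    return template.replace("{subj}", subj).replace("{obj}", obj)

# Two of the templates for per:date_of_birth; the entity values are made up.
templates = ["{subj}'s birthday is on {obj}", "{subj} was born on {obj}"]
hypotheses = [verbalize(t, "John Smith", "March 3, 1985") for t in templates]
```

Each hypothesis is then paired with the input sentence as an NLI (premise, hypothesis) pair.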

A relation label can be verbalized by one or more templates. For instance, in addition to the previous template, per:date_of_birth is also verbalized with {subj} was born on {obj}. At the same time, a template can verbalize more than one relation label. For example, {subj} was born in {obj} verbalizes both per:country_of_birth and per:city_of_birth. In order to cope with such ambiguous verbalizations, we added entity type information to each relation, e.g. COUNTRY and CITY for each of the relations in the previous example. (Alternatively, one could think of more specific verbalizations, such as {subj} was born in the city of {obj} for per:city_of_birth. In the checks done within the available 15 min., such specific verbalizations had very low recall and were not selected.)

We defined a function δr for every relation r that checks the entity type coherence between the template and the current relation label:

    δr(τ1, τ2) = 1 if τ1 ∈ E1r and τ2 ∈ E2r, and 0 otherwise

where τ1 and τ2 are the entity types of the first and second arguments, and E1r and E2r are the sets of allowed types for the first and second entities of relation r. This function is used at inference time to discard relations that do not match the given types. Appendix C lists all templates and entity type restrictions used in this work.

NLI for inferring relations.

In a second step we make use of the NLI model to infer the relation label. Given the text x containing the two entities e1 and e2, the system returns the relation r̂ from the set of possible relation labels R with the highest entailment probability as follows:

    r̂ = argmax_{r ∈ R} P(r | x)                                        (1)

The probability of each relation is computed as the probability of the hypothesis that yields the maximum entailment probability (Eq. 2), among the set of possible hypotheses. In case the two entities do not match the required entity types, the probability is zero:

    P(r | x) = δr(τ1, τ2) · max_{t ∈ Tr} Pentail(x, verb(t, e1, e2))    (2)

where Pentail(x, h) is the entailment probability between the input text x and the hypothesis h generated by the template verbalizer, and Tr is the set of templates for relation r. Although entailment models return probabilities for entailment, contradiction and neutral, Eq. 2 just makes use of the entailment probability. (The probabilities for relations defined in Eq. 2 are independent of each other, so the approach could easily be extended to a multi-label classification task.) The right-hand side of Figure 1 shows the application of NLI models and how the probability of each relation, P(r | x), is computed.
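Eqs. 1 and 2 amount to a max over templates followed by an argmax over relations. A self-contained sketch, where `entail_prob` stands for any wrapper around an NLI model that returns the entailment probability of a (premise, hypothesis) pair, and the template/type tables are illustrative assumptions:

```python
def relation_probabilities(text, subj, obj, subj_type, obj_type,
                           templates, allowed, entail_prob):
    """Eq. 2: P(r|x) = delta_r * max over r's templates of P_entail."""
    probs = {}
    for rel, rel_templates in templates.items():
        subj_types, obj_types = allowed[rel]
        if subj_type not in subj_types or obj_type not in obj_types:
            probs[rel] = 0.0  # entity types incompatible with relation r
            continue
        hyps = [t.replace("{subj}", subj).replace("{obj}", obj)
                for t in rel_templates]
        probs[rel] = max(entail_prob(text, h) for h in hyps)
    return probs

def predict(probs):
    """Eq. 1: return the relation label with the highest probability."""
    return max(probs, key=probs.get)
```

In practice `entail_prob` would run one NLI forward pass per (premise, hypothesis) pair; here it is left abstract so the control flow stays visible.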

Detection of no-relation.

In supervised RE, the no-relation case is taken as an additional label. In our case we examined two approaches.

In template-based detection we propose an additional template, as if no-relation were yet another relation label, and treat it as another positive relation in Eq. 1. The template for no-relation is: {subj} and {obj} are not related.

In threshold-based detection we apply a threshold δ to P(r | x) in Eq. 2. If none of the relations surpasses the threshold, our system returns no-relation; otherwise the model returns the relation label with the highest probability (Eq. 1). When no development data is available, the threshold δ is set to 0.5. Alternatively, we estimate δ using the available development dataset, as described in the experimental part.
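Threshold-based detection then reduces to a single comparison on top of the Eq. 2 probabilities (sketch; the probability dictionaries are illustrative):

```python
def detect_relation(probs: dict, threshold: float = 0.5) -> str:
    """Return the top relation, or no-relation if nothing passes the threshold."""
    best = max(probs, key=probs.get)
    return best if probs[best] >= threshold else "no_relation"

detect_relation({"per:origin": 0.93, "per:employee_of": 0.20})  # top relation wins
detect_relation({"per:origin": 0.31, "per:employee_of": 0.20})  # falls back to no-relation
```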

3.2 Few-Shot relation extraction

Our system is based on an NLI model which has been pretrained on annotated entailment pairs. When labeled relation examples exist, we can reformulate them as labelled NLI pairs and use them to fine-tune the NLI model to the task at hand, that is, assigning the highest entailment probability to the verbalizations of the correct relation, and low entailment probabilities to the rest of the hypotheses (see Eq. 2).

Given a set of labelled relation examples, we use the following steps to produce labelled entailment pairs for fine-tuning the NLI model. 1) For each positive relation example we generate at least one entailment instance with the templates that describe the current relation; that is, we generate one or several premise-hypothesis pairs labelled as entailment. 2) For each positive relation example we generate one neutral premise-hypothesis instance, taken at random from the templates that do not represent the current relation. 3) For each negative relation example we generate one contradiction example, taken at random from the templates of the rest of the relations.

If a template is used for the no-relation case, we do the following: First, for each no-relation example we generate one entailment example with the no-relation template. Then, for each positive relation example we generate one contradiction example using the no-relation template.
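The recasting of labelled RE examples into NLI training pairs (steps 1-3 above, plus the optional no-relation template) can be sketched as follows; the argument layout and helper names are assumptions, not the paper's code:

```python
import random

def to_nli_pairs(text, subj, obj, relation, templates, no_rel_template=None,
                 rng=random):
    """Return (premise, hypothesis, label) triples for one RE example."""
    fill = lambda t: t.replace("{subj}", subj).replace("{obj}", obj)
    other = [t for r, ts in templates.items() if r != relation for t in ts]
    pairs = []
    if relation in templates:  # positive relation example (steps 1 and 2)
        pairs += [(text, fill(t), "entailment") for t in templates[relation]]
        pairs.append((text, fill(rng.choice(other)), "neutral"))
        if no_rel_template:    # positive example contradicts the no-relation template
            pairs.append((text, fill(no_rel_template), "contradiction"))
    else:                      # no-relation example (step 3)
        pairs.append((text, fill(rng.choice(other)), "contradiction"))
        if no_rel_template:    # no-relation example entails the no-relation template
            pairs.append((text, fill(no_rel_template), "entailment"))
    return pairs
```

The resulting triples can be fed to any standard NLI fine-tuning loop.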

4 Experimental Setup

                          Train (Gold)             Train (Silver)          Development
Scenario        Split     mean / #Pos / #Neg       mean / #Pos / #Neg      mean / #Pos / #Neg
Full training   100%      317.4 / 13013 / 55112    -                       132.6 / 5436 / 17195
Zero-Shot       No Dev    -                        -                       0 / 0 / 0
                1% Dev    -                        -                       1.9 / 54 / 173
Few-Shot        1%        3.6 / 130 / 552          -                       1.9 / 54 / 173
                5%        16.3 / 651 / 2756        -                       7.0 / 272 / 861
                10%       32.6 / 1302 / 5513       -                       13.6 / 544 / 1721
Data Augment.   0%        0 / 0 / 0                246.3 / 9850 / 41205    1.9 / 54 / 173
                1%        3.6 / 130 / 552          246.3 / 9850 / 41205    1.9 / 54 / 173
                5%        16.3 / 651 / 2756        246.3 / 9850 / 41205    7.0 / 272 / 861
                10%       32.6 / 1302 / 5513       246.3 / 9850 / 41205    13.6 / 544 / 1721
Table 1: Statistics of the dataset scenarios based on TACRED used in the paper: mean number of positive examples per relation, total number of positive examples, and total number of negative (no-relation) examples.

In this section we describe the dataset and scenarios we have used for evaluation, how we performed the verbalization process, the different pre-trained NLI models we have used and the state-of-the-art baselines that we compare with.

4.1 Dataset and scenarios

We designed three different low-resource scenarios based on the large-scale TACRED zhang2017tacred dataset. The full dataset consists of 42 relation labels, including the no-relation label, and each example contains information about the entity types, among other linguistic information. The scenarios are described in Table 1 and are formed by different splits of the original dataset. We applied a stratified sampling method to keep the original label distribution.


Zero-Shot.

The aim of this scenario is the evaluation of the models when no data is available for training. We present two situations in this scenario: 1) no data is available for development (0% split) and 2) a small development set is available with around 2 examples per relation (1% split; this setting is comparable to one where the examples in the guidelines are used as development data). In this scenario the models are not allowed to train their parameters, but the development data is used to adjust the hyperparameters.

                            MNLI     No Dev (δ=0.5)        1% Dev
NLI Model       # Param.    Acc.     Pr.   Rec.   F1       Pr.   Rec.   F1
ALBERTxxLarge   223M        90.8     32.6  79.5   46.2     55.2  58.1   56.6
RoBERTa         355M        90.2     32.8  75.5   45.7     58.5  53.1   55.6
BART            406M        89.9     39.0  63.1   48.2     60.7  46.0   52.3
DeBERTaxLarge   900M        91.7     40.3  77.7   53.0     66.3  59.7   62.8
DeBERTaxxLarge  1.5B        91.7     46.6  76.1   57.8     63.2  59.8   61.4
Table 2: Zero-shot scenario results (Precision, Recall and F1) for our system using several pre-trained NLI models in two settings: no development data (default threshold δ=0.5) and a small development set (1% Dev.) used to set δ. The leftmost columns report the number of parameters and the accuracy on MNLI. For the 1% setting we report the median measures along with the F1 standard deviation over 100 runs.


Few-Shot.

This scenario presents the challenge of solving the RE task with just a few examples per relation. We present three settings commonly used in few-shot learning gao2020making (16 examples per label is the most commonly reported value in few-shot scenarios; we also added the 4 and 32 example settings to the evaluation): around 4 examples per relation (1% of the training data in TACRED), around 16 examples per relation (5%) and around 32 examples per relation (10%). We reduced the development set following the same ratio.

Full Training.

In this setting we use all available training and development data.

Data Augmentation.

In this scenario we want to test whether a silver dataset, produced by running our systems on untagged data, can be used to train a supervised relation extraction system (cf. Section 3). In this scenario 75% of the training data in TACRED is set aside as unlabeled data (we use part of the original TACRED dataset to produce silver data in order not to introduce noise coming from different documents and/or pre-processing steps), and the rest of the training data is used in different splits (ranging from 1% to 10%). Under this setting we carried out two types of experiments. In the zero-shot experiments (0% in the table) we use our NLI-based model to annotate the silver data and then fine-tune the RE model exclusively on the silver data. In the few-shot experiments the NLI model is first fine-tuned with the gold data, then used to annotate the silver data, and finally the RE model is fine-tuned on both silver and gold annotations.
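The data-augmentation loop itself is a few lines once the annotator exists; a sketch, where `nli_annotate` stands for the zero- or few-shot NLI system of Section 3 and the data layout is an assumption:

```python
def build_silver(unlabeled, nli_annotate):
    """Label an untagged pool with the NLI-based RE system (silver data)."""
    return [(example, nli_annotate(example)) for example in unlabeled]

def training_set(gold, unlabeled, nli_annotate):
    """Gold + silver training set; with gold=[] this is the 0% (zero-shot) row."""
    return gold + build_silver(unlabeled, nli_annotate)
```

The resulting set is then used to fine-tune a conventional supervised RE model.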

4.2 Hand-crafted relation templates

We manually created the templates to verbalize the relation labels, based on the TAC-KBP guidelines which underlie the TACRED dataset. We limited the time for creating the templates of each relation to less than 15 minutes. Overall, we created 1-8 templates per relation (2 on average; cf. Appendix C for the full list).

The verbalization process consists of generating one or more templates that describe the relation and contain the placeholders {subj} and {obj}. The developer building the templates was given the task guidelines (a brief description of the relation, including one or two examples and the types of the entities) and an NLI model (the roberta-large-mnli checkpoint). For a given relation, the developer would create a template (or set of templates) and check whether the NLI model outputs a high entailment probability for the template when applied to the guideline example(s), repeating the process for any new template they could come up with. There was no strict threshold involved in selecting the templates, just the intuition of the developer. The spirit was to come up with simple templates quickly, not to build numerous complex templates or to optimize entailment probabilities.
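The vetting loop above can be sketched as follows; `entail_prob` wraps the NLI model (e.g. the roberta-large-mnli checkpoint), and the acceptance cut-off is a hypothetical stand-in for what was, in the paper, the developer's intuition:

```python
def vet_template(template, guideline_examples, entail_prob, min_prob=0.75):
    """Keep a candidate template if the NLI model assigns a high entailment
    probability on every guideline example of the relation.
    guideline_examples: list of (text, subj, obj) triples."""
    for text, subj, obj in guideline_examples:
        hyp = template.replace("{subj}", subj).replace("{obj}", obj)
        if entail_prob(text, hyp) < min_prob:
            return False
    return True
```

Since the guidelines contain only one or two examples per relation, each check is nearly instantaneous, which is what makes the 15-minute budget realistic.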

4.3 Pre-Trained NLI models

For our experiments we tried different NLI models that are publicly available in the Hugging Face Transformers wolf-etal-2020-transformers Python library. We tested the following models, which implement different architectures, sizes and pre-training objectives and were fine-tuned mainly on the MNLI williams-etal-2018-broad dataset (ALBERT was trained on some additional NLI datasets): ALBERT Lan2020ALBERT:, RoBERTa roberta, BART lewis-etal-2020-bart and DeBERTa v2 deberta. Table 2 reports the number of parameters of these models. Further details on the models can be found in Appendix A.

For each of the scenarios we have tested different models. In the zero-shot and full training scenarios we compare all the pre-trained models using the templates described in Section 4.2. For few-shot we used RoBERTa for comparability, as it was used in state-of-the-art systems (cf. Section 4.4), and DeBERTa, which is the largest NLI model available on the Hub. Finally, we only tested RoBERTa in the data-augmentation experiments.

We ran 3 different runs of each experiment using different random seeds. In order to make a fair comparison with state-of-the-art systems (cf. Section 4.4), we performed a hyperparameter exploration in the full training scenario, using the resulting configuration also in the zero/few-shot scenarios. We fixed the batch size at 32 for both RoBERTa and DeBERTa, and searched for the optimum learning rate on the development set, using the best-performing value in all experiments. For more detailed information refer to Appendix B.

4.4 State-of-the-art RE models

We compared the NLI approach with the systems reporting the best results to date on TACRED: SpanBERT joshi-etal-2020-spanbert, K-Adapter wang2020kadapter and LUKE yamada-etal-2020-luke (cf. Section 2). In addition, we also report the results obtained by the vanilla RoBERTa baseline proposed by wang2020kadapter that serves as a reference for the improvements. We re-trained the different systems on each scenario setting using their publicly available implementations and best performing hyperparameters reported by the authors. All these models have a comparable number of parameters.

                     1%                    5%                    10%
Model                Pr.   Rec.   F1       Pr.   Rec.   F1       Pr.   Rec.   F1
SpanBERT             0.0   0.0    0.0      36.3  23.9   28.8     3.2   1.1    1.6
RoBERTa              56.8  4.1    7.7      52.8  34.6   41.8     61.0  50.3   55.1
K-Adapter            73.8  7.6    13.8     56.4  37.6   45.1     62.3  50.9   56.0
LUKE                 61.5  9.9    17.0     57.1  47.0   51.6     60.6  60.6   60.6
NLIRoBERTa (ours)    56.6  55.6   56.1     60.4  68.3   64.1     65.8  69.9   67.8
NLIDeBERTa (ours)    59.5  68.5   63.7     64.1  74.8   69.0     62.4  74.4   67.9
Table 3: Few-shot scenario results with 1%, 5% and 10% of the training data. Precision, Recall and F1 score (standard deviation) of the median of 3 different runs are reported. The top four rows correspond to third-party RE systems run by us.

5 Results

5.1 Zero-Shot

Figure 2: Zero-shot scenario results. Mean F1 and standard error when setting δ on an increasing number of development examples.

Table 2 shows the results for different pre-trained NLI models, as well as the number of parameters and the MNLI matched accuracy. These results were obtained by using the threshold for negative relations, as we found that it works substantially better than the no-relation template alternative (cf. Section 3.1). For instance, RoBERTa with the no-relation template yields an F1 of 30.1 (results omitted from Table 2 for brevity), well below the 45.7 obtained when using the default threshold (δ = 0.5). Overall we see excellent zero-shot performance across all the models and settings, proving that the approach is robust and model agnostic.

Regarding pre-trained models, the best F1 scores are obtained by the two DeBERTa v2 models, which also score best on the MNLI dataset. Note that all the models achieve similar scores on MNLI, but small differences in MNLI result in large performance gaps when it comes to RE, e.g. the 1.5 point difference in MNLI between RoBERTa and DeBERTa becomes 7 points in No Dev. and 1% Dev. We think the larger differences in RE are due to the ability of some of the larger models to generalize over domain and task differences.

The table includes the results for different values of the δ hyperparameter. In the most challenging setting, with the default δ = 0.5, the results are the worst, with at most 57.8 F1. However, using as few as 2 examples per relation on average (1% Dev. setting) the results improve significantly.

We performed further experiments using larger amounts of development data to tune δ. Figure 2 shows that, for all models, the most significant improvement occurs in the interval [0%, 1%) and that the interval [1%, 100%] is almost flat. The best result with all the development data is 63.4 F1, only 0.6 points better than using 1% of the development data. These results show clearly that a small number of examples suffices to set an optimal threshold.
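Tuning δ is a one-dimensional sweep over candidate thresholds; a sketch, assuming the development predictions are available as (top relation, Eq. 2 probability, gold label) triples and using the standard TACRED micro-F1 that excludes no-relation as a class:

```python
def micro_f1(preds, golds):
    """Micro-averaged F1 over positive relations (no-relation is not a class)."""
    tp = sum(p == g and g != "no_relation" for p, g in zip(preds, golds))
    pred_pos = sum(p != "no_relation" for p in preds)
    gold_pos = sum(g != "no_relation" for g in golds)
    if tp == 0:
        return 0.0
    prec, rec = tp / pred_pos, tp / gold_pos
    return 2 * prec * rec / (prec + rec)

def tune_threshold(dev, thresholds=tuple(i / 20 for i in range(21))):
    """Return the delta maximizing dev F1. dev: (relation, prob, gold) triples."""
    golds = [g for _, _, g in dev]
    def preds(t):
        return [r if p >= t else "no_relation" for r, p, _ in dev]
    return max(thresholds, key=lambda t: micro_f1(preds(t), golds))
```

Because the sweep touches no model weights, even a development set of a couple of examples per relation is enough to locate a good δ, consistent with the flat curves in Figure 2.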

5.2 Few-Shot

Table 3 shows the results of competing RE systems and our systems in the few-shot scenario. We report the median and standard deviation across 3 different runs. The competing RE methods suffer a large performance drop, especially in the smallest training setting. For instance, the SpanBERT system joshi-etal-2020-spanbert has difficulty converging, even with the 10% data setting. Both K-Adapter wang2020kadapter and LUKE yamada-etal-2020-luke improve over the RoBERTa system wang2020kadapter in all three settings, but they are well below our NLIRoBERTa system, which improves over the baseline by 48, 22 and 13 points in each setting. We also report our method based on DeBERTaxLarge, which is especially effective in the smaller settings.

We would like to note that the zero-shot NLIRoBERTa system (1% Dev) is comparable in terms of F1 score to a vanilla RoBERTa trained with 10% of the training data. That is, 54 templates (around 10.5 hours of work) plus 227 development examples are roughly equivalent to 6800 annotated training examples (plus 2265 development examples). (Unfortunately we could not find time estimates for annotating examples.)

5.3 Full training

Model Pr. Rec. F1
SpanBERT 70.8 70.9 70.8
RoBERTa 70.2 72.4 71.3
K-Adapter 70.1 74.0 72.0
LUKE 70.4 75.1 72.7
NLIRoBERTa (ours) 71.6 70.4 71.0
NLIDeBERTa (ours) 72.5 75.3 73.9
Table 4: Full training results (TACRED). Top four rows for third-party RE systems as reported by authors.

Some zero-shot and few-shot systems are not able to improve their results when larger amounts of training data are available. Table 4 reports the results when the whole training and development datasets are used, which is comparable to official results on TACRED. Focusing on our NLIRoBERTa system, and comparing it to the results in Table 3, we can see that it is able to effectively use the additional training data, improving from 67.8 to 71.0. When compared to a traditional RE system, it performs on a par with RoBERTa, and a little behind K-Adapter and LUKE, probably due to the infused knowledge which our model is not using. These results show that our model keeps improving with additional data and that it is competitive when larger amounts of training data are available. The results of NLIDeBERTa show that our model can benefit from larger and more effective pre-trained NLI systems even in full training scenarios, and in fact achieves the best results to date on the TACRED dataset.

5.4 Data augmentation results

Model 0% 1% 5% 10%
RoBERTa - 7.7 41.8 55.1
+ Zero-Shot DA 56.3 58.4 58.8 59.7
+ Few-Shot DA - 58.4 64.9 67.7
Table 5: Data Augmentation scenario results (F1) for different gold training sizes. Silver annotations by the zero-shot and few-shot NLIRoBERTa model.

In this section we explore whether our NLI-based system can produce high-quality silver data which can be added to a small amount of gold data when training a traditional supervised RE system, e.g. the RoBERTa baseline wang2020kadapter. Table 5 reports the F1 results in the data augmentation scenario for different amounts of gold training data. Overall, we can see that both our zero-shot and few-shot methods (the zero-shot 1% Dev model is used in all data augmentation experiments, while the few-shot method uses the data available at each run, i.e. 1%, 5% and 10%; both are based on RoBERTa) provide good quality silver data, as they improve significantly over the baseline in all settings. Although the zero-shot and few-shot methods yield the same result with 1% of the training data, the few-shot model is better in the rest of the training regimes, showing that it can effectively use the available training data in each case to provide better quality silver data. If we compare the results in this table with those of the respective NLI-based systems with the same amount of gold training instances (Tables 2 and 3), we can see that the results are comparable, showing that our NLI-based system and a traditional RE system trained with silver annotations perform comparably. A practical advantage of a traditional RE system trained with our silver data is that it is easier to integrate into existing pipelines, as one just needs to download the trained Transformer model. It also makes it easy to check for additive improvements in the RE method.

6 Analysis

Model         Scenario             P      PvsN
NLIDeBERTa    Zero-Shot  No Dev    85.6   59.5
              Zero-Shot  1% Dev    85.6   67.7
              Few-Shot   5%        89.7   74.5
              Full train           92.2   77.8
LUKE          Few-Shot   5%        69.3   63.4
              Full train           90.2   77.3
Table 6: Performance of selected systems and scenarios on two metrics: the binary task of detecting a positive relation vs. no-relation (PvsN column, F1) and detecting the correct relation among positive cases (P column, F1).

Relation extraction can be analysed according to two auxiliary metrics: the binary task of detecting a positive relation vs. no-relation, and the multi-class problem of detecting which relation holds among positive cases (that is, discarding no-relation instances from the test data). Table 6 shows the results for a selection of systems and scenarios. The first rows compare the performance of our best system, NLIDeBERTa, across four scenarios, while the last two rows show the results for LUKE in two scenarios. The zero-shot No Dev system is very effective when discriminating the relation among positive examples (P column), only 7 points below the fully trained system, while it lags well behind when discriminating positive vs. negative, by 18 points. The use of a small development set for tuning the threshold closes part of the gap in PvsN, as expected, but the difference is still 10 points. All in all, these numbers show that our zero-shot system is very effective at discriminating among positive examples, but that it still lags behind when detecting no-relation cases. Overall, the figures show the effectiveness of our methods in low data scenarios on both metrics.
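Both auxiliary metrics can be computed directly from prediction/gold label lists; a sketch of one plausible implementation (helper names are ours; the positive-only metric is computed here as accuracy, matching the 85.6% accuracy figure mentioned in the running text):

```python
def positive_vs_negative_f1(preds, golds):
    """PvsN: binary F1 for detecting that *some* relation holds."""
    pos = lambda label: label != "no_relation"
    tp = sum(pos(p) and pos(g) for p, g in zip(preds, golds))
    pred_pos = sum(map(pos, preds))
    gold_pos = sum(map(pos, golds))
    if tp == 0:
        return 0.0
    prec, rec = tp / pred_pos, tp / gold_pos
    return 2 * prec * rec / (prec + rec)

def positive_accuracy(preds, golds):
    """P: accuracy over gold-positive instances only (no-relation discarded)."""
    kept = [(p, g) for p, g in zip(preds, golds) if g != "no_relation"]
    return sum(p == g for p, g in kept) / len(kept)
```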

Figure 3: Confusion matrix of our NLIDeBERTa zero-shot system on the development dataset. The rows represent the true labels and the columns the predictions. The matrix is row-wise normalized (recall on the diagonal).

Confusion analysis

In supervised models some classes (relations) are better represented in training than others, usually due to data imbalance. Our system, instead, represents each relation as a set of templates, which, at least in a zero-shot scenario, should not be affected by data imbalance. The strong diagonal in the confusion matrix (Fig. 3) shows that the model is able to discriminate properly between most of the relations (after all, it achieves 85.6% accuracy, cf. Table 6), with the exception of the no-relation column, which was expected. Regarding the confusion between actual relations, most cases involve overlapping relations, as expected. For instance, org:member_of and org:parents both involve some organization A being part or member of some other organization B, where org:member_of differs from org:parents in that correct fillers are distinct entities that are generally capable of autonomously ending their membership with the assigned organization (description extracted from the guidelines). Something similar occurs between org:members and org:subsidiaries. Another source of confusion arises when two or more relations hold concurrently, as in per:origin, per:country_of_birth and per:country_of_residence. Finally, the model scores low on per:other_family, which is a bucket of many specific relations of which only a handful were actually covered by the templates.

7 Conclusions

In this work we reformulate relation extraction as an entailment problem, and explore to what extent simple hand-made verbalizations are effective. Template creation is limited to 15 minutes per relation, and yet allows for excellent results in zero- and few-shot scenarios. Our method makes effective use of the available labeled examples and, combined with larger LMs, produces the best results on TACRED to date. Our analysis indicates that the main performance difference with respect to supervised models comes from discriminating no-relation examples, as the performance among positive examples equals that of the best supervised system using the full training data. We also show that our method can be used effectively for data augmentation, providing additional labeled examples. In the future we would like to investigate better methods for detecting no-relation in zero-shot settings.


Oscar is funded by a PhD grant from the Basque Government (PRE_2020_1_0246). This work is based upon work partially supported via the IARPA BETTER Program contract No. 2019-19051600006 (ODNI, IARPA), and by the Basque Government (IXA excellence research group IT1343-19).


Appendix A Pre-Trained models

The pre-trained NLI models we tested from the Transformers library are the following:

  • ALBERT: ynie/albert-xxlarge-v2-snli_mnli _fever_anli_R1_R2_R3-nli

  • RoBERTa: roberta-large-mnli

  • BART: facebook/bart-large-mnli

  • DeBERTa v2 xLarge: microsoft/deberta-v2-xlarge-mnli

  • DeBERTa v2 xxLarge: microsoft/deberta-v2-xxlarge-mnli
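Any of these checkpoints can be dropped into the entailment-based inference loop: each relation's templates are instantiated as hypotheses against the context (premise), the relation with the best-scoring template wins, and the system falls back to no-relation when no template clears the threshold. A sketch of that loop, where `nli_score` is a stand-in for whichever model above is used (the function names and signature are ours, not the paper's code):

```python
from typing import Callable, Dict, List, Tuple

def predict_relation(
    context: str,
    subj: str,
    obj: str,
    templates: Dict[str, List[str]],         # relation -> verbalizations
    nli_score: Callable[[str, str], float],  # (premise, hypothesis) -> P(entail)
    threshold: float = 0.5,
) -> Tuple[str, float]:
    """Score every verbalized template as an NLI hypothesis against the
    context; return the best relation, or no_relation below the threshold."""
    best_rel, best = "no_relation", threshold
    for rel, temps in templates.items():
        for template in temps:
            hypothesis = template.format(subj=subj, obj=obj)
            score = nli_score(context, hypothesis)
            if score > best:
                best_rel, best = rel, score
    return best_rel, best
```

Injecting the scorer as a callable keeps the loop independent of the backbone, which is how the same templates can be evaluated across ALBERT, RoBERTa, BART and DeBERTa checkpoints.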

Appendix B Experimental details

We carried out all the experiments on a single Titan V (16GB), except for the fine-tuning of DeBERTa, which was done on a cluster of 4 Titan V100 GPUs (32GB). The average inference time for the zero- and few-shot experiments is between 1h and 1.5h. Fine-tuning the NLI systems took at most 2.5h for RoBERTa and 5h for DeBERTa. All experiments used mixed precision to speed up the overall runtime.

The full hyperparameter settings used for fine-tuning NLIRoBERTa and NLIDeBERTa are listed below:

  • Train epochs:

  • Warmup steps: 1000

  • Learning-rate: 4e-6

  • Batch-size: 32

  • FP16 training

  • Seeds: {0, 24, 42}

Note that we are fine-tuning an already trained NLI system, so we kept the number of epochs and the learning rate low. The rest of the state-of-the-art systems were trained using the hyperparameters reported by their authors.

Appendix C TACRED templates

This section describes the templates used in the TACRED experiments. We performed all the experiments using the templates shown in Tables 1 (for PERSON relations) and 2 (for ORGANIZATION relations). These templates were manually created based on the TAC KBP Slot Descriptions (annotation guidelines). Besides the templates, we also report the valid argument types accepted for each relation.

Relation Templates Valid argument types
per:alternate_names {subj} is also known as {obj} PERSON, MISC
per:date_of_birth {subj}’s birthday is on {obj} DATE
{subj} was born on {obj}
per:age {subj} is {obj} years old NUMBER, DURATION
per:country_of_birth {subj} was born in {obj} COUNTRY
per:stateorprovince_of_birth {subj} was born in {obj} STATE_OR_PROVINCE
per:city_of_birth {subj} was born in {obj} CITY, LOCATION
per:origin {obj} is the nationality of {subj} NATIONALITY, COUNTRY, LOCATION
per:date_of_death {subj} died in {obj} DATE
per:country_of_death {subj} died in {obj} COUNTRY
per:stateorprovince_of_death {subj} died in {obj} STATE_OR_PROVINCE
per:city_of_death {subj} died in {obj} CITY, LOCATION
per:cause_of_death {obj} is the cause of {subj}’s death CAUSE_OF_DEATH
per:countries_of_residence {subj} lives in {obj} COUNTRY, NATIONALITY
{subj} has a legal order to stay in {obj}
per:statesorprovinces_of_residence {subj} lives in {obj} STATE_OR_PROVINCE
{subj} has a legal order to stay in {obj}
per:cities_of_residence {subj} lives in {obj} CITY, LOCATION
{subj} has a legal order to stay in {obj}
per:schools_attended {subj} studied in {obj} ORGANIZATION
{subj} graduated from {obj}
per:title {subj} is a {obj} TITLE
per:employee_of {subj} is a member of {obj} ORGANIZATION
per:religion {subj} belongs to {obj} RELIGION
{obj} is the religion of {subj}
{subj} believes in {obj}
per:spouse {subj} is the spouse of {obj} PERSON
{subj} is the wife of {obj}
{subj} is the husband of {obj}
per:children {subj} is the parent of {obj} PERSON
{subj} is the mother of {obj}
{subj} is the father of {obj}
{obj} is the son of {subj}
{obj} is the daughter of {subj}
per:parents {obj} is the parent of {subj} PERSON
{obj} is the mother of {subj}
{obj} is the father of {subj}
{subj} is the son of {obj}
{subj} is the daughter of {obj}
per:siblings {subj} and {obj} are siblings PERSON
{subj} is brother of {obj}
{subj} is sister of {obj}
per:other_family {subj} and {obj} are family PERSON
{subj} is a brother in law of {obj}
{subj} is a sister in law of {obj}
{subj} is the cousin of {obj}
{subj} is the uncle of {obj}
{subj} is the aunt of {obj}
{subj} is the grandparent of {obj}
{subj} is the grandmother of {obj}
{subj} is the grandson of {obj}
{subj} is the granddaughter of {obj}
per:charges {subj} was convicted of {obj} CRIMINAL_CHARGE
{obj} are the charges of {subj}
Table 1: Templates and valid arguments for PERSON relations.
Relation Templates Valid argument types
org:alternate_names {subj} is also known as {obj} ORGANIZATION, MISC
org:political/religious_affiliation {subj} has political affiliation with {obj} RELIGION, IDEOLOGY
{subj} has religious affiliation with {obj}
org:top_members/employees {obj} is a high level member of {subj} PERSON
{obj} is chairman of {subj}
{obj} is president of {subj}
{obj} is director of {subj}
org:number_of_employees/members {subj} employs nearly {obj} people NUMBER
{subj} has about {obj} employees
org:members {obj} is member of {subj} ORGANIZATION, COUNTRY
{obj} joined {subj}
org:subsidiaries {obj} is a subsidiary of {subj} ORGANIZATION, LOCATION
{obj} is a branch of {subj}
org:parents {subj} is a subsidiary of {obj} ORGANIZATION, COUNTRY
{subj} is a branch of {obj}
org:founded_by {subj} was founded by {obj} PERSON
{obj} founded {subj}
org:founded {subj} was founded in {obj} DATE
{subj} was formed in {obj}
org:dissolved {subj} existed until {obj} DATE
{subj} disbanded in {obj}
{subj} dissolved in {obj}
org:country_of_headquarters {subj} has its headquarters in {obj} COUNTRY
{subj} is located in {obj}
org:stateorprovince_of_headquarters {subj} has its headquarters in {obj} STATE_OR_PROVINCE
{subj} is located in {obj}
org:city_of_headquarters {subj} has its headquarters in {obj} CITY, LOCATION
{subj} is located in {obj}
org:shareholders {obj} holds shares in {subj} ORGANIZATION, PERSON
org:website {obj} is the URL of {subj} URL
{obj} is the website of {subj}
Table 2: Templates and valid arguments for ORGANIZATION relations.
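The valid-argument-type columns above act as a filter: before any template is scored, relations whose argument types do not match the detected entity types are discarded. A minimal sketch, assuming entity types come from an upstream NER step (the dictionary below is a tiny excerpt of the full tables, and the helper name is ours):

```python
from typing import Dict, List, Set

# Excerpt of the "Valid argument types" columns from Tables 1 and 2.
VALID_TYPES: Dict[str, Set[str]] = {
    "per:date_of_birth": {"DATE"},
    "per:city_of_birth": {"CITY", "LOCATION"},
    "org:founded_by": {"PERSON"},
}

def candidate_relations(subj_type: str, obj_type: str) -> List[str]:
    """Keep only relations whose object type matches the detected entity
    type; the subject type is implied by the per:/org: prefix."""
    prefix = "per:" if subj_type == "PERSON" else "org:"
    return [
        rel for rel, types in VALID_TYPES.items()
        if rel.startswith(prefix) and obj_type in types
    ]
```

Filtering by type shrinks the hypothesis set per example, which both speeds up inference and removes impossible relations before the entailment model ever sees them.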