An Empirical Study on Few-shot Knowledge Probing for Pretrained Language Models

Tianxing He, et al.
New York University

Prompt-based knowledge probing for 1-hop relations has been used to measure how much world knowledge is stored in pretrained language models. Existing work uses considerable amounts of data to tune the prompts for better performance. In this work, we compare a variety of approaches under a few-shot knowledge probing setting, where only a small number (e.g., 10 or 20) of example triples are available. In addition, we create a new dataset named TREx-2p, which contains 2-hop relations. We report that few-shot examples can strongly boost the probing performance for both 1-hop and 2-hop relations. In particular, we find that a simple-yet-effective approach of finetuning the bias vectors in the model outperforms existing prompt-engineering methods. Our dataset and code are available at <>.





1 Introduction

Large-scale unsupervised pretraining of language models (LMs) (Peters et al., 2018; Devlin et al., 2018; Song et al., 2019; Yang et al., 2019; Liu et al., 2019) has been shown to greatly boost performance on a wide variety of natural language processing (NLP) tasks. It is natural to wonder how much world knowledge is embedded within these pretrained LMs. In the LAMA benchmark (Petroni et al., 2019), templates (e.g., “[X] was born in <mask>.”) are used to create natural-language prompts for 1-hop relations in an existing knowledge graph. The accuracy with which the model predicts the correct object token is treated as a proxy for how knowledgeable the pretrained LM is. This line of investigation (Poerner et al., 2020; Kassner and Schütze, 2020; Kassner et al., 2021; Heinzerling and Inui, 2021) also points to an exciting potential application of using a pretrained LM as an implicit knowledge base.

Unfortunately, the zero-shot performance of the manually created templates from LAMA is low. For example, the BERT-large model only has around 30% accuracy on the T-REx dataset. In our preliminary examinations, we found that in many error cases, the LM predicts the wrong type of objects. We illustrate this in Figure 1.

Motivated by this observation, in this work we explore few-shot knowledge probing, where only a small number (e.g., 10 or 20) of example triples are available to tune the prompts or model for better performance. This setting is attractive because: (1) Intuitively, a few examples are usually enough for humans to infer the precise relation type of interest; (2) Few-shot examples enable us to probe for new or rare relation types.

Figure 1: Few-shot examples can potentially correct the model’s prediction in knowledge probing.

In our experiments, we conduct a comprehensive comparison of different approaches in the context of few-shot knowledge probing. We briefly summarize our contributions as follows: (1) We create a new knowledge probing dataset named TREx-2p, which contains more challenging 2-hop relations. (2) For both 1-hop and 2-hop relations, few-shot examples strongly boost the knowledge probing performance for a pretrained LM. In particular, we find that a simple-yet-effective approach of finetuning the bias vectors in the model outperforms existing prompt-engineering methods.

2 Few-shot Knowledge Probing

We begin by establishing notation. We denote the parameters of a pretrained masked language model as θ, and its vocabulary as V. For each relation type r, the probing dataset has a set of knowledge-base triples D_r = {⟨x, y⟩}, where x and y refer to the subject and object, respectively. Since we are considering a few-shot setting, each D_r is split into D_r^few and D_r^test, where D_r^few contains only a small number (e.g., 10 or 20) of triples. Most of our approaches involve hyper-parameter tuning, in which case we further split D_r^few into D_r^train and D_r^dev. D_r^dev can also be used to prevent over-fitting (via early stopping).

The task of LM knowledge probing (Petroni et al., 2019) is to query a pretrained LM for the object y by feeding it information about the subject x and the relation r. To do so, a converter function f_r (to be described below) is used to convert x and r into a query sentence with exactly one mask token in it, which is then fed into the LM. We denote the model's output distribution for the masked token as P_θ(· | f_r(x)), and the performance is reflected by the rank of y in that distribution.

Next, we review available template options, and describe approaches which utilize the available few-shot training and development data to improve the performance of probing. Concrete examples are shown in Table 1.

manT/mineT Andrea Alciato was born in <mask> . / Andrea Alciato lived in <mask> .
defT Andrea Alciato => <mask> .
manT+ctx Joan Dickson was born in Edinburgh . Charles Helou was born in Beirut . Andrea Alciato was born in <mask> .
optiPrompt Andrea Alciato <V0> <V1> <V2> <V3> <V4> <mask>
optiP+manT Andrea Alciato <V0>:=was <V1>:=born <V2>:=in <mask> <V3>:=.
Table 1: Examples of how different types of converters form an input for the masked language model. The relation is “place of birth”, and the subject being queried is Andrea Alciato. The few-shot examples consist of ⟨Joan Dickson, Edinburgh⟩ and ⟨Charles Helou, Beirut⟩, which can be used for in-context learning.

Template Options

In Petroni et al. (2019), the converter function (denoted by manT) is implemented via manually created templates that are hand-crafted for each relation type. In Jiang et al. (2020), mining- or paraphrasing-based methods are used to automatically find alternative templates for a given relation. They released the generated templates under the name LPAQA (LM Prompt And Query Archive), which are in the same format as in Petroni et al. (2019). For each relation, we select the best-performing LPAQA template by comparing the performance of each template on D_r^train, and use it for the converter function denoted by mineT.

Both manT and mineT require human labor or external resources; in the few-shot setting we consider, it is therefore reasonable to question whether such manual work is necessary. To explore this question, we follow Brown et al. (2020) and create a default template of “[X] => [Y]”, which can be applied to any relation type. We denote it by defT, and it can be used in the in-context learning approach described next.

In-context Learning

As shown by Brown et al. (2020), pretrained LMs are able to learn from examples included in the input. To implement this approach, we concatenate converted triples from D_r^train into a long prefix and prepend it to our queries. We denote this converter by *+ctx, where * is a placeholder for the template option (e.g., manT). In our experiments we find that the order of the prefix examples affects performance. Therefore, for each relation type, we tune the ordering as a hyper-parameter via D_r^dev.
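As an illustration, this in-context prefix construction can be sketched as follows (a minimal sketch with hypothetical helper names, not the authors' code), using the "place of birth" example from Table 1:

```python
def fill_template(template, subject, obj="<mask>"):
    # Instantiate a "[X] ... [Y]"-style template; at query time the object
    # slot [Y] becomes the mask token.
    return template.replace("[X]", subject).replace("[Y]", obj)

TEMPLATE = "[X] was born in [Y] ."

def build_query(train_triples, subject):
    # Converted training triples are concatenated into a long prefix,
    # and the actual query (with exactly one mask token) is appended.
    prefix = " ".join(fill_template(TEMPLATE, s, o) for s, o in train_triples)
    return prefix + " " + fill_template(TEMPLATE, subject)

q = build_query([("Joan Dickson", "Edinburgh"), ("Charles Helou", "Beirut")],
                "Andrea Alciato")
```

Shuffling the order of `train_triples` changes the query string, which is why the ordering is treated as a hyper-parameter.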

Optimized Prompts

It is attractive to consider approaches that automatically design prompts, minimizing human effort. AutoPrompt (Shin et al., 2020) and BERTese (Haviv et al., 2021) use gradient-based search to automatically find templates, in the form of discrete tokens, that maximize the model's performance on a training set. Very recently, OptiPrompt (Zhong et al., 2021) generalizes this to continuous vectors, and achieves better performance than AutoPrompt. In OptiPrompt, five relation vectors are put between the subject and the mask token before the input is fed into the model.[1]

[1] We have also tried 8 or 10 relation vectors, but observed only very small improvements.

These relation vectors are trained to minimize the cross-entropy loss for the object with stochastic gradient descent (SGD):

\min_{V_0, \ldots, V_4} \; - \sum_{\langle x, y \rangle \in D_r^{\mathrm{train}}} \log P_\theta\!\left(y \mid x, V_0, \ldots, V_4\right)
By default, we initialize the relation vectors to be the mean of the input embeddings of the 10,000 most frequent tokens in the pretrained LM's vocabulary. We can also align the relation vectors with the manual template, initializing each to the embedding of the corresponding token in the template (denoted by optiP+manT).
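A minimal sketch of this initialization and of how the OptiPrompt input sequence is assembled (toy sizes and random embeddings, purely illustrative; Roberta-large uses a ~50k-token vocabulary and hidden size 1024):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the pretrained LM's input embedding table.
vocab_size, hidden = 1000, 16
embedding_table = rng.normal(size=(vocab_size, hidden))

def init_relation_vectors(n_vectors=5, n_frequent=100):
    # Each relation vector starts as the mean input embedding of the most
    # frequent tokens (here we assume token ids are sorted by frequency).
    mean_emb = embedding_table[:n_frequent].mean(axis=0)
    return np.tile(mean_emb, (n_vectors, 1))

def build_optiprompt_input(subject_ids, relation_vectors, mask_id):
    # OptiPrompt-style input: [subject tokens] [V0 .. V4] [<mask>].
    subject_emb = embedding_table[subject_ids]
    mask_emb = embedding_table[mask_id][None, :]
    return np.concatenate([subject_emb, relation_vectors, mask_emb], axis=0)

rel_vecs = init_relation_vectors()
inputs = build_optiprompt_input([17, 803], rel_vecs, mask_id=4)
```

Only `rel_vecs` is updated by SGD; the embedding table and the rest of the model stay fixed.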

These studies utilize a considerable number of example triples (around 1000 samples per relation type) to train the prompts, and their performance under a few-shot setting is unknown.

Model Finetuning

All the approaches discussed above engineer the input while the pretrained LM is kept fixed. Therefore, it is natural to consider finetuning the model with the available templates or relation vectors as input. The major shortcoming is that we would need to store a copy of the entire model for each relation type (Lester et al., 2021; Li and Liang, 2021).

Model Bias Finetuning

To mitigate the storage issue, Ben-Zaken et al. (2020) propose finetuning only the bias vectors in the encoder. This approach, named BitFit, is shown to be very competitive on the GLUE benchmark (Wang et al., 2018). Further details and a storage-cost comparison are given in Appendix A. In our experiments, we test its performance under few-shot knowledge probing and compare it with full-model finetuning.
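For intuition, the BitFit selection rule can be sketched as follows (hypothetical parameter names, shapes only; not the actual implementation, which follows the variant described in Appendix A):

```python
# Hypothetical named parameters (as shapes) of one transformer layer.
params = {
    "layer0.attention.query.weight": (1024, 1024),
    "layer0.attention.query.bias": (1024,),
    "layer0.intermediate.dense.weight": (4096, 1024),
    "layer0.intermediate.dense.bias": (4096,),
}

def bitfit_trainable(params):
    # BitFit: only bias vectors receive gradient updates; the weight
    # matrices (the vast majority of parameters) stay frozen.
    return {name for name in params if name.endswith(".bias")}

trainable = bitfit_trainable(params)
```

Because only the tuned bias vectors need to be stored per relation type, the per-relation storage cost is tiny compared to saving a full model copy.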

3 Datasets

Following Zhong et al. (2021) and Shin et al. (2020), we use the T-REx (Elsahar et al., 2018) dataset, which is included in the LAMA benchmark (Petroni et al., 2019). It contains 41 Wikidata relation types, and each relation type has up to 1000 triples. We will refer to it as TREx-1p as it focuses on 1-hop relations.

In addition to memorizing 1-hop relations, humans also possess the capability of multi-hop reasoning (Yang et al., 2018; Xiong et al., 2017). For example, given the two known facts "[X] works for [Y]." and "[Y] produces [Z].", there is clearly a 2-hop link between X and Z (e.g., X being Steve Jobs, Y being Apple, and Z being iPhone). To probe whether the pretrained LM also possesses this kind of “indirect” knowledge, we create a 2-hop variant of the T-REx dataset, named TREx-2p.[2] We manually examine the 2-hop links existing in the knowledge graph of TREx-1p, and select eight 2-hop relation types that make sense to humans.

[2] We will release the data and code used in this work in the public version of this manuscript.

As in LAMA, we manually create natural-language templates for relations in TREx-2p. We show them in Table 4 (Appendix B). To encode the 2-hop relations, these templates are syntactically more complicated (e.g., “[X] works for a company that developed [Y] .”). Therefore, we expect the zero-shot probing performance of TREx-2p with manual templates to be low.

4 Experiments

Our experiments focus on the Roberta-large model (Liu et al., 2019), a 24-layer transformer LM with a hidden dimension of 1024. Our code is based on HuggingFace (Wolf et al., 2020) and the released code from LAMA. The few-shot development set (D_r^dev) is used for hyper-parameter tuning. We find that finetuning with the few-shot training examples is very prone to over-fitting. Therefore, during SGD finetuning we do early stopping by monitoring the loss on the development set every 10 iterations. More details are in Appendix A.
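The early-stopping scheme can be sketched as follows (a toy simulation, not the actual training code; `step_fn` and `dev_loss_fn` are hypothetical callbacks for one full-batch update and one dev-set evaluation):

```python
def train_with_early_stopping(step_fn, dev_loss_fn, max_iters=1000,
                              check_every=10, patience=5):
    """Run full-batch updates; check dev loss every `check_every` iterations
    and stop after `patience` consecutive checks without improvement."""
    best_loss, best_iter, bad_checks = float("inf"), 0, 0
    for it in range(1, max_iters + 1):
        step_fn()  # one full-batch gradient update (placeholder here)
        if it % check_every == 0:
            loss = dev_loss_fn()
            if loss < best_loss:
                best_loss, best_iter, bad_checks = loss, it, 0
            else:
                bad_checks += 1
                if bad_checks >= patience:
                    break
    return best_iter, best_loss

# Toy simulation: dev loss improves for a while, then over-fitting sets in.
sim_losses = iter([5, 4, 3, 4, 5, 6, 7, 8, 9, 10])
best_iter, best_loss = train_with_early_stopping(lambda: None,
                                                 lambda: next(sim_losses))
```

In the simulation, training would be stopped shortly after the dev loss bottoms out, and the checkpoint from the best check is kept.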

In addition to accuracy (Precision@1), we also report mean reciprocal rank (MRR), to account for cases with multiple valid targets. Following earlier work (Petroni et al., 2019), we report macro-averaged numbers across different relation types.
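For concreteness, the two metrics can be sketched as follows (illustrative code with made-up example data, not the evaluation scripts); a test case may have multiple valid targets, as discussed in Appendix A:

```python
def precision_at_1(ranked, targets):
    # Accuracy counts a case as correct only if the top-1 token is valid.
    return 1.0 if ranked[0] in targets else 0.0

def mrr(ranked, targets):
    # Reciprocal rank of the best-ranked valid target (0 if none appears).
    hits = [ranked.index(t) for t in targets if t in ranked]
    return 1.0 / (min(hits) + 1) if hits else 0.0

def macro_average(per_relation_scores):
    # Following Petroni et al. (2019), scores are averaged within each
    # relation type first, then across relation types.
    return sum(per_relation_scores) / len(per_relation_scores)

ranked = ["Paris", "London", "Edinburgh"]  # model's ranking for the mask
score = mrr(ranked, targets={"Edinburgh"})
```

With multiple valid targets, MRR rewards a model that ranks any valid object highly, while Precision@1 only credits an exact top-1 hit.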

The few-shot examples are randomly selected from the dataset for each relation type, and the rest of the samples are used for evaluation. We compare the performance of different approaches under settings where 10/20/40 example triples are available. Out of the available examples, 5/10/10 samples are taken out as a development set, leaving the rest for training. The same training/development sets are used across different approaches.

Accuracy(%) Prompt Engineering In-context Learning Model FT BitFit
TREx-1p manT mineT optiP optiP+manT manT defT mineT manT defT manT defT optiP+manT
5T+5D 0-shot: 25.8 34.9 40.0 49.4 49.0 47.3 48.9 49.1 44.8 49.2 45.4 49.8
10T+10D 36.3 47.9 49.7 50.3 51.1 51.6 51.3 49.4 52.4 48.9 52.1
30T+10D 37.0 52.3 52.5 50.0 52.1 51.0 54.1 53.2 54.5 53.3 54.0
TREx-2p manT / optiP optiP+manT manT defT / manT defT manT defT optiP
5T+5D 0-shot: 14.4 / 43.2 41.3 47.5 45.0 / 45.6 48.1 44.6 46.9 48.0
10T+10D / 50.1 46.7 44.0 44.0 / 50.1 48.9 51.4 51.5 50.1
30T+10D / 51.8 52.0 53.0 50.3 / 53.5 54.2 53.6 53.5 55.7
Table 2: The accuracy of different approaches on the TREx-1/2p datasets. “5T+5D” means that 5 examples are used for training and 5 as a development set. Some combinations of approaches (such as model finetuning with OptiPrompt) are deferred to Appendix B due to lack of space.

The accuracy results are shown in Table 2. Observations from results measured by MRR are highly similar, and we defer them to Table 5 (Appendix B) to save space. In general, we observe that for both 1-hop and 2-hop relations, large gains can be achieved with as few as 10 available examples in comparison to the zero-shot performance.

For prompt engineering, OptiPrompt greatly outperforms manual or LPAQA (mineT) templates, which agrees with the non-few-shot results in Zhong et al. (2021). This confirms the advantage of a continuous prompt as opposed to discrete tokens. Next, in-context learning is competitive in the 10/20-shot setting. However, its performance saturates quickly, and is outperformed by OptiPrompt in the 40-shot setting.

Directly finetuning a large model with only a few training examples is usually considered difficult due to over-fitting. Interestingly, we find that early stopping with the tiny development set can effectively regularize the training, and model finetuning gives better accuracy than OptiPrompt in most cases. More excitingly, BitFit, which only tunes the bias parameters, achieves similar or even better accuracy than full-model finetuning. In some cases, BitFit benefits from the extra flexibility of taking OptiPrompt vectors as input. Lastly, we observe that manual templates perform better than the default template for OptiPrompt, model finetuning, and BitFit, showing a complementary effect to the few-shot examples.

Will more examples give better performance? In Appendix B, we show that the performance saturates at around 200 examples with an accuracy of 57.5%.

Figure 2: (TREx-1p) Only finetuning the bias parameter in the output layer or the final hidden layer gives worse performance than BitFit or OptiPrompt.

For TREx-2p, the general observations are similar to TREx-1p. We mention two differences: (1) The zero-shot performance of manual templates for TREx-2p is poor (only 14.4%), which is expected as 2-hop templates are syntactically more complicated; (2) Possibly due to the same reason, OptiPrompt does not benefit from manual-template initialization for TREx-2p.

Finally, we introduce two control baselines for BitFit: (1) only the length-|V| bias vector in the output layer is finetuned; (2) only the bias vector in the final hidden layer is finetuned. Results are shown in Figure 2. We observe that both control baselines are outperformed by BitFit and OptiPrompt by a large margin. This shows that the performance gain does not come merely from biasing the model toward a certain group of output tokens; the inner representations are also changed to better expose the stored knowledge.

5 Related Works

How to effectively adapt a pretrained LM to a specific task in a few-shot setting has been an important topic in recent NLP research (Zhang et al., 2021). The idea of in-context learning was popularized by GPT-3 (Brown et al., 2020), which shows that a fixed pretrained LM can be primed to conduct different tasks via in-context examples. Recently, Zhao et al. (2021) point out some caveats about in-context learning and show how to better calibrate it.

Closely related to template-based knowledge probing, Schick and Schütze (2021) proposes Pattern Exploiting Training (PET) for NLU tasks, where inputs are converted into cloze-style questions (e.g., “Awful pizza! It was <mask>.”), and gradient-based optimization is conducted. PET and its variants iPET and ADAPET (Schick and Schütze, 2020; Tam et al., 2021) are shown to be more effective than vanilla in-context learning. Also along this line of work, Gao et al. (2020) propose an automatic framework of prompt generation and demonstration selection.

Last but not least, Li and Liang (2021) propose prefix tuning, followed by the closely related prompt tuning of Lester et al. (2021), where continuous task-specific input vectors are tuned while the model is kept fixed. These approaches are very similar to the OptiPrompt approach considered in this work.


  • E. Ben-Zaken, S. Ravfogel, and Y. Goldberg (2020) BitFit: simple parameter-efficient fine-tuning for transformer-based masked language-models. Cited by: Appendix A, §2.
  • T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei (2020) Language models are few-shot learners. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 1877–1901. External Links: Link Cited by: §2, §2, §5.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805. External Links: Link, 1810.04805 Cited by: §1.
  • H. Elsahar, P. Vougiouklis, A. Remaci, C. Gravier, J. Hare, F. Laforest, and E. Simperl (2018) T-REx: a large scale alignment of natural language with knowledge base triples. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. External Links: Link Cited by: §3.
  • T. Gao, A. Fisch, and D. Chen (2020) Making pre-trained language models better few-shot learners. CoRR abs/2012.15723. External Links: Link, 2012.15723 Cited by: §5.
  • A. Haviv, J. Berant, and A. Globerson (2021) BERTese: learning to speak to BERT. CoRR abs/2103.05327. External Links: Link, 2103.05327 Cited by: §2.
  • B. Heinzerling and K. Inui (2021) Language models as knowledge bases: on entity representations, storage capacity, and paraphrased queries. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Online, pp. 1772–1791. External Links: Link Cited by: §1.
  • Z. Jiang, F. F. Xu, J. Araki, and G. Neubig (2020) How can we know what language models know?. Transactions of the Association for Computational Linguistics 8, pp. 423–438. External Links: Link, Document Cited by: §2.
  • N. Kassner, P. Dufter, and H. Schütze (2021) Multilingual LAMA: investigating knowledge in multilingual pretrained language models. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Online, pp. 3250–3258. External Links: Link Cited by: §1.
  • N. Kassner and H. Schütze (2020) Negated and misprimed probes for pretrained language models: birds can talk, but cannot fly. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 7811–7818. External Links: Link, Document Cited by: §1.
  • B. Lester, R. Al-Rfou, and N. Constant (2021) The power of scale for parameter-efficient prompt tuning. CoRR abs/2104.08691. External Links: Link, 2104.08691 Cited by: §2, §5.
  • X. L. Li and P. Liang (2021) Prefix-tuning: optimizing continuous prompts for generation. CoRR abs/2101.00190. External Links: Link, 2101.00190 Cited by: §2, §5.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: A robustly optimized BERT pretraining approach. CoRR abs/1907.11692. External Links: Link, 1907.11692 Cited by: §1, §4.
  • I. Loshchilov and F. Hutter (2019) Decoupled weight decay regularization. In International Conference on Learning Representations, External Links: Link Cited by: Appendix A.
  • M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In Proc. of NAACL, Cited by: §1.
  • F. Petroni, T. Rocktäschel, P. Lewis, A. Bakhtin, Y. Wu, A. H. Miller, and S. Riedel (2019) Language models as knowledge bases?. External Links: 1909.01066 Cited by: §1, §2, §2, §3, §4.
  • N. Poerner, U. Waltinger, and H. Schütze (2020) E-BERT: efficient-yet-effective entity embeddings for BERT. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online, pp. 803–818. External Links: Link, Document Cited by: §1.
  • T. Schick and H. Schütze (2020) It’s not just size that matters: small language models are also few-shot learners. CoRR abs/2009.07118. External Links: Link, 2009.07118 Cited by: §5.
  • T. Schick and H. Schütze (2021) Exploiting cloze-questions for few-shot text classification and natural language inference. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Online, pp. 255–269. External Links: Link Cited by: §5.
  • T. Shin, Y. Razeghi, R. L. Logan IV, E. Wallace, and S. Singh (2020) AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 4222–4235. External Links: Link, Document Cited by: §2, §3.
  • K. Song, X. Tan, T. Qin, J. Lu, and T. Liu (2019) Mass: masked sequence to sequence pre-training for language generation. arXiv preprint arXiv:1905.02450. Cited by: §1.
  • D. Tam, R. R. Menon, M. Bansal, S. Srivastava, and C. Raffel (2021) Improving and simplifying pattern exploiting training. CoRR abs/2103.11955. External Links: Link, 2103.11955 Cited by: §5.
  • A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman (2018) GLUE: a multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Brussels, Belgium, pp. 353–355. External Links: Link, Document Cited by: §2.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, and A. Rush (2020) Transformers: state-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, pp. 38–45. External Links: Link, Document Cited by: §4.
  • W. Xiong, T. Hoang, and W. Y. Wang (2017) DeepPath: a reinforcement learning method for knowledge graph reasoning. CoRR abs/1707.06690. External Links: Link, 1707.06690 Cited by: §3.
  • Z. Yang, Z. Dai, Y. Yang, J. G. Carbonell, R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. CoRR abs/1906.08237. External Links: Link, 1906.08237 Cited by: §1.
  • Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. Cohen, R. Salakhutdinov, and C. D. Manning (2018) HotpotQA: a dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 2369–2380. External Links: Link, Document Cited by: §3.
  • T. Zhang, F. Wu, A. Katiyar, K. Q. Weinberger, and Y. Artzi (2021) Revisiting few-sample {bert} fine-tuning. In International Conference on Learning Representations, External Links: Link Cited by: Appendix A, §5.
  • T. Z. Zhao, E. Wallace, S. Feng, D. Klein, and S. Singh (2021) Calibrate before use: improving few-shot performance of language models. CoRR abs/2102.09690. External Links: Link, 2102.09690 Cited by: §5.
  • Z. Zhong, D. Friedman, and D. Chen (2021) Factual probing is [MASK]: learning vs. learning to recall. CoRR abs/2104.05240. External Links: Link, 2104.05240 Cited by: §2, §3, §4.


Appendix A Implementation Details

For model or relation-vector finetuning, we use the AdamW optimizer (Loshchilov and Hutter, 2019; Zhang et al., 2021). Since in most of our few-shot experiments the training data consists of only a small number (e.g., 10 or 20) of samples, we directly do full-batch training. We tune the learning rates on a log scale using D_r^dev. Typically, we find that a small learning rate (e.g., 1e-06) works well for full-model finetuning, while a relatively large learning rate (e.g., 0.01) works well for OptiPrompt or BitFit. For in-context learning, we try 20 different random orders (of the examples in the context) for each relation type, and use the ordering which gives the best performance on D_r^dev.

For the implementation of BitFit, we follow the variant in Ben-Zaken et al. (2020) in which around half of the bias vectors are tuned. To be specific, for each transformer layer, the bias for the attention query (of length 1024) and the bias for the intermediate layer (of length 4096) of the transformer block are tuned. We summarize and compare the storage cost of different approaches in Table 3.
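The per-relation storage numbers in Table 3 follow from simple arithmetic (a sanity-check sketch; the layer count and sizes are those of Roberta-large):

```python
layers, hidden, intermediate = 24, 1024, 4096

# BitFit variant: attention-query bias + intermediate bias, per layer.
bitfit_params = layers * (hidden + intermediate)

# OptiPrompt: five relation vectors of hidden size.
optiprompt_params = 5 * hidden

# Control baseline (2) from Section 4: one hidden-size bias vector.
final_hidden_offset = hidden
```

All of these are orders of magnitude smaller than storing a full ~355M-parameter model copy per relation type.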

Finally, we mention two differences between our implementation and the code from the LAMA benchmark: (1) The original code uses a common vocab, which is an intersection of the vocabularies of various pretrained LMs. Since this work focuses on the Roberta-large model, we simply use the whole Roberta vocab. If we switch to the common vocab, the zero-shot accuracy on the TREx-1p dataset improves from 25.8% to 31.9%. (2) In the original code, within each relation, if a subject has multiple valid objects, they are still treated as separate triples. As a consequence, for the accuracy metric it is impossible for the LM to get all triples right, because only the top-1 prediction is considered. In our implementation, we merge them into one test case with multiple valid targets, and report both accuracy and MRR.

Appendix B Auxiliary Results

Templates created for TREx-2p are shown in Table 4. These 2-hop relations are manually selected from the knowledge graph of TREx-1p.

In Table 5, the MRR performance of different approaches is shown. The observations are similar to the accuracy results (Table 2).

In Figure 3, accuracy results with more available samples are shown. We observe that the performance saturates at around 200 samples, at an accuracy of around 57.5%. When the number of available samples is 100 or more, we use 20 samples for development.

In Table 6, a complete set of results of model finetuning/BitFit for the TREx-1/2p datasets is shown. OptiPrompt and the model (or the bias vectors in the model) can be jointly trained for extra flexibility in the input. In some cases the performance is improved, but the gain is not large.

Approach Param. Number
In-context Learning 0
OptiPrompt 5 × 1024
BitFit 24 × 5120
Model Finetuning ≈355M
FinalHiddenOffset 1024
VocabOffset |V| = 50325
Table 3: The number of extra parameters to be saved for each relation type.
Figure 3: (TREx-1p) Accuracy results with more available samples for each relation; the performance saturates at around 200 samples.
P159 The headquarter of [X] (Virtue Party) is in [Y] (Ankara) . | P1376 [X] (Ankara) is the capital of [Y] (Turkey) .
The headquarter of [X] (Virtue Party) is in the country of [Y] (Turkey) .
P108 [X] (Steve Jobs) works for [Y] (Apple) . | P178 [X] (macOS) is developed by [Y] (Apple) .
[X] (Steve Jobs) works for a company that developed [Y] (macOS) .
P178 [X] (macOS) is developed by [Y] (Apple) . | P178 [X] (MessagePad) is developed by [Y] (Apple) .
[X] (macOS) and [Y] (MessagePad) are developed by the same company .
P31 [X] (Wick Airport) is a [Y] (airport) . | P361 [X] (runway) is part of [Y] (airport) .
One component of [X] (Wick Airport) is [Y] (runway) .
P361 [X] (geometry) is part of [Y] (mathematics) . | P361 [X] (arithmetic) is part of [Y] (mathematics) .
[X] (geometry) and [Y] (arithmetic) are part of the same thing .
P361 [X] (whey) is part of [Y] (milk) . | P527 [X] (yogurt) consists of [Y] (milk) .
[X] (whey) is a low-level part of [Y] (yogurt) .
P527 [X] (gelato) consists of [Y] (milk) . | P527 [X] (yogurt) consists of [Y] (milk) .
[X] (gelato) and [Y] (yogurt) share at least one element .
P37 The official language of [X] (Scotland) is [Y] (English) . | P19 [X] (Paul Mounsey) was born in [Y] (Scotland) .
The official language of the country where [X] (Paul Mounsey) was born is [Y] (English) .
Table 4: Examples of the TREx-2p dataset, and the manual templates we created. The relation ids from the original T-REx dataset are also shown.
MRR Prompt Engineering In-context Learning Model Finetuning BitFit
TREx-1p manT mineT optiP optiP+manT manT defT mineT manT defT manT defT optiP+manT
5T+5D 0-shot: .340 .436 .487 .572 .568 .559 .572 .576 .538 .577 .538 .580
10T+10D .450 .559 .577 .583 .596 .596 .594 .580 .600 .574 .600
30T+10D .458 .603 .608 .583 .603 .595 .626 .616 .625 .617 .625
TREx-2p manT / optiP optiP+manT manT defT / manT defT manT defT optiP
5T+5D 0-shot: .166 / 35.8 35.0 44.9 42.7 / 39.3 40.0 38.0 38.9 43.1
10T+10D / 42.6 42.3 41.3 41.2 / 44.6 43.4 46.5 44.3 43.3
30T+10D / 44.4 43.7 44.2 44.5 / 45.8 45.9 46.3 44.8 46.6
Table 5: The MRR performance of different approaches for the TREx-1/2p datasets. The leading zeros are omitted. The observations are similar to the accuracy results.
Accuracy(%) Model Finetuning BitFit
TREx-1p manT defT mineT optiT optiT+manT manT defT mineT optiT optiP+manT
5T+5D 49.1 44.8 48.7 42.6 49.0 49.2 45.4 48.8 44.3 49.8
10T+10D 51.3 49.4 51.1 49.1 51.2 52.4 48.9 51.1 47.7 52.2
30T+10D 54.1 53.2 54.1 53.3 54.2 54.5 53.3 54.5 53.1 54.0
TREx-2p manT defT mineT optiT optiT+manT manT defT mineT optiT optiP+manT
5T+5D 45.6 48.1 / 44.3 46.4 44.6 46.9 / 48.0 45.3
10T+10D 50.1 48.9 / 50.0 48.8 51.4 51.5 / 50.1 50.6
30T+10D 53.5 54.2 / 52.5 53.6 53.6 53.5 / 55.7 53.6
MRR Model Finetuning BitFit
TREx-1p manT defT mineT optiT optiT+manT manT defT mineT optiT optiP+manT
5T+5D .576 .538 .569 .514 .575 .577 .538 .570 .532 .580
10T+10D .594 .580 .590 .570 .594 .600 .574 .590 .562 .600
30T+10D .626 .616 .623 .613 .627 .625 .617 .626 .612 .625
TREx-2p manT defT mineT optiT optiT+manT manT defT mineT optiT optiP+manT
5T+5D 39.3 40.0 / 37.3 39.6 38.0 38.9 / 43.1 39.3
10T+10D 44.6 43.4 / 42.8 43.2 46.5 44.3 / 43.3 46.0
30T+10D 45.8 45.9 / 44.7 45.6 46.3 44.8 / 46.6 45.9
Table 6: A complete set of results of model finetuning and BitFit for the TREx-1/2p datasets.