How to Query Language Models?

08/04/2021 · by Leonard Adolphs, et al.

Large pre-trained language models (LMs) are capable of recovering not only linguistic but also factual and commonsense knowledge. To access the knowledge stored in mask-based LMs, we can use cloze-style questions and let the model fill in the blank. This flexibility advantage over structured knowledge bases comes with the drawback of finding the right query for a certain information need. Inspired by how humans disambiguate a question, we propose to query LMs by example. To clarify the ambivalent question "Who does Neuer play for?", a successful strategy is to demonstrate the relation using another subject, e.g., "Ronaldo plays for Portugal. Who does Neuer play for?". We apply this approach of querying by example to the LAMA probe and obtain substantial improvements of up to 37.8% with only 10 demonstrations, even outperforming a baseline that queries the model with up to 40 paraphrases of the question. The examples are provided through the model's context and thus require neither fine-tuning nor an additional forward pass. This suggests that LMs contain more factual and commonsense knowledge than previously assumed, if we query the model in the right way.




1 Introduction

Language Models (LMs) are omnipresent in modern NLP systems. In just a few years, they have been established as the standard feature extractor for many different language understanding tasks (karpukhin2020dense; zhang2020retrospective; wang2019structbert; he2020deberta). Typically, they are used to create a latent representation of natural language input and then fine-tuned to the task at hand. However, recent work (lama; whatlmknow; gpt3; 2020arXiv200208910R) has shown that off-the-shelf language models capture not only linguistic features but also large amounts of relational knowledge, without requiring any form of re-training.

Figure 1: BERT’s top-3 predictions with probabilities when prompted with the cloze-style question (top) versus when prompted with one additional example of the same relation (bottom).

The LAMA probe by lama was designed to quantify the amount of relational knowledge present in (mask-based) language models. While the task of predicting the right object for a subject-relation tuple remains the same as for a standard knowledge base (KB) completion query, the input is structured as a cloze-style sentence. For example, a KB completion query of the form (Dante, born-in, X) becomes "Dante was born in [MASK].". lama show that BERT (devlin2019bert) performs on par with competitive specialized models on factual and commonsense knowledge. The performance on this task can only be seen as a lower bound on the actual knowledge present in language models, as the choice of natural language template for a given relation might be suboptimal (lama; whatlmknow). The more general question here is "How to query an LM for a specific information need?". whatlmknow propose to use multiple paraphrases of the probe and then aggregate the solutions. petroni2020context, on the other hand, add relevant context. Both approaches can be linked to common human behavior: in human dialog, a question can be made more precise both by paraphrasing and by adding context information. Since language models are trained on large amounts of human-generated data, the intuition of phrasing the information need most naturally seems obvious. Humans excel at pattern recognition and pattern continuation for many different modes of representation (doi:10.1007/BF02833890). Concepts embedded in language are no exception. Therefore, another common way to probe a human’s knowledge is to provide examples and ask them to transfer the demonstrated relation to a new subject. For example, asking "Who does Neuer play for?" is ambiguous, as both Bayern Munich and Germany would be correct answers. However, when contextualizing the question with an example, the answer is clear: "I know Ronaldo plays for Portugal. Who does Neuer play for?".

In this work, we apply the concept of querying by example to probe language models. In addition to the cloze-style question, we provide further examples of the same relation in the model’s input. The previous example’s input then becomes "Ronaldo plays for Portugal. Neuer plays for [MASK].". We show that by providing only a few demonstrations, standard language models’ prediction performance improves drastically; so much so that, for the T-REx dataset, it becomes an even more powerful technique for retrieving knowledge than using an ensemble of up to 40 different paraphrases (whatlmknow), while requiring only a single forward pass instead of 40.

2 Related Work

Language Model Probes

lama started to investigate how much factual and commonsense knowledge LMs possess. They released the LAMA probe, a dataset consisting of T-REx (trex), Google-RE, ConceptNet (speer2018conceptnet), and SQuAD (DBLP:journals/corr/RajpurkarZLL16). Each dataset is transformed into a collection of (subject, relation, object) triplets and pruned to only contain single-token objects present in BERT’s vocabulary. Additionally, they provide natural language templates for each relation. Their investigation reveals that BERT-large has remarkable capabilities in recalling factual knowledge, competitive with supervised baseline systems.
Since there is usually more than one way to express a relation, the LAMA probe score can only be regarded as a lower bound (lama; whatlmknow). To tighten this lower bound, whatlmknow propose an automatic discovery mechanism for paraphrases together with an aggregation scheme. By querying the LM with a diverse set of prompts, they significantly improve the LAMA probe’s baseline numbers for BERT models. However, this approach incurs the cost of additional queries to the LM, an optimization procedure to aggregate the results, and the extraction of the paraphrases themselves.
Machine reading comprehension (MRC) and open-domain question answering (QA) are fields in NLP dominated by large pre-trained LMs. Here, the premise typically is that the model extracts the answer from the provided context rather than having it stored in its parameters (a notable exception is the work of 2020arXiv200208910R, which uses a T5 model without any access to an additional knowledge base). petroni2020context extend this line of thought to retrieve factual knowledge from LMs by providing relevant context, but without fine-tuning the model. Their experiments show that providing relevant passages significantly improves the scores on the LAMA probe for BERT models.

Few-Shot Learning

The term few-shot learning refers to the practice of providing only a few examples when training a model, compared to the typical approach of using large datasets (wang2020generalizing). In the NLP domain, recent work by gpt3 suggests providing these few examples only in the context, as opposed to actually training on them. Fittingly, they call this approach in-context learning: they condition the model on a natural language description of the task together with a few demonstrations. Their experiments reveal that the larger the model, the better its in-context learning capabilities. Our approach is very similar to in-context learning, with the difference that we do not provide a description of the task and we utilize natural language templates for the relations. The motivation is that this should closely resemble the human behavior of providing examples of a relation: instead of listing subjects and objects and letting the other person figure out the relation, a human typically provides the subjects and objects embedded in the relation template. Moreover, we understand our approach not as a learning method, but rather as a querying technique that disambiguates the information need.
smalllmfewshot argue that small LMs can be effective for few-shot learning too. However, they approach the problem of limited examples differently: instead of providing them as conditioning in the input, they actually train on them. By embedding the data into relation templates, they obtain training data that is closer in style to the pre-training data and can thus learn from fewer samples. gao2020making take this concept further and automate the template generation. Additionally, they find that, when fine-tuning with few samples, providing good demonstrations in the context improves the model’s performance.

3 Background

3.1 Language Models for cloze-style QA

In this work, we probe mask-based language models for their relational knowledge. The considered facts are triplets (s, r, o) consisting of a subject s, a relation r, and an object o. Language models are trained to predict the most probable word given the (surrounding) context. Hence, to test a model’s factual knowledge, we feed it natural text with the object masked out. This requires a mapping from the relation r to a natural language prompt t_r with placeholders for subject and object, e.g., the relation r = age becomes t_r = "[s] is [o] years old". When probing for a single (s, r, o)-triplet, the input to the language model is the natural language prompt t_r of the relation with the subject s filled in. The model outputs a likelihood score for each token in its vocabulary, which we use to construct a top-k prediction set for the object:

    top-k(s, r) = the k vocabulary tokens with the highest likelihood for the masked position.

The language model succeeds for the triplet (s, r, o) at k if o ∈ top-k(s, r). For example, we say that it knows the fact (Tiger Woods, age, 45) at k = 3 if, for the query "Tiger Woods is [MASK] years old", it ranks the token "45" within the top 3 of the vocabulary.
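As a toy sketch of this success criterion (the score table below is hypothetical, not real BERT output), the top-k check amounts to:

```python
def succeeds_at_k(token_scores, gold, k):
    """Return True if the gold object token is among the model's top-k tokens.

    token_scores: dict mapping vocabulary token -> likelihood score
    (here a hypothetical toy score table, not real BERT output).
    """
    top_k = sorted(token_scores, key=token_scores.get, reverse=True)[:k]
    return gold in top_k

# Toy version of the "Tiger Woods is [MASK] years old" example from the text:
scores = {"45": 0.21, "40": 0.30, "50": 0.25, "young": 0.05}
print(succeeds_at_k(scores, "45", 3))  # True: "45" ranks third here
```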

3.2 Datasets

Corpus       Relation       #Facts   #Relations
Google-RE    birth-place      2937        1
             birth-date       1825        1
             death-place       765        1
             Total            5527        3
T-REx        1-1               937        2
             N-1             20006       23
             N-M             13096       16
             Total           34039       41
ConceptNet   Total           11458       16
Table 1: Statistics for the corpora of the LAMA data.

We use the LAMA probe (lama) in our experiments. It is a collection of factual and commonsense examples provided as (s, r, o)-triplets with single-token objects. (We do not consider the SQuAD subset of the probe, as it has no clear notion of a relation.) Moreover, the probe provides human-generated templates for each relation. The statistics for the three considered corpora, T-REx (trex), Google-RE, and ConceptNet (speer2018conceptnet), are provided in Table 1.

3.3 Models

We investigate the usefulness of querying by example for three individual language models: BERT-base, BERT-large (devlin2019bert), and ALBERT-xxlarge (lan2020albert). These models are currently among the most frequently used language models (according to usage statistics for masked language models on the model hub). For both BERT models, we consider the cased variant unless explicitly noted otherwise.

4 Method

Our proposed method for querying relational knowledge from LMs is simple yet effective. When we construct the query for a triplet (s, r, o), we provide the model with additional samples of the same relation r. These additional examples are converted to their natural language equivalent using the template t_r and prepended to the cloze-style sentence representation of (s, r). The intuition is that the non-masked examples show the model how to fill in the gap for the relation at hand. As can be seen in Figure 1, providing a single example with the same structure clarifies the requested object for both humans and BERT. This is particularly useful when the template does not capture the desired relation between subject and object unambiguously, which is likely to be the case for many relations expressed in natural language. In this sense, querying by example tries to solve the same problem as paraphrasing: a query is paraphrased multiple times to align the model’s understanding of the query with the actual information need. When we provide additional examples, we do the same by showing the model how the relation applies to other instances and asking it to generalize. Of course, the model does not reason in this exact way; rather, through its training data, it is biased towards completing patterns, as this is ubiquitous behavior in human writing.
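The query construction can be sketched in a few lines; the function names and the exact whitespace handling are our choices, not taken from the paper's codebase:

```python
def fill_template(template, subject, obj="[MASK]"):
    """Instantiate a relation template such as '[s] plays for [o].'"""
    return template.replace("[s]", subject).replace("[o]", obj)

def query_by_example(template, subject, demonstrations):
    """Prepend completed demonstrations of the same relation to the cloze query.

    demonstrations: list of (subject, object) pairs sharing the relation.
    """
    context = " ".join(fill_template(template, s, o) for s, o in demonstrations)
    cloze = fill_template(template, subject)  # the object stays masked
    return (context + " " + cloze).strip()

print(query_by_example("[s] plays for [o].", "Neuer", [("Ronaldo", "Portugal")]))
# Ronaldo plays for Portugal. Neuer plays for [MASK].
```

With an empty demonstration list, the function degenerates to the plain cloze-style query, which makes the baseline and the querying-by-example setting directly comparable.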

Query                                 Predictions
No Example
  Rodmarton is a [MASK].              farmer (3.9%), businessman (2.5%)
Random Example
  M.S.I. Airport is a airport.
  Rodmarton is a [MASK].              town (16.9%), village (14.7%)
Close Example
  Nantmor is a village.
  Rodmarton is a [MASK].              village (75.5%), hamlet (16.0%)
Arrow Operator
  Totopara → village
  The argument → album
  Tisza → river
  Rodmarton → [MASK]                  village (21.4%), town (8.7%)
Table 2: Example queries with predictions (from BERT-large) for the different querying methods. The correct answer is village (Rodmarton is a village in South West England).

Since we only adjust the context fed to the model, we do not incur the cost of additional forward passes. When paraphrasing, on the other hand, each individual template requires another query to the model. Moreover, our approach does not require any learning, i.e., backward passes, and hence is very different from the classic fine-tuning approach and pattern-exploiting training (schick2020exploiting; smalllmfewshot).

In Table 2, we compare different approaches to querying by example. The left column shows the input to the model, i.e., the query. The right column shows BERT-large’s top-2 predictions with the corresponding probabilities (obtained by applying a softmax over the logit outputs for the token vocabulary). The first row of the table shows that completing the is-a relation for the village Rodmarton is tricky for the model. Its top predictions are not even close to the correct answer, suggesting that BERT either does not know about this particular village or that the information need is not specified well enough. Interestingly, when prepending the query with another random example of the same relation (second row), the model’s top predictions are town and the ground truth village. This shows that BERT knows what type of instance Rodmarton is; only the extraction method (the cloze-style template) was not expressive enough.

Close Examples

When humans use examples, they typically do not pick a completely random subject but one that is, by some measure, close to the subject at hand. In our introductory example, we used Ronaldo to exemplify an information need about Neuer. It would have been unnatural to use a musician here, even when describing a formally correct plays-for relation with them. We extend our approach by only using examples whose subject is close in latent space to the subject being queried. As the similarity measure, we use the cosine similarity between subject encodings obtained with BERT-base. More formally, we encode a subject s as

    e(s) = BERT_CLS(s; θ),

with BERT_CLS(s; θ) being the BERT encoding of the [CLS] token for the input s, and θ being the BERT model’s parameters. We then obtain the n most similar subjects to s in the dataset by maximizing the cosine similarity, i.e.,

    S_n(s) = the n subjects s' ≠ s with the highest cos(e(s), e(s')).

From the set S_n(s) of most similar subjects, we randomly sample to obtain our priming examples. Table 2 (third row) shows the chosen close example for Rodmarton, which is Nantmor, another small village in the UK. Provided with this particular example, BERT-large predicts the ground-truth label village with more than 75% probability.
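The close-example selection can be sketched as follows, assuming the subject embeddings have been precomputed; the toy 2-d vectors here stand in for the BERT-base [CLS] encodings described above:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def closest_subjects(query_subject, embeddings, n):
    """Return the n subjects whose embeddings are most cosine-similar to the
    query subject's embedding (toy 2-d vectors, not real BERT encodings)."""
    others = [s for s in embeddings if s != query_subject]
    others.sort(key=lambda s: cosine(embeddings[query_subject], embeddings[s]),
                reverse=True)
    return others[:n]

emb = {"Rodmarton": [1.0, 0.1], "Nantmor": [0.9, 0.2], "Madonna": [0.0, 1.0]}
print(closest_subjects("Rodmarton", emb, 1))  # ['Nantmor']
```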

Arrow Operator

gpt3 propose to use LMs as in-context learners. They suggest providing "training" examples in the model’s context using the arrow operator, i.e., to express an (s, r, o) triplet they provide the model with "s → o". We can apply this concept to the LAMA data by using the same template "[s] → [o]". In Table 2 (last row), we see that by providing a few examples of the is-a relation, BERT-large can rank the ground truth highest even though the relation is never explicitly described in natural language. However, not using a natural language template makes the model less confident in its prediction, as can be seen from the lower probability mass it puts on the target.
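A minimal sketch of the arrow-operator query from Table 2; the newline separator between demonstrations is our assumption, as only the "[s] → [o]" pattern itself is specified (we use an ASCII arrow here):

```python
def arrow_query(demonstrations, subject, arrow="->", mask="[MASK]"):
    """Express (subject, object) demonstrations with the arrow operator and
    mask the final object. Separator and arrow glyph are our assumptions."""
    lines = [f"{s} {arrow} {o}" for s, o in demonstrations]
    lines.append(f"{subject} {arrow} {mask}")
    return "\n".join(lines)

print(arrow_query([("Totopara", "village"), ("Tisza", "river")], "Rodmarton"))
```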

5 Results

We focus the reporting of the results on the mean precision at k (P@k) metric. In line with previous work (lama; petroni2020context; whatlmknow), we compute the results per relation and then average across all relations of the dataset. (The P@1 score corresponds to whatlmknow’s micro-averaged accuracy.) More formally, for a dataset D that consists of m relations r_1, …, r_m, where each relation r_i has a set of datapoints D_i of (s, o) pairs, we compute the P@k score as

    P@k = (1/m) Σ_i (1/|D_i|) Σ_{(s,o) ∈ D_i} 1[o ∈ top-k(s, r_i)],

where 1[·] denotes the indicator function that is 1 if the ground-truth object is in the top-k prediction set for the input and 0 otherwise.
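A sketch of this macro-averaged metric; the `predict` interface standing in for the LM's ranked output is hypothetical:

```python
def precision_at_k(predict, dataset, k):
    """Macro-averaged P@k: the mean over relations of each relation's hit rate.

    dataset: dict relation -> list of (subject, gold_object) facts
    predict: function (subject, relation) -> ranked list of candidate objects
    (a hypothetical interface standing in for the LM's ranked vocabulary).
    """
    per_relation = []
    for relation, facts in dataset.items():
        hits = sum(gold in predict(s, relation)[:k] for s, gold in facts)
        per_relation.append(hits / len(facts))
    return sum(per_relation) / len(per_relation)

# Toy check: one relation answered half right, one fully right -> P@1 = 0.75.
data = {"plays-for": [("Neuer", "Germany"), ("Ronaldo", "Portugal")],
        "capital-of": [("France", "Paris")]}
ranked = {"Neuer": ["Germany", "Bayern"], "Ronaldo": ["Brazil"],
          "France": ["Paris"]}
print(precision_at_k(lambda s, r: ranked[s], data, 1))  # 0.75
```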

                          | Baselines                     | Querying by example (LM)
Corpus      Relation      | Bb    Bl    Al    Bb*   Bl*   | Bb^1  Bb^10 Bb^10ce Bl^1  Bl^10 Bl^10ce Al^10ce
Google-RE   birth-place   | 14.9  16.1   6.3   -     -    | 10.5  13.2  11.7    8.9   11.5  11.0    7.0
            birth-date    |  1.6   1.5   1.5   -     -    |  1.1   1.1   1.2    1.4    1.4   1.5    1.4
            death-place   | 13.1  14.0   2.0   -     -    |  9.2  11.8  10.4    7.2    9.1   8.5    5.0
            Total         |  9.9  10.5   3.3  10.4  11.3  |  6.9   8.7   7.8    5.8    7.4   7.0    4.5
T-REx       1-1           | 68.0  74.5  71.2   -     -    | 59.7  62.0  62.6   66.4   67.6  68.7   69.0
            N-1           | 32.4  34.2  24.9   -     -    | 32.3  37.9  41.7   38.8   44.8  47.9   45.0
            N-M           | 24.7  24.8  17.2   -     -    | 27.9  31.3  34.8   31.4   35.0  37.2   33.5
            Total         | 31.1  32.5  24.2  39.6  43.9  | 31.9  36.5  40.0   37.3   42.1  44.8   41.7
ConceptNet  Total         | 15.9  19.5  21.2   -     -    | 15.2  16.2  17.1   19.6   21.2  22.0   26.5
Table 3: Mean precision at one (P@1) in percent across the different corpora of the LAMA probe. The baseline models are BERT-base (Bb), BERT-large (Bl), and Albert-xxlarge-v2 (Al), together with the best paraphrase-optimized versions of BERT-base and BERT-large by whatlmknow (Bb* and Bl*; these involve one query to the model per paraphrase). The right section shows the results for the different querying-by-example approaches, where the superscript denotes the number of examples used and the subscript ce denotes that only close examples have been used. Since the choice of examples alters the predictions of the model and thus introduces randomness, we provide the standard deviation measured over 10 evaluations.

Table 3 shows the P@1 scores of different models and querying approaches across the LAMA probe’s corpora. While providing additional examples proves detrimental for the Google-RE data, we see massive prediction performance gains for T-REx and ConceptNet. Most notably, the P@1 score of BERT-large on T-REx increases by 37.8% relative (from 32.5% to 44.8%) when providing 10 close examples. Similarly, the lower bound on Albert’s performance for T-REx (ConceptNet) can be improved by up to 72.3% (25.0%) with 10 close examples.


For the Google-RE subset of the data, querying by example hurts the predictive capabilities of LMs. In the following, we provide an intuition for why we think this is the case. Looking at the baseline numbers of the individual relations for this data, we see that the performance is largely driven by predicting a person’s birth and death place; the birth-date relation does not play a significant role because BERT is incapable of accurately predicting numbers (i.e., dates) (lin2020birds; wallace2019nlp). BERT-large predicts a person’s birth and death place correctly only 16.1% and 14.0% of the time, respectively, significantly lower than its 32.5% P@1 score across the relations of the T-REx data. Recent work describes that BERT has a bias to predict that a person with, e.g., an Italian-sounding name is Italian (bertology; poerner2020ebert). We suspect that this bias helps BERT predict birth and death places without knowing the actual person, and the corpus is therefore not an adequate test for probing an LM’s factual knowledge. As a consequence, the predictions it makes are more prone to errors when influenced by previous examples.


Figure 2 depicts the mean precision at 1 on the T-REx corpus for a varying number of provided examples. It shows that even a few additional examples can significantly improve the performance of the LMs. However, the usefulness of additional examples saturates at around 10 examples. Interestingly, with 10 examples, BERT-large even slightly improves upon the optimized paraphrase baseline from whatlmknow, while requiring only a single forward pass.
Table 4 shows the improvement in P@1 score for the individual relations that benefit most (and least) from additional examples for BERT-large. The relations for which demonstrations improve the performance the most typically have one thing in common: they are ambiguous. Prototypically ambiguous relations like located-in or is-a are among the top-benefiting relations. One rather atypical improvement candidate is the top-scoring relation, religion-affiliation. Suspiciously, this is also the relation most improved by the paraphrasing of whatlmknow. A closer look at the examples reveals the cause: the target object labels for the religions are provided as nouns (e.g., Christianity, Islam), while the template ([s] is affiliated with the [o] religion) suggests using the religion as an adjective (e.g., Christian, Islamic). Hence, both paraphrasing the sentence so that a noun is clearly expected and providing example sentences that complete the template with nouns alleviate this problem. The relations that benefit the least from demonstrations are unambiguous ones, like capital-of or developed-by.

Figure 2: P@1 score for TREx over the number of examples provided. The dashed line shows the baseline value for when no additional example is given.
ID      Template                                    ΔP@1 (n=1)  (n=3)  (n=5)
P140    [s] is affiliated with the [o] religion.        51.0    67.4   70.0
P30     [s] is located in [o].                          47.8    55.3   55.8
P136    [s] plays [o] music.                            12.8    44.0   54.5
P31     [s] is a [o].                                    8.2    20.3   24.4
P178    [s] is developed by [o].                        -8.3    -4.2   -6.8
P1376   [s] is the capital of [o].                     -16.3    -8.2   -8.6
Table 4: Relations of T-REx that benefit the most (least) from additional examples. The right columns give the improvement in precision at 1 when {1, 3, 5} examples are provided for BERT-large.


While T-REx probes for factual knowledge, the ConceptNet corpus is concerned with commonsense relations. The improvements from querying by example are significant, with 12%, 7.5%, and 25% relative improvement for BERT-base, BERT-large, and Albert-xxlarge, respectively.

More detailed plots for all the corpora and several metrics are provided in Appendix A.4.

5.1 The Change of Embedding

To further investigate the disambiguation effect of additional examples, we take a look at the latent space. In particular, we are interested in how the clusters of particular relations, formed by the queries’ embeddings, change when additional examples are added to the context. Figure 3 visualizes BERT-large’s [CLS]-token embedding for queries from the T-REx corpus, using t-SNE (JMLR:v9:vandermaaten08a). The individual colors represent the relations of the queries. The first two images depict the clustering when using the natural language template without additional demonstrations (left) and with ten demonstrations (right). The fact that the clusters become better separated is visual evidence that providing examples disambiguates the information need expressed by the queries. The two plots on the right show the clustering when, instead of a natural language template, the subject and object are only separated by the arrow operator "→". Here, we see an even more significant change in separability when providing additional demonstrations, as the underlying information need is more ambiguous.

Figure 3: BERT-large’s [CLS]-token embedding of a subset of T-REx queries visualized in two dimensions using t-SNE (JMLR:v9:vandermaaten08a). Each point is a single query and the color represents the corresponding relation class. The ellipses depict the 2-std confidence intervals. The individual images show the clustering for both the natural language and the ([s]; [o]) template with either no examples or ten examples provided.

5.2 TextWorld Commonsense Evaluation

An emerging field of interest in the NLP community is text-based games (TBGs). In these games, an agent is placed inside an interactive text environment and tries to complete specified goals using only language commands. To succeed, the agent requires deep language understanding to decide which actions are reasonable in the scene and move it closer to its final goal. These environments are often modeled on real-world scenes to foster the commonsense-learning capabilities of an agent. The TextWorld Commonsense (TWC) game world by murugesan2020textbased focuses specifically on this aspect: the agent is placed in a typical modern-house environment and has to tidy up the room. This involves moving all the objects in the scene to their commonsense location, e.g., the dirty dishes belong in the dishwasher and not in the cupboard. murugesan2020textbased approach this problem by equipping the agent with access to a commonsense knowledge base. Replacing a traditional KB with an LM for this task is intriguing, as the LM stores relational knowledge implicitly and is capable of generalizing to similar objects. To test the feasibility of using LMs as a commonsense knowledge source in the TWC environment, we design the following experiment (details and pseudocode are provided in Appendix A.3): We use a static agent that picks up any misplaced object at random and puts it in one of the possible locations in the scene according to a prior distribution over locations. This prior is computed at the start of an episode for all object-location combinations in the scene, using an LM. We use the arrow operator as described in Table 2 and vary the number of examples provided. In Figure 4, we show the results for albert-xxlarge on the hard games of TWC, compared to a simple uniform prior over locations and murugesan2020textbased’s RL agent with access to a commonsense KB.
We see the same trend as in the LAMA experiments: providing additional examples of the same relation boosts performance significantly and saturates after 10-15 instances.
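One plausible realization of the prior computation described above, assuming the per-pair LM scores are turned into a distribution with a softmax (the paper only states that a prior over object-location pairs is computed from the LM):

```python
import math

def location_prior(score, obj, locations):
    """Turn LM compatibility scores into a prior p(location | object) via a
    softmax. The softmax normalization is our assumption; the scores below
    are toy values standing in for LM log-likelihoods."""
    exps = [math.exp(score(obj, loc)) for loc in locations]
    z = sum(exps)
    return dict(zip(locations, (e / z for e in exps)))

# Toy scores: the static agent would place the dirty dishes in the dishwasher.
toy = {("dirty dishes", "dishwasher"): 2.0, ("dirty dishes", "cupboard"): 0.1}
prior = location_prior(lambda o, l: toy[(o, l)], "dirty dishes",
                       ["dishwasher", "cupboard"])
print(max(prior, key=prior.get))  # dishwasher
```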

Figure 4: Normalized score for the hard games of the TWC environment over the number of examples provided for albert-xxlarge. The dashed baselines are the static agent with a uniform prior and the TWC commonsense agent by murugesan2020textbased. The shaded regions depict the standard deviation over 10 runs.

5.3 Word Analogy Evaluation

To evaluate the usefulness of querying pre-trained language models by example for linguistic knowledge, we move to the word analogy task, a standard benchmark for non-contextual word embeddings. This evaluation is based on the premise that a good global word embedding defines a latent space in which basic arithmetic operations correspond to linguistic relations (mikolov-etal-2013-linguistic). With the rise of contextual word embeddings and large pre-trained language models, this evaluation has lost significance. However, we approach this task from the angle of querying linguistic knowledge from an LM instead of performing arithmetic in latent space. By providing examples of the linguistic relation in a regular pattern in the context of the LM, we prime it to apply the relation to the final word, whose counterpart is masked out.
We consider the Bigger Analogy Test Set (BATS) (GladkovaDrozd2016) for our experiments. BATS consists of 40 different relations covering inflectional and derivational morphology, as well as lexicographic and encyclopedic semantics. Each relation consists of 50 unique word pairs. However, since most pre-trained LMs, including BERT and Albert, use subword-level tokens for their vocabulary, not all examples can be solved. In particular, 76.1% and 76.2% of the targets are contained in BERT’s and Albert’s vocabulary, respectively—upper bounding their P@1 performance.
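The vocabulary-coverage upper bound mentioned above can be computed directly; the word lists here are toy data, not BATS itself:

```python
def solvable_fraction(targets, vocab):
    """Fraction of analogy targets that are single tokens in the model's
    vocabulary; this upper-bounds P@1 for a mask-filling probe."""
    return sum(t in vocab for t in targets) / len(targets)

# Toy vocabulary and targets (hypothetical, for illustration only):
vocab = {"walked", "talked", "ran"}
print(solvable_fraction(["walked", "talked", "sprinted", "jogged"], vocab))  # 0.5
```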

Figure 5: P@1 score on BATS over the number of examples provided. The performance of the GloVe and SVD benchmark models by GladkovaDrozd2016 is shown with the black, dashed lines.

Figure 5 depicts the P@1 score (corresponding to GladkovaDrozd2016’s reported accuracy score) for the individual LMs on BATS. Notably, on this task too, the LMs benefit from additional examples up to a certain threshold at which the usefulness stagnates. Both BERT models do not beat GladkovaDrozd2016’s GloVe (pennington-etal-2014-glove) benchmark. This is in part because not all targets are present in the token vocabulary. Considering only the solvable word pairs, BERT-large achieves a P@1 score of 30.6% with 15 examples, beating the GloVe baseline’s 28.5%. Interestingly, Albert-xxlarge outperforms all other models, including the baselines, by a large margin. Figure 7 in Appendix A.4 breaks down the LMs’ performance across the different relations of BATS and compares it against the GloVe baseline. Albert beats GloVe on almost all relations where its vocabulary does not limit it; the most significant improvements are in the derivational morphology and lexicographic semantics categories. It is outperformed by GloVe on only two relations: country:capital and UK city:county. The former category is very prominent and constituted 56.7% of all semantic questions in the original Google test set (mikolov2013efficient), potentially influencing the design and tuning of non-contextual word embeddings.

6 Discussion

Augmenting the context of LMs with demonstrations is a very successful strategy to disambiguate the query. Notably, on T-REx it is as successful as using an ensemble of multiple paraphrases. The benefit of additional examples decreases when the information need is clear to the model; this is the case for unambiguous prompts or when enough (around 10) demonstrations are provided. Even in the extreme case of ambiguity, for example, when the arrow operator ([s] → [o]) is used to indicate a relation, providing only a handful of examples clarifies the relation sufficiently in many cases. We showed that the marginal usefulness of additional demonstrations quickly vanishes. Hence, with access to more labeled data and the option to re-train the model, a fine-tuning strategy is still better suited to maximize performance on a given task. Moreover, casting NLP problems as language modeling tasks only works as long as the target is a single-token word in the LM’s vocabulary. While large generation-based LMs such as GPT (gpt3; gpt2) or T5 (T5) can technically generate longer sequences, it is not clear how to compare solutions of varying length.

7 Conclusion

In this work, we explored the effect of providing examples when probing LMs for relational knowledge. We showed that already a few demonstrations, supplied in the context of the LM, disambiguate the query to the same extent as using an optimized ensemble of multiple paraphrases. We base our findings on experimental results from the LAMA probe, the BATS word analogy test, and a TBG commonsense evaluation. On the T-REx corpus’ factual relations, providing 10 demonstrations improves BERT’s P@1 performance by 37.8%. Similarly, on ConceptNet’s commonsense relations, Albert’s performance improves by 25% with access to 10 examples. We conclude that providing demonstrations is a simple yet effective strategy to clarify ambiguous prompts to a language model.


Appendix A Appendices

a.1 Implementation Details

The source code to reproduce all the experiments is publicly available. All individual runs reported in the paper can be carried out on a single GPU (Tesla P100, 16 GB), though speedups can be realized when using multiple GPUs in parallel. The wall-clock runtime for the corpora of the LAMA probe is shown in Table 5.

All models used in this work are accessed via Hugging Face’s list of pre-trained models for PyTorch (DBLP:journals/corr/abs-1910-03771). Further details about these models are provided in the Hugging Face model documentation.

Corpus      Model              # Parameters  # Examples   Avg. Input Length  Runtime [s]
Google-RE   bert-base-cased    109M          0            5.5                12.8
            bert-base-cased                  10           60.3               36.1
            bert-base-cased                  10 (close)   60.1               39.6
            bert-large-cased   335M          0            5.5                20.5
            bert-large-cased                 10           60.3               85.5
            bert-large-cased                 10 (close)   60.1               99.7
            albert-xxlarge-v2  223M          0            5.5                85.4
            albert-xxlarge-v2                10           60.3               466.0
            albert-xxlarge-v2                10 (close)   60.1               544.9
T-REx       bert-base-cased    109M          0            7.6                72.6
            bert-base-cased                  10           83.2               239.0
            bert-base-cased                  10 (close)   82.7               234.1
            bert-large-cased   335M          0            7.6                119.3
            bert-large-cased                 10           83.2               747.5
            bert-large-cased                 10 (close)   82.7               596.5
            albert-xxlarge-v2  223M          0            7.6                504.1
            albert-xxlarge-v2                10           83.2               3227.4
            albert-xxlarge-v2                10 (close)   82.7               3340.9
ConceptNet  bert-base-cased    109M          0            9.4                38.5
            bert-base-cased                  10           102.8              121.9
            bert-base-cased                  10 (close)   104.5              124.6
            bert-large-cased   335M          0            9.4                80.4
            bert-large-cased                 10           102.8              311.4
            bert-large-cased                 10 (close)   104.5              324.3
            albert-xxlarge-v2  223M          0            9.4                408.0
            albert-xxlarge-v2                10           102.8              1760.8
            albert-xxlarge-v2                10 (close)   104.5              1853.6
Table 5: The runtime in seconds for one pass through the full data of the LAMA probe on a single TESLA P100 GPU with a batch size of 32. The "# Examples" column gives the number of demonstrations prepended to each query; "(close)" indicates that close examples as defined in Section 4 are used.

a.2 The Choice of Template

When providing examples, we give the model the chance to understand the relation for which we query without providing additional instructions. This naturally raises the question of whether natural language templates are even necessary to query LMs. Most prominently, the in-context learning of gpt3 shows that large LMs can complete patterns even when they are not provided in natural language. In particular, they use the "⇒"-operator to express the relation between input and output. In Figure 6, we compare the natural language cloze-style template against three different non-language templates: (i) [s] ⇒ [o], (ii) [s] → [o], (iii) ([s]; [o]). Surprisingly, gpt3’s "⇒"-operator performs the worst for BERT-large on T-REx, while separating the subject and object by a semicolon works best—almost on par with the performance of the natural language template after providing just a single example. This result underlines BERT’s remarkable pattern-matching capabilities and suggests that a natural language description of the relation is not always needed—even when querying relatively small LMs.

Figure 6: P@1 score for BERT-large on T-REx over the number of examples provided. Each line corresponds to one template determining how the examples are provided: (i) with the natural language templates from the LAMA probe (NL Template), (ii) separated by a semicolon (([s]; [o])), (iii) separated by a one-lined arrow ([s] → [o]), or (iv) separated by a double-lined arrow ([s] ⇒ [o]). The dashed line shows the baseline value for when no additional example is given.
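Such non-language templates are trivial to generate programmatically. A minimal sketch of our own (the `render_pairs` helper and the city-country pairs are illustrative, not from the paper's code), here using the best-performing semicolon variant:

```python
def render_pairs(template, demonstrations, subject, mask="[MASK]"):
    """Format (subject, object) demonstrations with a separator template
    and append the masked query for `subject`."""
    parts = [template.format(s=s, o=o) for s, o in demonstrations]
    parts.append(template.format(s=subject, o=mask))
    return " ".join(parts)

# The "([s]; [o])" template compared in Figure 6:
prompt = render_pairs("({s}; {o})",
                      [("Paris", "France"), ("Rome", "Italy")],
                      "Berlin")
# "(Paris; France) (Rome; Italy) (Berlin; [MASK])"
```

Swapping the template string reproduces the other separators from Figure 6 without any change to the surrounding code.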

a.3 Details of the TextWorld Commonsense Evaluation

Text-based games (TBGs) are computer games whose sole modality of interaction is text. Classic games like Zork (infocom) used to be played by a large fan base worldwide. Today, they provide interesting challenges for the research field of interactive NLP. The TextWorld framework by DBLP:journals/corr/abs-1806-11532 makes it possible to design custom TBGs, allowing one to adapt the objects, locations, and goals to the investigated research objectives. TBGs in this framework range from treasure hunting (DBLP:journals/corr/abs-1806-11532) to cooking recipes (adhikari2021learning; DBLP:journals/corr/abs-1909-01646), or, as in the experiment at hand, tidying up a room (murugesan2020textbased). murugesan2020textbased designed the TextWorld Commonsense environment TWC around the task of cleaning up a modern house environment to probe an agent's commonsense abilities. For example, a successful agent should understand that dirty dishes belong in the dishwasher, while clean dishes belong in the cupboard. murugesan2020textbased approach this problem by developing an agent that, through a graph-based network, has access to relevant facts from the ConceptNet (speer2018conceptnet) commonsense knowledge base. Here, the obvious downside of static KBs for commonsense knowledge extraction becomes apparent: they do not generalize to unlisted object-location pairs. Hence, slight deviations from typical entities require additional processing before the KB can be queried. A large pre-trained LM seems better suited for this task due to its querying flexibility and generalization capabilities. We test these abilities by designing a static agent, described in Algorithm 1, that has access to a large pre-trained LM.

Input: TWC game G with objects O and locations L, pre-trained language model LM
Function GetPrior(O, L):
       /* Determine a probability distribution over the locations L
          for each object in O using the language model LM. */
       forall objects o in O do
             /* Use demonstrations to build the context for LM, e.g.: */
             /*   milk → fridge */
             /*   dirty dishes → sink */
             /*   o → [MASK] */
             /* Compute the [MASK]-token probabilities for the locations in L using LM */
             p[o] ← probabilities of the locations in L
       end forall
       return p
prior ← GetPrior(O, L)
while G not finished & max steps not exhausted do
       if agent holds an object o then
             move o to the location l with the highest prior[o]
             if l is the correct location for o then
                   remove o from O
                   prior[o] ← 0
             end if
       end if
end while
Algorithm 1 LM-prior Agent
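The prior computation at the heart of Algorithm 1 restricts the masked-LM distribution to the game's candidate locations and renormalizes it. A minimal sketch under our own naming (in the real agent, `mask_token_probs` would come from the LM's output layer for the [MASK] position; the stub values below are made up):

```python
def location_prior(mask_token_probs, locations):
    """Restrict a masked LM's [MASK]-token distribution to the candidate
    locations of the game and renormalize it into a prior."""
    scores = {loc: mask_token_probs.get(loc, 0.0) for loc in locations}
    total = sum(scores.values()) or 1.0  # avoid division by zero
    return {loc: p / total for loc, p in scores.items()}

# Stub distribution standing in for real LM output on a context like
# "milk → fridge. dirty dishes → sink. clean plate → [MASK]":
probs = {"fridge": 0.30, "sink": 0.05, "cupboard": 0.15, "the": 0.20}
prior = location_prior(probs, ["fridge", "sink", "cupboard"])
# prior["fridge"] == 0.6
```

The agent then simply carries each object to the argmax location of its prior and zeroes out the entry once the object is placed correctly.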

a.4 Omitted Figures

[Plots omitted; panels: Google-RE, T-REx, ConceptNet]
Table 6: P@1 score for the different corpora of the LAMA probe over the number of examples provided. The dashed line shows the baseline values for when no additional example is given. The upper row depicts the scores for when the examples are chosen randomly among the same relation, while the lower row only considers examples from close subjects as defined in Section 4.
[Plots omitted; panels: Google-RE, T-REx, ConceptNet]
Table 7: Mean reciprocal rank (MRR) score for the different corpora of the LAMA probe over the number of examples provided. The dashed line shows the baseline values for when no additional example is given. The upper row depicts the scores for when the examples are chosen randomly among the same relation, while the lower row only considers examples from close subjects as defined in Section 4.
[Plots omitted; panels: Google-RE, T-REx, ConceptNet]
Table 8: Probability assigned to the ground-truth object for the different corpora of the LAMA probe over the number of examples provided. The dashed line shows the baseline values for when no additional example is given. The upper row depicts the scores for when the examples are chosen randomly among the same relation, while the lower row only considers examples from close subjects as defined in Section 4.
Figure 7: P@1 score on BATS for Albert-xxlarge with 10 examples that use the "([s]; [o])"-template. The x-axis breaks down the performance for the individual relations of the BATS dataset. As a benchmark, we use the GloVe model from GladkovaDrozd2016. The frame around each bar indicates the maximum possible score the Albert model could have achieved, since not all targets are tokens in its vocabulary.