Language Models (LM) are omnipresent in modern NLP systems. In just a few years, they have been established as the standard feature extractor for many different language understanding tasks (karpukhin2020dense; zhang2020retrospective; wang2019structbert; he2020deberta). Typically, they are used to create a latent representation of natural language input and then fine-tuned to the task at hand. However, recent work (lama; whatlmknow; gpt3; 2020arXiv200208910R) has shown that off-the-shelf language models capture not only linguistic features but also large amounts of relational knowledge, without requiring any form of re-training.
The LAMA probe by lama was designed to quantify the amount of relational knowledge present in (mask-based) language models. While the task of predicting the right object for a subject-relation tuple remains the same as for a standard knowledge base (KB) completion query, the input is structured in a cloze-style sentence. For example, a KB completion query of the form (Dante, born-in, X) becomes "Dante was born in [MASK].". lama show that BERT (devlin2019bert) performs on par with competitive specialized models on factual and commonsense knowledge. The performance on this task can only be seen as a lower bound to the actual knowledge present in language models as the choice of natural language template for a given relation might be suboptimal (lama; whatlmknow). The more general question here is "How to query an LM for a specific information need?". whatlmknow propose to use multiple paraphrases of the probe and then aggregate the solutions. petroni2020context, on the other hand, add relevant context. Both approaches can be linked to common human behavior. In human dialog, a question can be made more precise both by paraphrasing or adding additional context information. Since language models are trained on large amounts of human-generated data, the intuition of phrasing the information need most naturally
seems obvious. Humans excel at pattern recognition and pattern continuation for many different modes of representation (doi:10.1007/BF02833890). Concepts embedded in language are no exception. Therefore, another common way to probe a human's knowledge is to provide examples and ask them to transfer the demonstrated relation to a new object. For example, asking Who does Neuer play for? is ambiguous, as both Bayern Munich and Germany would be correct answers. However, when contextualizing the question with an example, the answer is clear: I know that Ronaldo plays for Portugal. Who does Neuer play for?.
In this work, we apply the concept of querying by example to probe language models. In addition to the cloze-style question, we provide further examples of the same relation in the model's input. The previous example's input then becomes "Ronaldo plays for Portugal. Neuer plays for [MASK].". We show that by providing only a few demonstrations, standard language models' prediction performance improves drastically. So much so that, on the T-REx dataset, it becomes an even more powerful technique to retrieve knowledge than using an ensemble of up to 40 different paraphrases (whatlmknow), while requiring only a single forward pass instead of 40.
2 Related Work
Language Model Probes
lama started to investigate how much factual and commonsense knowledge LMs possess. They released the LAMA probe, which is a dataset consisting of T-REx (trex), Google-RE, ConceptNet (speer2018conceptnet), and SQuAD (DBLP:journals/corr/RajpurkarZLL16). Each dataset is transformed into a collection of (subject, relation, object) triplets and pruned to contain only single-token objects present in BERT's vocabulary. Additionally, they provide natural language templates for each relation. Their investigation reveals that BERT-large has remarkable capabilities in recalling factual knowledge, competitive with supervised baseline systems.
Since there is usually more than one way to express a relation, the LAMA probe score can only be regarded as a lower bound (lama; whatlmknow). To tighten this lower bound, whatlmknow propose an automatic discovering mechanism for paraphrases together with an aggregation scheme. By querying the LM with a diverse set of prompts, they significantly improve the LAMA probe’s baseline numbers for BERT models. However, this approach incurs the cost of additional queries to the LM, an optimization procedure to aggregate the results, and the extraction of paraphrases.
Machine reading comprehension (MRC) and open-domain question answering (QA) are fields in NLP dominated by large pre-trained LMs. Here, the premise typically is that the model is capable of extracting the answer from the provided context, rather than having it stored in its parameters (with the notable exception of the work of 2020arXiv200208910R, which uses a T5 model without any access to an additional knowledge base). petroni2020context extend this line of thought to retrieve factual knowledge from LMs by providing relevant context but without fine-tuning the model. Their experiments show that providing relevant passages significantly improves the scores on the LAMA probe for BERT models.
The term few-shot learning refers to the practice of only providing a few examples when training a model, compared to the typical approach of using large datasets (wang2020generalizing). In the NLP domain, recent work by gpt3 suggests using these few examples only in the context, as opposed to actually training with them. Fittingly, they call this approach in-context learning. Here, they condition the model on a natural language description of the task together with a few demonstrations. Their experiments reveal that the larger the model, the better its in-context learning capabilities. Our approach is very similar to in-context learning, with the difference that we do not provide a description of the task and utilize natural language templates for the relations. The motivation is that this should closely resemble the human behavior of providing examples of a relation: instead of providing a list of subjects and objects and letting the other person figure out the relation, a human typically provides the subjects and objects embedded in the relation template. Moreover, we understand our approach not as a learning method, but rather as a querying technique that disambiguates the information need.
smalllmfewshot argue that small LMs can be effective for few-shot learning too. However, they approach the problem of limited examples differently; instead of providing them as conditioning in the input, they actually train with them. By embedding the data into relation templates, they obtain training data that is closer in style to the pre-training data and can thus learn with fewer samples. gao2020making take this concept even further and automate the template generation. Additionally, they also find that, when fine-tuning with few samples, providing good demonstrations in the context improves the model's performance.
3.1 Language Models for cloze-style QA
In this work, we probe mask-based language models for their relational knowledge. The considered facts are triplets (s, r, o) consisting of a subject s, a relation r, and an object o. Language models are trained to predict the most probable word given the (surrounding) context. Hence, to test a model's factual knowledge, we feed it natural text with the object masked out. This requires a mapping from the relation r to a natural language prompt t_r with placeholders for subject and object, e.g., the relation r = age becomes t_r = "[s] is [o] years old". When probing for a single (s, r, o)-triplet, the input to the language model is the natural language prompt of the relation r together with the subject s, i.e., t_r(s) with the object slot masked. The model outputs a likelihood score for each token in its vocabulary, which we use to construct the top-k prediction set V_k(s, r) for the object o, i.e., the k tokens with the highest likelihood.

The language model succeeds for the triplet (s, r, o) at k if o ∈ V_k(s, r). For example, we say that it knows the fact (Tiger Woods, age, 45) at k = 3 if, for the query "Tiger Woods is [MASK] years old", it ranks the token "45" within the top 3 of the vocabulary.
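This selection procedure can be sketched as follows (a minimal sketch in which a toy score dictionary stands in for the masked LM's vocabulary likelihoods; in practice the scores would come from a masked LM such as BERT, and the function names are illustrative, not from the original implementation):

```python
def top_k_prediction(scores, k):
    """Return the k highest-scoring vocabulary tokens."""
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    return [token for token, _ in ranked[:k]]

def knows_fact(scores, obj, k):
    """The LM 'knows' the triplet (s, r, o) at k if o is in the top-k set."""
    return obj in top_k_prediction(scores, k)

# Toy likelihood scores for the query "Tiger Woods is [MASK] years old."
scores = {"45": 0.30, "40": 0.25, "50": 0.20, "old": 0.05}
print(knows_fact(scores, "45", 3))  # → True
```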
We use the LAMA probe in our experiments (lama). It is a collection of factual and commonsense examples provided as (s, r, o)-triplets with single-token objects. (We do not consider the SQuAD dataset of the probe, as it has no clear notion of a relation.) Moreover, it provides a human-generated template for each relation r. The statistics of the three considered corpora T-REx (trex), Google-RE (https://github.com/google-research-datasets/relation-extraction-corpus), and ConceptNet (speer2018conceptnet) are provided in Table 1.
We investigate the usefulness of querying by example for three individual language models: BERT-base, BERT-large (devlin2019bert), and ALBERT-xxl (lan2020albert). These models are among the most frequently used language models these days (according to the statistics from https://huggingface.co/models?filter=pytorch,masked-lm). For both BERT models, we consider the cased variant, unless explicitly noted otherwise.
Our proposed method for querying relational knowledge from LMs is simple yet effective. When we construct the query for the triplet (s, r, o), we provide the model with additional samples of the same relation r. These additional examples are converted to their natural language equivalent using the template t_r and prepended to the cloze-style sentence representation of (s, r). The intuition is that the non-masked examples give the model an idea of how to fill in the gap for the relation at hand. As can be seen in Figure 1, providing a single example in the same structure clarifies the requested object for both humans and BERT. This is particularly useful when the template does not unambiguously capture the desired relation between subject and object, which is likely to be the case for many relations in natural language. In this sense, it tries to solve the same problem as paraphrasing: a query is paraphrased multiple times to align the model's understanding of the query with the actual information need. When we provide additional examples, we do the same by showing the model how the relation applies to other instances and asking it to generalize. Of course, the model does not reason in this exact way; rather, through its training data, it is biased towards completing patterns, as this is ubiquitous behavior in human writing.
(Table 2: BERT-large's top prediction and its probability for the is-a query about Rodmarton, a village in South West England, under different querying strategies.)
|Rodmarton is a [MASK].||farmer (3.9%)|
|M.S.I. Airport is a airport. Rodmarton is a [MASK].||town (16.9%)|
|Nantmor is a village. Rodmarton is a [MASK].||village (75.5%)|
|The Argument → album. … Rodmarton → [MASK].||village|
Since we only adjust the context fed to the model, we do not incur the cost of additional forward passes. When paraphrasing, on the other hand, each individual template requires another query to the model. Moreover, our approach does not require any learning, i.e., backward passes, and hence is very different from the classic fine-tuning approach and pattern-exploiting training (schick2020exploiting; smalllmfewshot).
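The construction of such a priming query can be sketched as follows (a minimal sketch; `fill` and `priming_query` are hypothetical helper names, not from the original implementation):

```python
def fill(template, subject, obj):
    """Instantiate a relation template such as '[s] plays for [o].'."""
    return template.replace("[s]", subject).replace("[o]", obj)

def priming_query(template, subject, examples):
    """Prepend filled-in demonstrations to the cloze-style query for `subject`."""
    demos = " ".join(fill(template, s, o) for s, o in examples)
    cloze = fill(template, subject, "[MASK]")
    return f"{demos} {cloze}".strip()

q = priming_query("[s] plays for [o].", "Neuer", [("Ronaldo", "Portugal")])
# → "Ronaldo plays for Portugal. Neuer plays for [MASK]."
```

With an empty example list, the function reduces to the plain cloze-style query of the standard LAMA setup.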
In Table 2, we compare different approaches of querying by example. The left column shows the input to the model, i.e., the query. The right column shows BERT-large's top predictions, with their corresponding probabilities (obtained by applying a softmax over the logit outputs for the token vocabulary). The first row of the table shows that completing the is-a relation for the village Rodmarton is tricky for the model. Its top predictions are not even close to the correct answer, suggesting that BERT either does not know about this particular village or that the information need is not specified well enough. Interestingly, when prepending the query with another random example of the same relation (2nd row), the model's top predictions are town and the ground-truth village. This shows that BERT knows what type of instance Rodmarton is; only the extraction method (the cloze-style template) was not expressive enough.
When humans use examples, they typically do not use a completely random subject but one that is, by some measure, close to the subject at hand. In our introductory example, we used Ronaldo to exemplify an information need about Neuer. It would have been unnatural to use a musician here, even when describing a formally correct plays-for relation with them. We extend our approach by only using examples whose subject is close in latent space to the subject queried for. We use the cosine similarity between the subject encodings obtained with BERT-base. More formally, we encode a subject s as e(s), the BERT encoding of the [CLS] token for the input s given the BERT model's parameters θ. We then obtain the set S_n(s) of the top-n most similar subjects to s in the dataset by maximizing the cosine similarity, i.e., the n subjects s' ≠ s with the highest cos(e(s), e(s')).
From the top-n subset S_n(s) of most similar subjects, we randomly sample to obtain our priming examples. Table 2 (3rd row) shows the chosen close example for Rodmarton: Nantmor, another small village in the UK. Provided with this particular example, BERT-large predicts the ground-truth label village with more than 75% probability.
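The close-example selection can be sketched as follows (a minimal sketch with toy two-dimensional encodings standing in for BERT-base's [CLS] embeddings; the function names are illustrative):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def closest_subjects(subject, encodings, n):
    """The n subjects whose encodings are most cosine-similar to `subject`'s."""
    others = [s for s in encodings if s != subject]
    return sorted(others, key=lambda s: -cosine(encodings[subject], encodings[s]))[:n]

# Toy [CLS] encodings; in practice e(s) would be BERT-base's [CLS] embedding of s.
enc = {"Rodmarton": [1.0, 0.1], "Nantmor": [0.9, 0.2], "Madonna": [0.0, 1.0]}
closest_subjects("Rodmarton", enc, 1)  # → ["Nantmor"]
```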
gpt3 propose to use LMs as in-context learners. They suggest providing "training" examples in the model's context using the arrow operator, i.e., to express an (s, r, o) triplet they provide the model with "s → o". We can apply this concept to the LAMA data by using the template "[s] → [o]". In Table 2 (last row), we see that by providing a few examples of the is-a relation, BERT-large can rank the ground truth highest even though the relationship is never explicitly described in natural language. However, not using a natural language template makes the model less confident in its prediction, as can be seen by the lower probability mass it puts on the target.
We focus the reporting of the results on the mean precision at k (P@k) metric. In line with previous work (lama; petroni2020context; whatlmknow) (the P@1 score corresponds to whatlmknow's micro-averaged accuracy), we compute the results per relation and then average across all relations of the dataset. More formally, for a dataset that consists of relations r_1, …, r_m, where each relation r_i has multiple datapoints D_i, we compute the P@k score as

P@k = (1/m) Σ_{i=1}^{m} (1/|D_i|) Σ_{(s, r_i, o) ∈ D_i} 1[o ∈ V_k(s, r_i)],

where 1[·] denotes the indicator function that is 1 if the ground truth o is in the top-k prediction set V_k(s, r_i) for the input and 0 otherwise.
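This macro-averaged metric can be sketched as follows (a minimal sketch; `predict_top_k` stands in for the LM's top-k prediction set, and the helper names are illustrative):

```python
def mean_precision_at_k(dataset, predict_top_k):
    """Macro-averaged P@k: accuracy per relation, then mean over relations.

    dataset: {relation: [(subject, object), ...]}
    predict_top_k: function (subject, relation) -> set of top-k predictions
    """
    per_relation = []
    for relation, pairs in dataset.items():
        hits = sum(obj in predict_top_k(subj, relation) for subj, obj in pairs)
        per_relation.append(hits / len(pairs))
    return sum(per_relation) / len(per_relation)

# Toy data: one hit out of two for plays-for, one out of one for capital-of.
data = {"plays-for": [("Neuer", "Bayern"), ("Ronaldo", "Portugal")],
        "capital-of": [("Paris", "France")]}
predict = lambda subj, rel: {"Bayern", "France"}
mean_precision_at_k(data, predict)  # → (0.5 + 1.0) / 2 = 0.75
```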
In the result tables, we additionally mark when only close examples have been used. Since the choice of examples alters the predictions of the model and thus introduces randomness, we provide the standard deviation measured over 10 evaluations.
Table 9 shows the P@1 scores of different models and querying approaches across the LAMA probe's corpora. While providing additional examples proves detrimental on the Google-RE data, we see massive prediction performance gains for T-REx and ConceptNet. Most notably, the P@1 score of BERT-large on T-REx increases by 37.8% to 44.8% when providing 10 close examples. Similarly, the lower bound on Albert's performance for T-REx (ConceptNet) can be improved by up to 72.3% (25.0%) with 10 close examples.
For the Google-RE subset of the data, querying by example hurts the predictive capabilities of LMs. In the following, we provide an intuition for why we think this is the case. Looking at the baseline numbers of the individual relations for this data, we see that the performance is largely driven by predicting a person's birth and death place; the birth-date relation does not play a significant role because BERT is incapable of accurately predicting numbers (i.e., dates) (lin2020birds; wallace2019nlp). BERT-large predicts a person's birth and death place correctly 16.1% and 14.0% of the time, respectively, significantly lower than the 32.5% P@1 score among the relations of the T-REx data. Recent work describes that BERT is biased to predict that a person with, e.g., an Italian-sounding name is Italian (bertology; poerner2020ebert). We suspect that this bias helps BERT predict birth and death places without knowing the actual person, and therefore this subset is not an adequate test of an LM's factual knowledge. As a consequence, the predictions it makes are more prone to errors when influenced by previous examples.
Figure 2 depicts the mean precision at 1 on the T-REx corpus for a varying number of provided examples. It shows that even a few additional examples can significantly improve the performance of the LMs. However, the usefulness of additional examples saturates at around 10 examples already. Interestingly, with 10 examples, BERT-large even slightly improves upon the optimized paraphrase baseline from whatlmknow, while only requiring a single forward pass.
Table 4 shows the improvement in P@1 score for the individual relations that benefit most (and least) from additional examples for BERT-large. The relations for which demonstrations improve the performance the most typically have one thing in common: they are ambiguous. Prototypically ambiguous relations like located-in or is-a are among the top-benefiting relations. One rather atypical improvement candidate is the top-scoring relation, religion-affiliation. Suspiciously, this is also the relation most improved by the paraphrasing of whatlmknow. A closer look at the examples reveals the cause: the target object labels for the religions are provided as nouns (e.g., Christianity, Islam), while the template ([s] is affiliated with the [o] religion) suggests using the religion as an adjective (e.g., Christian, Islamic). Hence, both paraphrasing the sentence so that it clearly calls for a noun and providing example sentences that complete the template with nouns alleviate this problem. The relations that benefit the least from demonstrations are unambiguous ones, like capital-of or developed-by.
(Table 4: change in BERT-large's P@1 score per relation when providing additional demonstrations.)
|P140||[s] is affiliated with the [o] religion .||51.0||67.4||70.0|
|P30||[s] is located in [o] .||47.8||55.3||55.8|
|P136||[s] plays [o] music .||12.8||44.0||54.5|
|P31||[s] is a [o] .||8.2||20.3||24.4|
|P178||[s] is developed by [o] .||-8.3||-4.2||-6.8|
|P1376||[s] is the capital of [o] .||-16.3||-8.2||-8.6|
While T-REx probes for factual knowledge, the ConceptNet corpus is concerned with commonsense relations. The improvements from querying by example are significant, with 12%, 7.5%, and 25% relative improvements for BERT-base, BERT-large, and Albert-xxlarge, respectively.
More detailed plots for all the corpora and several metrics are provided in Appendix A.4.
5.1 The Change of Embedding
To further investigate the disambiguation effect of additional examples, we take a look at the latent space. In particular, we are interested in how the clusters of particular relations, formed by the queries' embeddings, change when the context is extended with additional examples. Figure 3 visualizes BERT-large's [CLS]-token embedding for queries from the T-REx corpus, using t-SNE (JMLR:v9:vandermaaten08a). The individual colors represent the relations of the queries. The first two images depict the clustering when using the natural language template without additional demonstrations (left) and with ten demonstrations (right). The fact that the clusters become better separated is visual evidence that providing examples disambiguates the information need expressed by the queries. The two plots on the right show the clustering when, instead of a natural language template, the subject and object are only separated by the arrow operator "→". Here, we see an even more significant change in separability when providing additional demonstrations, as the actual information need is more ambiguous.
5.2 TextWorld Commonsense Evaluation
An emerging field of interest inside the NLP community is text-based games (TBG). In these games, an agent is placed inside an interactive text environment and tries to complete specified goals using only language commands. To succeed, it requires a deep language understanding to decide which actions are reasonable to take in the scene and move it closer to its final goal. These environments are often modeled on real-world scenes to foster the commonsense-learning capabilities of an agent. The TextWorld Commonsense (TWC) game world by murugesan2020textbased focuses specifically on this aspect. There, the agent is placed in a typical modern-house environment and has to tidy up the room. This involves moving all the objects in the scene to their commonsense location, e.g., the dirty dishes belong in the dishwasher and not in the cupboard. murugesan2020textbased approach this problem by equipping the agent with access to a commonsense knowledge base. Replacing a traditional KB with an LM for this task is very intriguing, as the LM has relational knowledge stored implicitly and is capable of generalizing to similar objects. To test the feasibility of using LMs as a commonsense knowledge source in the TWC environment, we design the following experiment (details and pseudocode are provided in Appendix A.3): We use a static agent that picks up any misplaced object at random and puts it in one of the possible locations in the scene according to a prior computed with an LM. This prior is computed at the start of an episode for all object-location combinations in the scene. We use the arrow operator as described in Table 2 and vary the number of examples provided. In Figure 4, we show the result for albert-xxlarge on the hard games of TWC, compared to a simple uniform prior over the locations and murugesan2020textbased's RL agent with access to a commonsense KB.
We see the same trend as in the LAMA experiments: providing additional examples of the same relation boosts performance significantly and saturates after 10-15 instances.
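The prior computation described above can be sketched as follows (a minimal sketch with a toy scoring function; in the actual experiment the scores come from the LM's predictions for arrow-operator queries, and the function names are illustrative):

```python
import math

def location_prior(score, obj, locations):
    """Turn LM plausibility scores into a prior over candidate locations (softmax)."""
    logits = [score(obj, loc) for loc in locations]
    z = sum(math.exp(l) for l in logits)
    return {loc: math.exp(l) / z for loc, l in zip(locations, logits)}

# Toy scoring function; in practice the score would be the LM's logit for the
# location token in an arrow-operator query such as "dirty dishes → [MASK]".
score = lambda obj, loc: 2.0 if (obj, loc) == ("dirty dishes", "dishwasher") else 0.0
prior = location_prior(score, "dirty dishes", ["dishwasher", "cupboard"])
# prior["dishwasher"] ≈ 0.88
```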
5.3 Word Analogy Evaluation
To evaluate the usefulness of querying pre-trained language models by example for linguistic knowledge, we move to the word analogy task, a standard benchmark for non-contextual word embeddings. This evaluation is based on the premise that a good global word embedding defines a latent space in which basic arithmetic operations correspond to linguistic relations (mikolov-etal-2013-linguistic). With the rise of contextual word embeddings and large pre-trained language models, this evaluation has lost significance. However, we consider approaching this task from the angle of querying linguistic knowledge from an LM instead of performing arithmetic in latent space. By providing examples of the linguistic relation in a regular pattern in the context of the LM, we prime it to apply the relation to the final word, whose correspondence is masked out.
We consider the Bigger Analogy Test Set (BATS) (GladkovaDrozd2016) for our experiments. BATS consists of 40 different relations covering inflectional and derivational morphology, as well as lexicographic and encyclopedic semantics. Each relation consists of 50 unique word pairs. However, since most pre-trained LMs, including BERT and Albert, use subword-level tokens for their vocabulary, not all examples can be solved. In particular, 76.1% and 76.2% of the targets are contained in BERT’s and Albert’s vocabulary, respectively—upper bounding their P@1 performance.
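The vocabulary-coverage upper bound mentioned above can be computed as follows (a minimal sketch with a toy vocabulary; the function name is illustrative):

```python
def solvable_fraction(targets, vocab):
    """Share of gold targets that are single tokens in the LM's vocabulary;
    this upper-bounds the P@1 score achievable by vocabulary-restricted probing."""
    return sum(t in vocab for t in targets) / len(targets)

# Toy example: 3 of 4 analogy targets exist as single tokens in the vocabulary.
vocab = {"cars", "dogs", "better"}
solvable_fraction(["cars", "dogs", "oxen", "better"], vocab)  # → 0.75
```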
Figure 5 depicts the P@1 score (which corresponds to GladkovaDrozd2016's reported accuracy score) for the individual LMs on BATS. Noticeably, also on this task the LMs benefit from additional examples up to a certain threshold, after which the usefulness stagnates. Both BERT models do not beat GladkovaDrozd2016's GloVe (pennington-etal-2014-glove) benchmark. This is in part because not all targets are present in the token vocabulary. Considering only the solvable word pairs, BERT-large achieves a P@1 score of 30.6% with 15 examples, beating the GloVe baseline's 28.5%. Interestingly, Albert-xxlarge outperforms all other models, including the baselines, by a large margin. Figure 7 in Appendix A.4 breaks down the LMs' performance across the different relations of BATS and compares it against the GloVe baseline. Albert beats GloVe on almost all relations where its vocabulary does not limit it; the most significant improvements are in the derivational morphology and lexicographic semantics categories. It is outperformed by GloVe on only two relations: country:capital and UK city:county. Especially the former country:capital category is very prominent and constituted 56.7% of all semantic questions in the original Google test set (mikolov2013efficient), potentially influencing the design and tuning of non-contextual word embeddings.
Augmenting the context of LMs with demonstrations is a very successful strategy to disambiguate the query. Notably, on T-REx it is as successful as using an ensemble of multiple paraphrases. The benefit of additional examples decreases when the information need is clear to the model; this is the case for unambiguous prompts or when enough (around 10) demonstrations are provided. Even in the extreme case of ambiguity, for example, when the arrow operator ([s] → [o]) is used to indicate a relation, providing only a handful of examples clarifies the relation sufficiently in many cases. We showed that the usefulness of providing additional demonstrations quickly vanishes. Hence, when one has access to more labeled data and the option to re-train the model, a fine-tuning strategy is still better suited to maximize the performance on a given task. Moreover, casting NLP problems as language modeling tasks only works as long as the target is a single-token word of the LM's vocabulary. While technically large generation-based LMs such as GPT (gpt3; gpt2) or T5 (T5) can generate longer sequences, it is not clear how to compare solutions of varying length.
In this work, we explored the effect of providing examples when probing LMs for relational knowledge. We showed that already a few demonstrations, supplied in the context of the LM, disambiguate the query to the same extent as using an optimized ensemble of multiple paraphrases. We base our findings on experimental results on the LAMA probe, the BATS word analogy test, and a TBG commonsense evaluation. On the T-REx corpus' factual relations, providing 10 demonstrations improves BERT's P@1 performance by 37.8%. Similarly, on ConceptNet's commonsense relations, Albert's performance improves by 25% with access to 10 examples. We conclude that providing demonstrations is a simple yet effective strategy to clarify ambiguous prompts to a language model.
Appendix A Appendices
a.1 Implementation Details
The source code to reproduce all the experiments is available at https://github.com/leox1v/lmkb_public.
All individual runs reported in the paper can be carried out on a single GPU (TESLA P100 16GB), though speedups can be realized when using multiple GPUs in parallel. The wall-clock runtime for the corpora of the LAMA probe is shown in Table 5.
All models used in this work are accessed via Huggingface's list of pre-trained models for PyTorch (DBLP:journals/corr/abs-1910-03771). Further details about these models are provided on the following webpage: https://huggingface.co/transformers/pretrained_models.html.
|Corpus||Model||# Parameters||Avg. Input Length||Runtime [s]|
a.2 The Choice of Template
When providing examples, we give the model the chance to understand the relationship for which we query without providing additional instructions. This naturally raises the question of whether natural language templates are even necessary to query LMs. Most prominently, the in-context learning of gpt3 shows that large LMs can complete patterns even when they are not provided in natural language. In particular, they use the "="-operator to express the relation between input and output. In Figure 6, we compare the natural language cloze-style template against three different non-language templates: (i) [s] = [o], (ii) [s] - [o], (iii) ([s]; [o]). Surprisingly, gpt3's "="-operator performs the worst for BERT-large on T-REx, while separating the subject and object by a semicolon works best: almost on par with the natural language template after providing just a single example. This result underlines BERT's remarkable pattern-matching capabilities and suggests that a natural language description of the relation is not always needed, even when querying relatively small LMs.
a.3 Details TextWorld Commonsense Evaluation
Text-based games (TBG) are computer games in which the sole modality of interaction is text. Classic games like Zork (infocom) used to be played by a large fan base worldwide. Today, they provide interesting challenges for the research field of interactive NLP. With the TextWorld framework by DBLP:journals/corr/abs-1806-11532, it is possible to design custom TBGs, allowing one to adapt the objects, locations, and goals to the investigated research objectives. TBGs in this framework range from treasure hunting (DBLP:journals/corr/abs-1806-11532) to cooking recipes (adhikari2021learning; DBLP:journals/corr/abs-1909-01646) or, as in the experiment at hand, tidying up a room (murugesan2020textbased). murugesan2020textbased designed the TextWorld Commonsense environment TWC around the task of cleaning up a modern house environment to probe an agent's commonsense abilities. For example, a successful agent should understand that dirty dishes belong in the dishwasher while clean dishes belong in the cupboard. murugesan2020textbased approach this problem by developing an agent that, through a graph-based network, has access to relevant facts from the ConceptNet (speer2018conceptnet) commonsense knowledge base. Here, the obvious downside of static KBs for commonsense knowledge extraction becomes apparent: they do not generalize to unlisted object-location pairs. Hence, slight deviations from typical entities require additional processing before the KB can be queried. A large pre-trained LM seems better suited for this task due to its querying flexibility and generalization capabilities. We test these abilities by designing a static agent, described in Algorithm 1, that has access to a large pre-trained LM.
a.4 Omitted Figures