Learning To Retrieve Prompts for In-Context Learning

by   Ohad Rubin, et al.
Tel Aviv University

In-context learning is a recent paradigm in natural language understanding, where a large pre-trained language model (LM) observes a test instance and a few training examples as its input, and directly decodes the output without any update to its parameters. However, performance has been shown to strongly depend on the selected training examples (termed prompt). In this work, we propose an efficient method for retrieving prompts for in-context learning using annotated data and a LM. Given an input-output pair, we estimate the probability of the output given the input and a candidate training example as the prompt, and label training examples as positive or negative based on this probability. We then train an efficient dense retriever from this data, which is used to retrieve training examples as prompts at test time. We evaluate our approach on three sequence-to-sequence tasks where language utterances are mapped to meaning representations, and find that it substantially outperforms prior work and multiple baselines across the board.



1 Introduction

The striking language skills and world knowledge embedded in large pre-trained language models (LMs) (Devlin et al., 2019; Raffel et al., 2020; Brown et al., 2020; Petroni et al., 2019) have recently led to in-context learning, a new paradigm in natural language understanding. Under this paradigm, a language model is given a prompt, which typically contains a few training examples, as well as a test instance as input, and then the LM generates the output for the test instance directly, without any update to its parameters. This approach was first introduced in GPT-3 (Brown et al., 2020), but has quickly spread to even larger LMs, such as Jurassic-1 (Lieber et al., 2021) and GLaM (Du et al., 2021).

Figure 1: An overview of prompt retrieval: Given a question from Break, we retrieve similar training examples from an index of the training set. The question and training examples (the prompt) are passed to an inference LM that decodes the output.

An attractive property of in-context learning is that it provides a single model for multiple language understanding tasks. However, it has been shown that downstream performance can vary widely conditioned on the choice of in-context examples (Liu et al., 2021a). This has sparked interest in prompt retrieval (see Fig. 1), where given a test instance, training examples are chosen for the prompt based on some similarity metric. Recent work has either used off-the-shelf unsupervised similarity metrics, or trained a prompt retriever to select examples based on surface similarity Das et al. (2021).

Figure 2: An overview of our approach for training EPR. Given a training example, we use an unsupervised retriever to obtain a set of candidates. We then pass the candidates to a scoring LM and label the top-$k$ and the bottom-$k$ as positive and negative examples, respectively. Last, we use this training data to train a dense retriever.

In this work, we propose using language models themselves to label examples that can serve as good prompts, and training a prompt retriever from this signal. To train the retriever (see Fig. 2), we assume access to a training set of input-output pairs and to a scoring LM, i.e., a language model that will be used to score prompts. For each training example $(x, y)$, we go over other candidate training examples, and estimate the probability, according to the scoring LM, of $y$ conditioned on $x$ and the candidate prompt. We label training examples that lead to high probability as positive examples and those that lead to low probability as negative examples, and train a prompt retriever from this data using contrastive learning. We argue that using an LM for labeling examples is a better proxy for training a retriever than previously proposed surface similarity heuristics. Importantly, when creating the training data, we have access to the gold label $y$, which can be used to obtain a high-quality set of candidate prompts. This leads to good positive examples and hard negative examples, which are beneficial for training with a contrastive objective.

Using a scoring LM to train an efficient retriever for a potentially different test time inference LM is beneficial in two scenarios. First, when the scoring LM is smaller than the inference LM and serves as a proxy for it. This results in cheap and efficient data generation for the retriever, accessible to a wide range of researchers. Second, our approach can be used even when the scoring and inference LMs are identical (e.g., both are GPT-3). This makes sense when we cannot access model weights and only use it as a service, an increasingly popular paradigm. In this case, we use the LM to train a much lighter-weight retriever that is only tasked with learning a similarity function. More generally, given that the scale of LMs is likely to keep increasing in the foreseeable future, one can view our approach for Efficient Prompt Retrieval, or EPR, as a method for interfacing and learning to interact with large LMs.

We empirically test EPR on three structured sequence-to-sequence tasks, where input natural language utterances are mapped to a meaning representation: MTop (Li et al., 2021) and SMCalFlow (Andreas et al., 2020), which focus on task-oriented dialogue, and Break (Wolfson et al., 2020), a benchmark for mapping questions to a language-based meaning representation. We observe that EPR substantially improves performance compared to prior work on prompt retrieval. When the scoring LM and inference LM are identical (using GPT-Neo (Black et al., 2021)), performance compared to the best baseline improves from 26% to 31.9% on Break, from 57% to 64.2% on MTop, and from 51.4% to 54.3% on SMCalFlow. When using GPT-Neo as a proxy for larger LMs (GPT-J, GPT-3, and Codex), we observe similar results, where performance improves substantially in all cases.

To conclude, we propose a new approach for retrieving training examples for in-context learning in large language models, and show it substantially outperforms prior methods. Given recent developments in scaling language models, designing efficient methods for interacting with them is an important direction for future research. All of our code and data are publicly available at https://github.com/OhadRubin/EPR.

2 Background: Prompt Retrieval

Problem setup

Given a training set $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$ of input-output sequences, and a test example $x_{test}$, our goal is to train a retriever model, $R(x_{test}, \mathcal{D})$, that will retrieve a subset of training examples $\mathcal{P} = \{(x_j, y_j)\}_{j=1}^{m} \subset \mathcal{D}$, where $m \ll n$. We succinctly refer to the set of training examples $\mathcal{P}$ as the prompt. (Footnote: The term prompt is often used to refer to a natural language template filled by an input example (Liu et al., 2021b), but here it denotes a set of training examples provided as input to the LM.)

Assuming access to an inference LM, $g$, a good prompt should lead to the target output sequence when the test example $x_{test}$ is concatenated to the prompt $\mathcal{P}$ and passed as a prefix to $g$. Specifically, decoding from the LM $g([\mathcal{P}; x_{test}])$ should yield $y_{test}$. In this work, we focus on structured tasks, such as semantic parsing, where $x$ is a natural language utterance and $y$ is a meaning representation for that utterance.
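To make the setup concrete, the following is a minimal sketch of how a prompt and a test input might be concatenated into a single prefix for the LM. The tab and newline delimiters and the function name `build_prefix` are illustrative assumptions, not the template used by the authors.

```python
def build_prefix(prompt_examples, test_input, pair_sep="\t", example_sep="\n"):
    """Concatenate (input, output) training pairs, then append the test input.

    The delimiters are assumptions made for illustration only.
    """
    parts = [f"{x}{pair_sep}{y}" for x, y in prompt_examples]
    # The test input ends with the pair separator so that the LM
    # continues by generating the output sequence.
    parts.append(f"{test_input}{pair_sep}")
    return example_sep.join(parts)


prefix = build_prefix(
    [("call Zoey's wife.", "[IN:CREATE_CALL ...]")],
    "Give me the weather for March 13th.",
)
```

Decoding from the LM on `prefix` then corresponds to generating an output for the test input conditioned on the retrieved examples.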

Prior work

Liu et al. (2021a) investigated the effect of different prompts on the performance of GPT-3 and demonstrated that the choice of in-context examples strongly affects downstream performance. Consequently, they used an unsupervised sentence encoder to encode the training examples, and retrieved for every test instance the nearest training examples.

Das et al. (2021) proposed to train a supervised prompt retriever for knowledge-base question answering. The retriever was trained with supervision that is tailored for knowledge-base queries, and relies on surface similarity between formal queries. Conversely, our approach takes advantage of the generative LM itself and is thus more general.

Shin et al. (2021) used GPT-3 to select examples for the prompt in the context of few-shot semantic parsing. However, rather than training a retriever, they randomly sample a large set of question-program pairs from the training set, and choose those that are similar to the target instance question according to GPT-3. This results in an expensive inference procedure, where GPT-3 is run hundreds of times for each test instance, unlike our approach, which only uses a light-weight sub-linear retriever at test time.

3 Efficient Prompt Retriever

We now describe our method for training EPR, an efficient prompt retriever for in-context learning. We first describe how to generate labeled data (Section 3.1), and then how to use the training data for training and inference (Section 3.2). Fig. 2 provides an overview of the training procedure.

3.1 Generating the Training Data

Our approach relies on finding which training examples can serve as good prompts for other training examples. Scoring all pairs of training examples is quadratic in $|\mathcal{D}|$, and thus prohibitive. Hence, we need a method for choosing a set of candidate examples $\bar{\mathcal{E}}$, from which we will choose positive and negative examples for training. Importantly, since we are not at test time and are only generating data for training, we can use the target sequence $y$ to retrieve a good set of candidates. This leads to a problem that can be attacked by simple retrievers, given that our goal is to retrieve training examples that are similar to the input in terms of their output sequence, $y$.

To obtain a high-quality candidate set of training examples, we take advantage of an unsupervised retriever, $R_u$. For the choice of the unsupervised retriever, we experiment with BM25 (Robertson and Zaragoza, 2009), a sparse retriever that relies on surface text similarity, and SBERT (Reimers and Gurevych, 2019), which is based on dense sentence encoding. For both BM25 and SBERT, we experimented with passing the retriever the training pair $(x, y)$ or the target sequence $y$ only, and found that using $y$ leads to slightly higher performance.
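The candidate retrieval step can be sketched with a small self-contained Okapi BM25 scorer over tokenized target sequences. This is an illustrative implementation, not the authors' code: the parameters `k1` and `b`, the whitespace tokenization, and the choice to skip the query example itself are all assumptions.

```python
import math
from collections import Counter


def bm25_candidates(query_tokens, corpus_tokens, top_l=50, k1=1.5, b=0.75, skip=None):
    """Return indices of the top_l corpus entries ranked by BM25 against the query."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: in how many entries each token appears.
    doc_freq = Counter(tok for doc in corpus_tokens for tok in set(doc))
    scored = []
    for idx, doc in enumerate(corpus_tokens):
        if idx == skip:  # assumption: an example is not its own candidate
            continue
        tf = Counter(doc)
        score = 0.0
        for tok in set(query_tokens):
            if tok not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[tok] + 0.5) / (doc_freq[tok] + 0.5))
            norm = tf[tok] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[tok] * (k1 + 1) / norm
        scored.append((score, idx))
    # Highest score first; ties broken by index for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:top_l]]


# Toy target sequences (meaning representations), tokenized on whitespace.
targets = [t.split() for t in ["return flights", "return cities", "return flights of airports"]]
candidates = bm25_candidates(targets[0], targets, top_l=2, skip=0)
```

Here the query is the target sequence of the first training example, mirroring the finding that retrieving with the target sequence alone works slightly better.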

Scoring the candidate set

Once we retrieve the set of candidates $\bar{\mathcal{E}} = \{\bar{e}_1, \dots, \bar{e}_L\}$ for a training example $(x, y)$ (we omit the dependence of $\bar{\mathcal{E}}$ on $(x, y)$ for simplicity), we score each candidate $\bar{e}_l \in \bar{\mathcal{E}}$ independently with a scoring LM, $\hat{g}$, which serves as a proxy for the inference LM, $g$. Specifically, the score for a candidate prompt is

$$s(\bar{e}_l) = \mathrm{Prob}_{\hat{g}}(y \mid \bar{e}_l, x),$$

which is the probability under the LM, $\hat{g}$, of the output sequence conditioned on the candidate prompt concatenated to the input sequence. This indicates how helpful this candidate is for decoding the target (independent of all other candidates). We argue this score is a better proxy for the utility of a training example at inference time compared to prior approaches.
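For an autoregressive LM, this probability factorizes over the output tokens, so the score can be accumulated as a sum of token log-probabilities under teacher forcing. The sketch below assumes a hypothetical callable `next_token_logprob(prefix, token)` standing in for the scoring LM; `toy_logprob` is a made-up stand-in that simply favours tokens already present in the context, and is not a real language model.

```python
import math


def score_candidate(candidate_tokens, x_tokens, y_tokens, next_token_logprob):
    """Log-probability of y given the candidate prompt followed by x."""
    prefix = list(candidate_tokens) + list(x_tokens)
    total = 0.0
    for tok in y_tokens:
        total += next_token_logprob(prefix, tok)
        prefix.append(tok)  # teacher forcing: condition on the gold token
    return total


def toy_logprob(prefix, token, vocab_size=50):
    """Hypothetical stand-in LM: tokens seen in the context are 5x more likely."""
    seen = set(prefix)
    weight = 5.0 if token in seen else 1.0
    normalizer = 5.0 * len(seen) + 1.0 * (vocab_size - len(seen))
    return math.log(weight / normalizer)


x, y = ["q"], ["a", "b"]
helpful = score_candidate(["a", "b"], x, y, toy_logprob)
unhelpful = score_candidate(["c", "d"], x, y, toy_logprob)
```

Working in log space avoids numerical underflow for long outputs; ranking candidates by log-probability gives the same order as ranking by probability.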

We apply this scoring procedure to all training examples, and then define for each training example a set of positive examples $\mathcal{E}_{pos}$, which includes the top-$k$ candidates in $\bar{\mathcal{E}}$ according to $s(\bar{e}_l)$, and a set of negative examples $\mathcal{E}_{neg}$, which includes the bottom-$k$ candidates in $\bar{\mathcal{E}}$ according to $s(\bar{e}_l)$. This should lead to relevant positive examples, assuming that the set of candidates, $\bar{\mathcal{E}}$, includes good prompt candidates, and to hard negatives, since all candidates have high similarity with $(x, y)$ according to $R_u$. With positive and negative examples at our disposal, we can now apply contrastive learning, which we describe next.
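The labeling step amounts to sorting candidates by their score and taking both ends of the ranking. A minimal sketch, with the function name `label_candidates` chosen here for illustration and with the assumption that the candidate set holds at least 2k items so the two sets do not overlap:

```python
def label_candidates(candidates, scores, k):
    """Split scored candidates into top-k positives and bottom-k negatives."""
    if len(candidates) < 2 * k:
        raise ValueError("need at least 2k candidates for disjoint sets")
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    ordered = [cand for _, cand in ranked]
    return ordered[:k], ordered[-k:]


positives, negatives = label_candidates(
    ["a", "b", "c", "d", "e"], [0.1, 0.9, 0.5, 0.3, 0.7], k=2
)
```

Because every candidate was already retrieved as similar to the training example, the bottom-k items are hard negatives rather than random ones.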

3.2 Training and Inference


Our training procedure proceeds exactly like the contrastive learning procedure from DPR (Karpukhin et al., 2020). This procedure results in an input encoder $E_X(\cdot)$, which receives the sequence of input tokens, $x$, and a prompt encoder $E_P(\cdot)$, which receives a candidate prompt, namely, a concatenation of the tokens in an input-output pair. Both encoders are initialized from BERT-base (Devlin et al., 2019), and the output vector representation of both the input encoder and the prompt encoder is given by the CLS token, as usual. The goal of training is to learn a similarity metric such that a test example $x_{test}$ will be similar to training examples that lead to decoding of $y_{test}$.

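In each training batch, we sample $B$ training examples. For every batch example $(x_i, y_i)$, we randomly sample one positive example $e_i^{+}$ from its corresponding set $\mathcal{E}_{pos}^{(i)}$ and one negative example $e_i^{-}$ from $\mathcal{E}_{neg}^{(i)}$. We define the similarity score between an input and an input-output pair to be the inner product $\mathrm{sim}(x, e) = E_X(x)^{\top} E_P(e)$. We can now define the typical contrastive learning objective and minimize for each example the negative log likelihood of the positive example:

$$L(x_i, e_i^{+}, e_{i,1}^{-}, \dots, e_{i,2B-1}^{-}) = -\log \frac{\exp(\mathrm{sim}(x_i, e_i^{+}))}{\exp(\mathrm{sim}(x_i, e_i^{+})) + \sum_{j=1}^{2B-1} \exp(\mathrm{sim}(x_i, e_{i,j}^{-}))},$$

where, as in DPR, the negatives $e_{i,1}^{-}, \dots, e_{i,2B-1}^{-}$ consist of the example's own negative $e_i^{-}$ together with the positive and negative examples sampled for the other batch examples.

An advantage of this approach is that for a batch size $B$ the effective batch size is of order $B^2$, due to the use of the in-batch negatives trick (Henderson et al., 2017).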
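The in-batch negatives objective can be sketched in a few lines of plain Python. This is an illustrative sketch under the assumption that vectors are already encoded: each input is scored against the pool of all positives and negatives in the batch, and only its own positive counts as the correct item. The function names are chosen here for illustration.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def in_batch_contrastive_loss(inputs, positives, negatives):
    """Mean negative log-likelihood of each input's own positive example.

    inputs[i], positives[i] and negatives[i] are encoded vectors for batch
    example i. The candidate pool holds 2B vectors, so each input sees
    one positive and 2B - 1 negatives.
    """
    pool = positives + negatives
    losses = []
    for i, query in enumerate(inputs):
        logits = [dot(query, cand) for cand in pool]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_z - logits[i])  # -log softmax at the positive
    return sum(losses) / len(losses)


queries = [[1.0, 0.0], [0.0, 1.0]]
hard_negatives = [[-1.0, 0.0], [0.0, -1.0]]
aligned = in_batch_contrastive_loss(queries, [[1.0, 0.0], [0.0, 1.0]], hard_negatives)
swapped = in_batch_contrastive_loss(queries, [[0.0, 1.0], [1.0, 0.0]], hard_negatives)
```

With B inputs and 2B candidates, one batch yields on the order of B^2 scored pairs from only 3B encoder passes, which is where the larger effective batch size comes from.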


After training the input encoder and prompt encoder, we encode the entire set of training examples with $E_P(\cdot)$ in a pre-processing step using FAISS (Johnson et al., 2017). At test time, given an input sequence $x_{test}$, we compute its encoding $E_X(x_{test})$, and then use maximum inner-product search over the training data to find the $L$ most similar training examples, sorted by their inner product (from high to low): $\mathcal{P} = (e_1, \dots, e_L)$. The final prompt $\mathcal{P}'$ is determined by the maximal context size supported by the inference LM, $g$. Specifically, $\mathcal{P}' = (e_1, \dots, e_{L'})$, where $L' \le L$ is the largest such that $\sum_{i=1}^{L'} |e_i| + |x_{test}| + |y'| \le C$, where $|y'|$ is an upper bound on the length of the generated output, and $C$ is the maximal context size supported by $g$. Finally, we return the output of greedy decoding on $g([\mathcal{P}'; x_{test}])$.
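The context-budget rule can be sketched as a greedy walk down the ranked list that stops at the first example that no longer fits. The function name `select_prompt` is hypothetical, lengths are counted in tokens, and placing the most similar example last (closest to the test input) is an assumption of this sketch rather than something stated in the text above.

```python
def select_prompt(ranked_examples, test_input_len, max_output_len, context_size):
    """Keep the longest prefix of the ranked list that fits the context window.

    ranked_examples: token lists, most similar first.
    """
    budget = context_size - test_input_len - max_output_len
    chosen, used = [], 0
    for example in ranked_examples:
        cost = len(example)
        if used + cost > budget:
            break  # the largest fitting prefix has been reached
        chosen.append(example)
        used += cost
    # Assumption: reverse so the most similar example sits next to the test input.
    return chosen[::-1]


ranked = [["a"] * 3, ["b"] * 4, ["c"] * 5]
prompt = select_prompt(ranked, test_input_len=2, max_output_len=2, context_size=12)
```

In this toy run the budget is 8 tokens, so the first two examples (7 tokens) are kept and the third is dropped.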

We note that while at training time we score each training example independently, at test time the language model observes a prompt, i.e., a set of examples. We leave modeling the dependence between different training examples to future work.

4 Experimental Results

We now describe our experimental evaluation of EPR, where we compare EPR to a wide range of unsupervised and supervised baselines, both when the scoring LM, $\hat{g}$, is smaller than the inference LM, $g$, and when they are identical.

| Dataset | Size | Utterance | Meaning Representation |
|---|---|---|---|
| Break | 52K | There are more birds in the image on the right than in the image on the left. | 1) return right image; 2) return birds in #1; 3) return number of #2; 4) return left image; 5) return birds in #4 6) return number of #5; 7) return if #3 is higher than #6; |
| MTop | 17K | call Zoey’s wife. | [IN:CREATE_CALL = [SL:CONTACT = [IN:GET_CONTACT = [SL:CONTACT_RELATED = Zoey] [SL:TYPE_RELATION = wife]]]] |
| | | Give me the weather for March 13th. | [IN:GETWEATHER [SL:DATE_TIME for March 13th ] ] |
| SMCalFlow | 148K | Can you create me a new meeting on thursday morning? | (Yield (CreateCommitEventWrapper (CreatePreflightEventWrapper (Event.start_? (DateTimeConstraint (Morning) (NextDOW (Thursday))))))) |
| | | Schedule lunch for the late afternoon today. | (Yield(CreateCommitEventWrapper (CreatePreflightEventWrapper (& (Event.subject_? (?= "lunch")) (Event.start_? (DateTimeConstraint (LateAfternoon) (Today))))))) |

Table 1: The size and 1-2 examples from each of the datasets we evaluate on.

4.1 Datasets

We focus on tasks that map utterances to meaning representations, where in-context examples can be used to learn the mapping from inputs to outputs. Examples from each dataset and the number of examples are in Table 1.


  • Break (Wolfson et al., 2020) is a dataset that maps complex natural language questions into a language-based meaning representation, where a question is decomposed into an ordered list of atomic steps expressed in natural language. We use the low-level Break subset, which provides a more fine-grained decomposition over a wide range of domains and modalities. Break contains 44K training examples and 8K development set examples.

  • MTop (Li et al., 2021) is a recent semantic parsing dataset, focused on task-oriented dialogues, where commands are mapped to complex nested queries across 11 domains, such as alarm, messaging, music, recipes, etc. Similar to past work Pasupat et al. (2021), we use the English subset of MTop, which contains 16K training examples and 2K development set examples.

  • SMCalFlow (Andreas et al., 2020) is a large English-language task-oriented dataset that covers tasks such as calendar, weather, places, and people. The meaning representation is a dataflow program, which includes API calls, function composition and complex constraints. SMCalFlow includes 134K training examples and 15k development set examples, from which we sample a random 44K for training.

4.2 Baselines and Oracles

We consider the following unsupervised baselines, which are applied at test time only.


  • Random: We randomly sample examples from the training set $\mathcal{D}$.

  • SBERT: We use SentenceTransformers, a library providing BERT-based sentence embeddings (https://www.sbert.net/index.html). Specifically, we use paraphrase-mpnet-base-v2, a 110M-parameter model, to encode the test utterance $x_{test}$ and retrieve the examples with the most similar utterances for the prompt.

  • BM25: We use the classical sparse retrieval method BM25 (Robertson and Zaragoza, 2009), which is an extension of TF-IDF, to retrieve for each test utterance $x_{test}$ the training examples with the most similar utterance.

  • BruteForce: We apply the dynamic prompt selection method for few-shot semantic parsing from Shin et al. (2021). Given a test example $x_{test}$, we randomly sample 200 training examples. For each training example $(x_i, y_i)$, we compute $\mathrm{Prob}_{g}(x_{test} \mid x_i)$, and use the highest scoring examples for the prompt. Similar to us, this approach uses the inference LM to choose prompt examples. However, it does so at test time, which results in very slow inference compared to a sub-linear prompt retriever.

Next, we describe baselines that use the training set, $\mathcal{D}$, to train a prompt retriever. All supervised methods share the following procedure. First, a candidate set $\bar{\mathcal{E}}$ is retrieved with the unsupervised retriever $R_u$. We use BM25 as the unsupervised retriever, since it outperformed SBERT (see Section 4.4). Moreover, we retrieve with $y$ only, since this outperformed retrieving with the pair $(x, y)$. We then score each candidate prompt with some scoring function, and label the top-$k$ prompts as positive examples and the bottom-$k$ as negative examples. The supervised methods differ only in the scoring function itself.


  • DR-BM25: Here, we use the original BM25 scores for labeling positive and negative examples and training a dense retriever.

  • Case-based Reasoning (CBR): We adapt to our setup the scoring function from Das et al. (2021), who performed prompt retrieval in the context of knowledge-base question answering. Specifically, Das et al. (2021) defined the weight for a pair of logical forms to be the F1 score between the two sets of relations appearing in those logical forms, and used this weight to softly label their data. Since in our setting we do not assume logical forms, we define the score between two output sequences $y_i$ and $y_j$ to be the F1 between the two sets of tokens in $y_i$ and $y_j$, omitting stop words.

  • Efficient Prompt Retrieval (EPR): Our full approach from Section 3, where we score candidate prompts with the scoring LM.
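The token-set F1 used as the CBR scoring function can be sketched as follows. Whitespace tokenization and the stop-word list passed in the usage example are assumptions made for illustration; the text above does not specify either.

```python
def token_set_f1(y_a, y_b, stop_words=frozenset()):
    """F1 between the sets of non-stop-word tokens of two output sequences."""
    set_a = {tok for tok in y_a.split() if tok not in stop_words}
    set_b = {tok for tok in y_b.split() if tok not in stop_words}
    overlap = len(set_a & set_b)
    if overlap == 0:
        return 0.0  # also covers the case where either set is empty
    precision = overlap / len(set_a)
    recall = overlap / len(set_b)
    return 2 * precision * recall / (precision + recall)


score = token_set_f1("return flights of airports", "return flights", stop_words={"of"})
```

In this toy pair the sets share two of three and two of two tokens, giving precision 2/3 and recall 1. The measure is symmetric, so swapping the two sequences gives the same score.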

Last, we consider two oracle models.


  • BM25-Oracle: We score test examples with BM25 using the gold output sequence $y_{test}$. This provides an upper bound on what can be learned by DR-BM25. EPR can potentially outperform this oracle, since its training signal goes beyond surface text similarity.

  • LM-Oracle: We use the procedure for labeling training data at test time. Given a test example $(x_{test}, y_{test})$, we first retrieve $L$ candidate training examples with $R_u(y_{test}, \mathcal{D})$, and then sort the candidate examples with the scoring LM $\hat{g}$, estimating the probability of $y_{test}$ given $x_{test}$ and the candidate prompt. This provides an upper bound for EPR, since EPR is trained to emulate this behaviour.

4.3 Experimental Details

Language models

In this work, we only train a dense retriever, but use scoring and inference LMs. For our scoring LM, $\hat{g}$, we use GPT-Neo (Black et al., 2021), a 2.7B-parameter LM trained on The Pile (Gao et al., 2021), an 825 GB English text corpus, constructed from a wide range of high-quality resources. In addition, we consider the following inference LMs:


  • GPT-J (Wang and Komatsuzaki, 2021): A 6B-parameter LM, also trained on The Pile. The advantage of this setup is that GPT-J was trained on the same corpus as GPT-Neo. However, it is only 2.2x larger.

  • GPT-3 Brown et al. (2020): A 175B-parameter model, trained mostly on a filtered subset of common crawl.

  • Codex Chen et al. (2021): A 175B-parameter model, trained mostly on code from GitHub. Since our tasks involve mapping from utterances to programs or meaning representations, Codex might potentially perform well at in-context learning.

For all LMs, we use a maximum context size of 2,048 tokens.


On Break, we evaluate performance with LF-EM (Hasson and Berant, 2021), proposed as an improvement to exact match (EM), as it measures whether two meaning representations are semantically equivalent. On MTop and SMCalFlow, we evaluate with EM, i.e., whether the string output by the inference LM is identical to the reference string.
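Exact match as described here reduces to string equality averaged over the evaluation set. A minimal sketch follows; stripping surrounding whitespace before comparison is an assumption of this sketch, and LF-EM (which checks semantic equivalence) is not covered.

```python
def exact_match_accuracy(predictions, references):
    """Percentage of predictions identical to their reference string."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    # Assumption: leading and trailing whitespace is ignored.
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)


accuracy = exact_match_accuracy(["a", "b ", "c"], ["a", "b", "x"])
```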

We evaluate EPR in two settings: (a) LM-as-a-service, and (b) LM-as-a-proxy. In the first setting, we use GPT-Neo as both the scoring LM and inference LM. In this setting, we evaluate our approach on the full development sets of Break, MTop, and SMCalFlow. In the latter setting, as we access GPT-3 and Codex through a paid API, we sample a random subset of 1,000 development examples from each dataset and evaluate a few methods on this subset only.

Training details

In all cases, the number of examples retrieved by the retriever, $L$, and the number of positive and negative examples, $k$, are held fixed. To train EPR, we use the Adam optimizer (Kingma and Ba, 2015) with batch size 120 and learning rate 1e-4 on eight RTX 3090 GPUs. We run training for 30 epochs. We used the default DPR hyperparameters without tuning.

4.4 Results

| Category | Model | Break | MTop | SMCalFlow |
|---|---|---|---|---|
| Unsuper. | Random | 1.7 | 7.3 | 8.9 |
| | SBERT | 21.6 | 48.7 | 43.6 |
| | BM25 | 26.0 | 52.9 | 46.1 |
| | BruteForce | 7.7 | 18.1 | 11.1 |
| Super. | DR-BM25 | 23.6 | 50.2 | 43.1 |
| | CBR | 25.7 | 57.0 | 51.4 |
| | EPR (ours) | 31.9 | 64.2 | 54.3 |
| Oracle | BM25-Oracle | 32.3 | 58.9 | 47.3 |
| | LM-Oracle | 43.1 | 71.6 | 73.7 |

Table 2: Development results when GPT-Neo is both the scoring and inference LM. Numbers shown for Break are LF-EM, and for MTop and SMCalFlow are EM accuracy.


Table 2 reports the results of the LM-as-a-service setup where the scoring and inference LMs are identical. We observe that EPR substantially outperforms all other baselines. Specifically, compared to the best baseline on each dataset, it improves performance from 26.0 to 31.9 on Break, from 57.0 to 64.2 on MTop, and from 51.4 to 54.3 on SMCalFlow. This shows that using the LM itself to label examples is an effective approach for obtaining a strong prompt retriever. In terms of supervised baselines, CBR outperforms DR-BM25 on MTop and SMCalFlow, and the two are comparable on Break.

For the unsupervised methods, Random provides a lower bound and demonstrates that random sampling of training examples leads to poor performance. BM25, which employs surface text similarity, outperforms the dense BERT-based retriever SBERT for prompt retrieval, and consequently we use BM25 in all of our supervised approaches to retrieve the set of candidates, $\bar{\mathcal{E}}$. Last, BruteForce performs worse than BM25. We assume this is because the training sets are large (14-120K examples), and sampling 200 random examples does not cover enough examples that are useful for GPT-Neo.

Interestingly, EPR outperforms BM25-Oracle on MTop and SMCalFlow and is comparable on Break. This is surprising since BM25-Oracle has access to the output sequence at test time, illustrating that the signal provided by the scoring LM for training goes beyond surface text similarity.

The performance of LM-Oracle is substantially higher than EPR. This shows that the supervision provided by the scoring LM is strong, and training a better retriever from this signal can substantially enhance performance.

| Category | Model | One-shot | Full-context |
|---|---|---|---|
| Unsuper. | Random | 1.1 | 1.7 |
| | BM25 | 15.2 | 26.0 |
| Super. | DR-BM25 | 14.1 | 23.6 |
| | CBR | 14.5 | 25.7 |
| | EPR | 23.0 | 31.9 |
| Oracle | BM25-Oracle | 18.0 | 32.3 |
| | LM-Oracle | 33.3 | 43.1 |
| | AnyCorrect-Oracle | 53.6 | - |

Table 3: Development results on Break with GPT-Neo in the one-shot setting. Numbers shown are LF-EM. Full-context shows the corresponding numbers from Table 2.

We further evaluate our models in the one-shot setup, that is, when the prompt given to the inference LM includes a single sample (the highest scoring one). In this setup, the inference LM is applied in the same setting as when we generate labeled data, where we go over each prompt candidate independently. Since train and test time are now closer, we can expect the advantage of EPR to be more pronounced.

Table 3 shows the results of this experiment. Indeed, we observe that EPR outperforms the best supervised baseline, CBR, by 8.5 points (and BM25, the best baseline overall, by 7.8 points), and even outperforms BM25-Oracle by 5 points. In addition, we examine AnyCorrect-Oracle, which tests whether any of the candidates returned by BM25 leads to the correct output. We observe that AnyCorrect-Oracle reaches 53.6%, 20 points above LM-Oracle. This shows the high quality of the list of candidates provided by BM25, as one can reach more than 50% LF-EM with just a single prompt. Moreover, it hints that a better scoring function can potentially further improve the efficacy of our approach.

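| Inference LM | Break: Random | Break: BM25 | Break: CBR | Break: EPR | MTop: Random | MTop: BM25 | MTop: CBR | MTop: EPR | SMCalFlow: Random | SMCalFlow: BM25 | SMCalFlow: CBR | SMCalFlow: EPR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-3 | 4.2 | 20.1 | 21.3 | 25.3 | 7.6 | 52.5 | 54.8 | 62.6 | 5.8 | 35.3 | 41.6 | 46.5 |
| Codex | 8.9 | 24.5 | 24.2 | 29.5 | 10.8 | 60.6 | 59.4 | 66.1 | 7.2 | 45.1 | 48.7 | 50.3 |
| GPT-J | 3.3 | 26.7 | 26.7 | 31.5 | 8.8 | 56.6 | 58.0 | 65.4 | 10.6 | 50.4 | 50.9 | 57.4 |
| GPT-Neo | 1.0 | 22.8 | 25.8 | 29.9 | 7.6 | 52.8 | 55.4 | 63.6 | 8.0 | 46.1 | 50.1 | 53.5 |

Table 4: Results on a random sample of 1,000 examples from the development set when using GPT-Neo as a scoring LM across different inference LMs and datasets.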


Table 4 shows results in the setup where the scoring LM is GPT-Neo and the inference LM is a larger LM (we also report GPT-Neo for reference). First, we observe that the trends are very similar to the LM-as-a-service setup, i.e., EPR substantially outperforms prior baselines, including our best unsupervised baseline, BM25, and the best supervised baseline, CBR, by 2-8 points on all datasets and all pre-trained models. Thus, GPT-Neo serves as a good proxy for choosing training examples.

Zooming in on different inference LMs, GPT-J performs slightly better than GPT-Neo across the board. This is expected, as the two models were trained on the same data using the same procedure and differ only in the number of their parameters. Codex outperforms GPT-3, which can be explained by the fact that it was fine-tuned on code tasks, and the datasets we experiment with involve mapping to programs or meaning representations. Curiously, however, GPT-J outperforms both Codex (except on MTop) and GPT-3 despite the fact that it is 30x smaller. This can perhaps be explained by the fact that GPT-J was trained on a different dataset (The Pile (Gao et al., 2021)), but we leave investigation of whether performance with GPT-3 and Codex can be further improved for future work.

4.5 Analysis

Test example
Utterance: Give the code of the airport with the least flights.
Meaning representation: 1) flights of #1 2) number of #2 for each #1 3) #1 where #3 is lowest 4) code of #4

| Rank | Field | EPR | CBR |
|---|---|---|---|
| Top-1 | Utterance | What is the code of the city with the most students? | What destination has the fewest number of flights? |
| | Meaning representation | 1) cities 2) students in #1 3) number of #2 for each #1 4) #1 where #3 is highest 5) code of #4 | 1) destinations 2) flights of #1 3) number of #2 for each #1 4) #1 where #3 is lowest |
| Top-2 | Utterance | Return the code of the city that has the most students. | Which destination has least number of flights? |
| | Meaning representation | 1) cities 2) students in #1 3) number of #2 for each #1 4) #1 where #3 is highest 5) code of #4 | 1) destinations 2) flights to #1 3) number of #2 for each #1 4) #1 where #3 is lowest |
| Top-3 | Utterance | Find the count and code of the job has most employees. | What is the number of airports per country, ordered from most to least? |
| | Meaning representation | 1) jobs 2) employees of #1 3) number of #2 for each #1 4) #1 where #3 is highest 5) employees of #4 6) number of #5 7) code of #4 8) #6 , #7 | 1) countries 2) airports in #1 3) number of #2 for each #1 4) #3 sorted by most to least |

Table 5: An example from the Break development set where EPR is correct and CBR is incorrect, along with the top-3 training examples retrieved by each retriever.

Example prompts

Table 5 shows an example from Break where EPR decodes the correct output, while CBR does not. All training examples retrieved by EPR perform an argmax operation (argmin in the original utterance), and return in the final step “a code”, while the third example retrieved by CBR does not perform an argmax or argmin operation, and all examples retrieved by CBR do not involve “a code”. We provide additional examples from MTop and SMCalFlow in the Appendix.

Recall@k of EPR

To look more closely at the retrieval results of EPR, we perform the procedure for labeling positive examples using the scoring LM on the development set. We then measure, for various values of $k$, whether EPR returns at least one of the positive examples in its top-$k$ prompts. We see that EPR performs quite well, retrieving at least one positive example in the top-50 prompts in more than 80% of the cases.

Figure 3: Recall@k for EPR on Break, w.r.t. labels given by the scoring LM. We compute Recall@k by checking whether the top-$k$ prompts retrieved include at least one example that was labeled as positive by the scoring LM.
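The recall measure described here can be sketched as the fraction of development examples whose top-k retrieved list contains at least one LM-labeled positive. The function name and the use of integer example IDs are illustrative assumptions.

```python
def recall_at_k(retrieved, positives, k):
    """Fraction of queries with at least one positive among the top-k results.

    retrieved: ranked lists of example IDs, one list per query.
    positives: sets of IDs labeled positive by the scoring LM, one per query.
    """
    hits = 0
    for ranked, pos in zip(retrieved, positives):
        if any(example_id in pos for example_id in ranked[:k]):
            hits += 1
    return hits / len(retrieved)


retrieved_ids = [[1, 2, 3], [4, 5, 6]]
positive_ids = [{3}, {9}]
curve = [recall_at_k(retrieved_ids, positive_ids, k) for k in (1, 2, 3)]
```

Evaluating the function over a range of k values gives the kind of curve summarized in Figure 3; the value can only stay the same or grow as k increases.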

5 Related Work


Research on training dense retrievers has skyrocketed recently, propelled by rising interest in open-domain question answering Chen et al. (2017); Lee et al. (2019); Karpukhin et al. (2020); Guu et al. (2020); Khattab and Zaharia (2020); Qu et al. (2021). Work on retrieval-based methods has also spread more widely to other knowledge-intensive tasks Lewis et al. (2020), e.g., fact verification Samarinas et al. (2021).

Similar to us, Pasupat et al. (2021) have recently proposed to use retrieval in the context of semantic parsing. However, their goal is not to improve in-context learning, but instead to control the output generated by a sequence-to-sequence model. Retrieval methods have also been successfully used in language modeling itself (Khandelwal et al., 2020; Borgeaud et al., 2021) and machine translation (Khandelwal et al., 2021).


Developing methods for interacting with large language models and extracting desired behaviours has attracted considerable attention recently, under the umbrella term prompting. In this work, prompts are simply a set of in-context training examples, but substantial effort has also been devoted to casting natural language tasks as language modeling by phrasing the target task in natural language (see extensive survey in Liu et al. (2021b)). Such approaches include prompt engineering through the use of manual patterns Petroni et al. (2019); Schick and Schütze (2021), and also methods for extracting either hard Shin et al. (2020); Haviv et al. (2021) or soft Li and Liang (2021); Zhong et al. (2021); Qin and Eisner (2021) prompts automatically.

Constrained decoding

Shin et al. (2021) used GPT-3 to select training examples for in-context learning. However, their focus was not on training a prompt retriever, but instead on representing logical forms with a pseudo-language, and applying constraints that are based on the formal language at decoding time to improve generation. Here, we do not explore constrained decoding, since this is orthogonal to our research question. However, when constraints can be applied, this is likely to further enhance performance.

6 Conclusions

Very large pre-trained LMs are becoming an inseparable part of the natural language understanding eco-system. However, accessing their weights or updating them through backpropagation can be prohibitive or even impossible for many researchers. In this work, we propose EPR, a method for learning to retrieve good prompts for in-context learning, by using language models themselves as the scoring function. This allows us to train a light-weight efficient retriever and substantially improve performance compared to strong baselines on three challenging sequence-to-sequence tasks.
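To make the idea of using an LM as the scoring function concrete, the sketch below labels candidate training examples as positive or negative according to a score. It is an illustration under stated assumptions, not the authors' implementation: `label_candidates` and `toy_score` are hypothetical names, taking the n highest- and n lowest-scoring candidates is an assumed labeling rule, and `toy_score` is a word-overlap stand-in for the LM probability of the target output given the candidate as a prompt.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]  # (input utterance, output meaning representation)


def label_candidates(
    target: Example,
    candidates: Sequence[Example],
    score: Callable[[Example, Example], float],
    n: int,
) -> Tuple[List[Example], List[Example]]:
    """Rank candidates by score(candidate, target); return (positives, negatives).

    Assumption: the n best-scoring candidates are positives and the n
    worst-scoring candidates are negatives.
    """
    ranked = sorted(candidates, key=lambda c: score(c, target), reverse=True)
    n = min(n, len(ranked) // 2)  # keep the two sets disjoint
    positives = ranked[:n]
    negatives = ranked[len(ranked) - n:]
    return positives, negatives


def toy_score(candidate: Example, target: Example) -> float:
    """Hypothetical stand-in for an LM score: token overlap between outputs."""
    return float(len(set(candidate[1].split()) & set(target[1].split())))


# Toy data, made up for illustration.
target = ("remind me to buy milk", "REMINDER buy milk")
candidates = [
    ("remind me to call mom", "REMINDER call mom"),
    ("what is the weather", "WEATHER today"),
    ("remind me to buy eggs", "REMINDER buy eggs"),
    ("play jazz", "MUSIC jazz"),
]
positives, negatives = label_candidates(target, candidates, toy_score, n=1)
print(positives)
print(negatives)
```

In a real setting the `score` callable would query the scoring LM; the resulting positive and negative sets could then serve as supervision for training a dense retriever.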

More broadly, given that large language models are going to play a prominent role in developing language understanding models in the future, it is important to develop approaches for interacting with such models effectively. EPR can be viewed as a step in this direction, and future work can further expand on this idea by retrieving better training examples, composing effective prompts, and post-editing the output generated by language models.


We thank Ori Ram for helpful suggestions. This research was supported in part by The Yandex Initiative for Machine Learning, and The European Research Council (ERC) under the European Union Horizon 2020 research and innovation programme (grant ERC DELPHI 802800). This work was completed in partial fulfillment of the requirements for the Ph.D. degree of Ohad Rubin.


  • J. Andreas, J. Bufe, D. Burkett, C. Chen, J. Clausman, J. Crawford, K. Crim, J. DeLoach, L. Dorner, J. Eisner, H. Fang, A. Guo, D. Hall, K. Hayes, K. Hill, D. Ho, W. Iwaszuk, S. Jha, D. Klein, J. Krishnamurthy, T. Lanman, P. Liang, C. H. Lin, I. Lintsbakh, A. McGovern, A. Nisnevich, A. Pauls, D. Petters, B. Read, D. Roth, S. Roy, J. Rusak, B. Short, D. Slomin, B. Snyder, S. Striplin, Y. Su, Z. Tellman, S. Thomson, A. Vorobev, I. Witoszko, J. Wolfe, A. Wray, Y. Zhang, and A. Zotov (2020) Task-oriented dialogue as dataflow synthesis. Transactions of the Association for Computational Linguistics 8, pp. 556–571. External Links: Document, Link Cited by: §1, 3rd item.
  • S. Black, L. Gao, P. Wang, C. Leahy, and S. Biderman (2021) GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow. Zenodo. External Links: Document, Link Cited by: §1, §4.3.
  • S. Borgeaud, A. Mensch, J. Hoffmann, T. Cai, E. Rutherford, K. Millican, G. v. d. Driessche, J. Lespiau, B. Damoc, A. Clark, et al. (2021) Improving language models by retrieving from trillions of tokens. arXiv preprint arXiv:2112.04426. Cited by: §5.
  • T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei (2020) Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (Eds.), External Links: Link Cited by: §1, 2nd item.
  • D. Chen, A. Fisch, J. Weston, and A. Bordes (2017) Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 1870–1879. External Links: Link, Document Cited by: §5.
  • M. Chen, J. Tworek, H. Jun, Q. Yuan, H. Ponde, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, Alethea. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. W. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, I. Babuschkin, S. A. Balaji, S. Jain, A. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, and W. Zaremba (2021) Evaluating large language models trained on code. ArXiv preprint abs/2107.03374. External Links: Link Cited by: 3rd item.
  • R. Das, M. Zaheer, D. Thai, A. Godbole, E. Perez, J. Y. Lee, L. Tan, L. Polymenakos, and A. McCallum (2021) Case-based reasoning for natural language queries over knowledge bases. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online and Punta Cana, Dominican Republic, pp. 9594–9611. External Links: Link Cited by: §1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. External Links: Document, Link Cited by: §1, §3.2.
  • N. Du, Y. Huang, A. M. Dai, D. L. Simon Tong, Y. Xu, M. Krikun, Y. Zhou, A. W. Yu, O. Firat, B. Zoph, L. Fedus, M. Bosma, Z. Zhou, T. Wang, Y. E. Wang, K. Webster, M. Pellat, K. Robinson, K. Meier-Hellstern, T. Duke, L. Dixon, K. Zhang, Q. V. Le, Y. Wu, Z. Chen, and C. Cui (2021) GLaM: efficient scaling of language models with mixture-of-experts. arXiv preprint arXiv:2112.06905. Cited by: §1.
  • L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe, C. Foster, J. Phang, H. He, A. Thite, N. Nabeshima, et al. (2021) The pile: an 800gb dataset of diverse text for language modeling. ArXiv preprint abs/2101.00027. External Links: Link Cited by: §4.3, §4.4.
  • K. Guu, K. Lee, Z. Tung, P. Pasupat, and M. Chang (2020) Retrieval augmented language model pre-training. In ICML, Cited by: §5.
  • M. Hasson and J. Berant (2021) Question decomposition with dependency graphs. ArXiv abs/2104.08647. Cited by: §4.3.
  • A. Haviv, J. Berant, and A. Globerson (2021) BERTese: learning to speak to BERT. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Online, pp. 3618–3623. External Links: Link, Document Cited by: §5.
  • M. Henderson, R. Al-Rfou, B. Strope, Y. Sung, L. Lukacs, R. Guo, S. Kumar, B. Miklos, and R. Kurzweil (2017) Efficient natural language response suggestion for smart reply. External Links: 1705.00652 Cited by: §3.2.
  • J. Johnson, M. Douze, and H. Jégou (2017) Billion-scale similarity search with GPUs. ArXiv preprint abs/1702.08734. External Links: Link Cited by: §3.2.
  • V. Karpukhin, B. Oguz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, and W. Yih (2020) Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 6769–6781. External Links: Document, Link Cited by: §3.2, §5.
  • U. Khandelwal, A. Fan, D. Jurafsky, L. Zettlemoyer, and M. Lewis (2021) Nearest Neighbor Machine Translation. In Proceedings of ICLR, Cited by: §5.
  • U. Khandelwal, O. Levy, D. Jurafsky, L. Zettlemoyer, and M. Lewis (2020) Generalization through memorization: nearest neighbor language models. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, External Links: Link Cited by: §5.
  • O. Khattab and M. Zaharia (2020) ColBERT: efficient and effective passage search via contextualized late interaction over BERT. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, pp. 39–48. Cited by: §5.
  • D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In Proceedings of ICLR, Cited by: §4.3.
  • K. Lee, M. Chang, and K. Toutanova (2019) Latent retrieval for weakly supervised open domain question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 6086–6096. External Links: Link, Document Cited by: §5.
  • P. S. H. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, S. Riedel, and D. Kiela (2020) Retrieval-augmented generation for knowledge-intensive NLP tasks. In Proceedings of NeurIPS, Cited by: §5.
  • H. Li, A. Arora, S. Chen, A. Gupta, S. Gupta, and Y. Mehdad (2021) MTOP: a comprehensive multilingual task-oriented semantic parsing benchmark. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Online, pp. 2950–2962. External Links: Link Cited by: §1, 2nd item.
  • X. L. Li and P. Liang (2021) Prefix-tuning: optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, pp. 4582–4597. External Links: Link, Document Cited by: §5.
  • O. Lieber, O. Sharir, B. Lenz, and Y. Shoham (2021) Jurassic-1: technical details and evaluation. White Paper. AI21 Labs. Cited by: §1.
  • J. Liu, D. Shen, Y. Zhang, B. Dolan, L. Carin, and W. Chen (2021a) What makes good in-context examples for GPT-3?. ArXiv preprint abs/2101.06804. External Links: Link Cited by: §1.
  • P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig (2021b) Pre-train, prompt, and predict: a systematic survey of prompting methods in natural language processing. External Links: 2107.13586 Cited by: §5, footnote 1.
  • P. Pasupat, Y. Zhang, and K. Guu (2021) Controllable semantic parsing via retrieval augmentation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online and Punta Cana, Dominican Republic, pp. 7683–7698. External Links: Link Cited by: 2nd item.
  • F. Petroni, T. Rocktäschel, S. Riedel, P. Lewis, A. Bakhtin, Y. Wu, and A. Miller (2019) Language models as knowledge bases?. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 2463–2473. External Links: Document, Link Cited by: §1, §5.
  • G. Qin and J. Eisner (2021) Learning how to ask: querying LMs with mixtures of soft prompts. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, pp. 5203–5212. External Links: Link, Document Cited by: §5.
  • Y. Qu, Y. Ding, J. Liu, K. Liu, R. Ren, W. X. Zhao, D. Dong, H. Wu, and H. Wang (2021) RocketQA: an optimized training approach to dense passage retrieval for open-domain question answering. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, pp. 5835–5847. External Links: Link, Document Cited by: §5.
  • C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21 (140), pp. 1–67. External Links: Link Cited by: §1.
  • N. Reimers and I. Gurevych (2019) Sentence-BERT: sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 3982–3992. External Links: Document, Link Cited by: §3.1.
  • S. Robertson and H. Zaragoza (2009) The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval 3, pp. 333–389. External Links: Document Cited by: §3.1, 3rd item.
  • C. Samarinas, W. Hsu, and M. L. Lee (2021) Improving evidence retrieval for automated explainable fact-checking. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Demonstrations, Online, pp. 84–91. External Links: Link, Document Cited by: §5.
  • T. Schick and H. Schütze (2021) Exploiting cloze-questions for few-shot text classification and natural language inference. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Online, pp. 255–269. External Links: Link, Document Cited by: §5.
  • T. Shin, Y. Razeghi, R. L. Logan IV, E. Wallace, and S. Singh (2020) AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 4222–4235. External Links: Link, Document Cited by: §5.
  • B. Wang and A. Komatsuzaki (2021) GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. Note: https://github.com/kingoflolz/mesh-transformer-jax Cited by: 1st item.
  • T. Wolfson, M. Geva, A. Gupta, M. Gardner, Y. Goldberg, D. Deutch, and J. Berant (2020) Break it down: a question understanding benchmark. Transactions of the Association for Computational Linguistics 8, pp. 183–198. External Links: Document, Link Cited by: §1, 1st item.
  • Z. Zhong, D. Friedman, and D. Chen (2021) Factual probing is [MASK]: learning vs. learning to recall. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, pp. 5017–5033. External Links: Link, Document Cited by: §5.

Appendix A Appendix

Tables 6, 7, and 8 provide more examples of cases where EPR is correct while CBR is incorrect, along with the top-3 prompts retrieved by each method.

Test Example
  Utterance: Remind me to add 2 dozen eggs to my grocery list.
  Meaning representation: [IN:CREATE_REMINDER [SL:PERSON_REMINDED me ] [SL:TODO add 2 dozen eggs to my grocery list ] ]
Top-1 (EPR)
  Utterance: Remind me to get two bottles of water.
  Meaning representation: [IN:CREATE_REMINDER [SL:PERSON_REMINDED me ] [SL:TODO get two bottles of water ] ]
Top-1 (CBR)
  Utterance: Please add a grocery list to my list of things to be reminded about doing today.
  Meaning representation: [IN:CREATE_REMINDER [SL:TODO a grocery list ] [SL:PERSON_REMINDED my ] [SL:DATE_TIME today ] ]
Top-2 (EPR)
  Utterance: Remind me to bring an extra pair of shoes to the river.
  Meaning representation: [IN:CREATE_REMINDER [SL:PERSON_REMINDED me ] [SL:TODO bring an extra pair of shoes to the river ] ]
Top-2 (CBR)
  Utterance: Remind me to make a grocery list
  Meaning representation: [IN:CREATE_REMINDER [SL:PERSON_REMINDED me ] [SL:TODO make a grocery list ] ]
Top-3 (EPR)
  Utterance: Remind me to add bottled water to grocery list.
  Meaning representation: [IN:CREATE_REMINDER [SL:PERSON_REMINDED me ] [SL:TODO add bottled water to grocery list ] ]
Top-3 (CBR)
  Utterance: I need to make a grocery list; will you remind me when I get off work at 5:00 p.m.?
  Meaning representation: [IN:CREATE_REMINDER [SL:TODO make a grocery list ] [SL:PERSON_REMINDED me ] [SL:DATE_TIME at 5 : 00 p.m . ] ]
Table 6: An example from the MTop development set where EPR is correct and CBR is incorrect, along with the top-3 training examples retrieved by each retriever.

Test Example
  Utterance: confirmed thanks
Top-1 (EPR)
  Utterance: it’s ok bye
  Meaning representation: (PleasantryAnythingElseCombined)
Top-1 (CBR)
  Utterance: Yes, but make sure to let me know the weather for that time.
  Meaning representation: (let (x0 (Execute (^(Dynamic) ConfirmAndReturnAction))) (do (Yield x0) (Yield (WeatherForEvent (^(Dynamic) item x0)))))
Top-2 (EPR)
  Utterance: It’s ok
  Meaning representation: (PleasantryAnythingElseCombined)
Top-2 (CBR)
  Utterance: Awesome, perfect
  Meaning representation: (Yield (Execute (^(Dynamic) ConfirmAndReturnAction)))
Top-3 (EPR)
  Utterance: It’s ok
  Meaning representation: (PleasantryAnythingElseCombined)
Top-3 (CBR)
  Utterance: Perfect…
  Meaning representation: (Yield (Execute (^(Dynamic) ConfirmAndReturnAction)))
Table 7: An example from the SMCalFlow development set where EPR is correct and CBR is incorrect, along with the top-3 training examples retrieved by each retriever.

Test Example
  Utterance: Create a meeting with David Crim today
  Meaning representation: (Yield (CreateCommitEventWrapper (CreatePreflightEventWrapper (& (Event.start_? (DateTime.date_? (?= (Today)))) (Event.attendees_? (AttendeeListHasRecipient (Execute (refer (extensionConstraint (RecipientWithNameLike (^(Recipient) EmptyStructConstraint) (PersonName.apply "David Crim")))))))))))
Top-1 (EPR)
  Utterance: make a meeting with jeri today
  Meaning representation: (Yield (CreateCommitEventWrapper (CreatePreflightEventWrapper (& (Event.start_? (DateTime.date_? (?= (Today)))) (Event.attendees_? (AttendeeListHasRecipient (Execute (refer (extensionConstraint (RecipientWithNameLike (^(Recipient) EmptyStructConstraint) (PersonName.apply "jeri")))))))))))
Top-1 (CBR)
  Utterance: set up a meeting with both of David Crim’s reports today
  Meaning representation: (Yield (CreateCommitEventWrapper (CreatePreflightEventWrapper (& (Event.start_? (DateTime.date_? (?= (Today)))) (Event.attendees_? (AttendeeListHasPeople (FindReports (Execute (refer (extensionConstraint (RecipientWithNameLike (^(Recipient) EmptyStructConstraint) (PersonName.apply "David Crim"))))))))))))
Top-2 (EPR)
  Utterance: put meeting with emlime on today
  Meaning representation: (Yield (CreateCommitEventWrapper (CreatePreflightEventWrapper (& (Event.start_? (DateTime.date_? (?= (Today)))) (Event.attendees_? (AttendeeListHasRecipient (Execute (refer (extensionConstraint (RecipientWithNameLike (^(Recipient) EmptyStructConstraint) (PersonName.apply "emlime")))))))))))
Top-2 (CBR)
  Utterance: Make a meeting with David Largenstop on the 24th.
  Meaning representation: (Yield (CreateCommitEventWrapper (CreatePreflightEventWrapper (& (Event.start_? (DateTime.date_? (?= (nextDayOfMonth (Today) 24L)))) (Event.attendees_? (AttendeeListHasRecipient (Execute (refer (extensionConstraint (RecipientWithNameLike (^(Recipient) EmptyStructConstraint) (PersonName.apply "David Largenstop")))))))))))
Top-3 (EPR)
  Utterance: I want meet Dr Kennady from today
  Meaning representation: (Yield (CreateCommitEventWrapper (CreatePreflightEventWrapper (& (Event.start_? (DateTime.date_? (?= (Today)))) (Event.attendees_? (AttendeeListHasRecipient (Execute (refer (extensionConstraint (RecipientWithNameLike (^(Recipient) EmptyStructConstraint) (PersonName.apply "Dr Kennady")))))))))))
Top-3 (CBR)
  Utterance: create a meet with bob today
  Meaning representation: (Yield (CreateCommitEventWrapper (CreatePreflightEventWrapper (& (Event.start_? (DateTime.date_? (?= (Today)))) (Event.attendees_? (AttendeeListHasRecipient (Execute (refer (extensionConstraint (RecipientWithNameLike (^(Recipient) EmptyStructConstraint) (PersonName.apply "bob")))))))))))
Table 8: An example from the SMCalFlow development set where EPR is correct and CBR is incorrect, along with the top-3 training examples retrieved by each retriever.