Interactive and Visual Prompt Engineering for Ad-hoc Task Adaptation with Large Language Models

by   Hendrik Strobelt, et al.

State-of-the-art neural language models can now be used to solve ad-hoc language tasks through zero-shot prompting without the need for supervised training. This approach has gained popularity in recent years, and researchers have demonstrated prompts that achieve strong accuracy on specific NLP tasks. However, finding a prompt for new tasks requires experimentation. Different prompt templates with different wording choices lead to significant accuracy differences. PromptIDE allows users to experiment with prompt variations, visualize prompt performance, and iteratively optimize prompts. We developed a workflow that allows users to first focus on model feedback using small data before moving on to a large data regime that allows empirical grounding of promising prompts using quantitative measures of the task. The tool then allows easy deployment of the newly created ad-hoc models. We demonstrate the utility of PromptIDE (demo at and our workflow using several real-world use cases.


page 1

page 6


A Deep Relevance Matching Model for Ad-hoc Retrieval

In recent years, deep neural networks have led to exciting breakthroughs...

Structured abbreviation expansion in context

Ad hoc abbreviations are commonly found in informal communication channe...

Pre-training for Ad-hoc Retrieval: Hyperlink is Also You Need

Designing pre-training objectives that more closely resemble the downstr...

SeeSaw: interactive ad-hoc search over image databases

As image datasets become ubiquitous, the problem of ad-hoc searches over...

View Composition Algebra for Ad Hoc Comparison

Comparison is a core task in visual analysis. Although there are numerou...

CathAI: Fully Automated Interpretation of Coronary Angiograms Using Neural Networks

Coronary heart disease (CHD) is the leading cause of adult death in the ...

Extending the View Composition Algebra to Hierarchical Data

Comparison is a core task in visual analysis. Although there are numerou...

1 Related Work

1.1 Prompting as an Interface

The use of large language models as a replacement for supervised learning was popularized by the GPT series of language models 

[29]. Prompting both in the zero-short and few-shot settings has been explored widely in NLP tasks [32, 15, 8, 43]. We consider only the case of human-readable prompts to large models, which contrasts with methods such as prompt-tuning [20, 16], which learns a continuous prompt embedding, and auto-prompting [33], which attempts to search for prompts from scratch, both of which require a training step. For more examples see a recent survey by Liu et al. [21]. Current prompt-based models are primarily based on Transformers [38], which has become the de-facto model architecture for NLP models; however, nothing about our visual analysis is specific to the model used or task considered, only the prompting interface.

Prompting as a means to interact with generative models is prevalent in online demos that often deploy Transformers as assistive agents. For instance, OpenAI released a (closed-source) API with a simple demo to interact with their proprietary GPT-3 model using text [25].111Many demos built around this API can be found at Another notable example is Github CoPilot [10], which uses OpenAI’s Codex [4], a GPT-like language model trained on code, to allow everyday programmers to turn natural language prompts into code through an IDE like VSCode. With this model, a handful of well-crafted natural language instructions can code entirely functional browser-based video games [23]. The flexibility and effectiveness of prompt-based approaches encourage the use of prompting as the preferred way to interact with powerful NLP models.

The flexibility of prompts as an interface also comes with a cost, as downstream performance is closely tied to prompt wording. Writing good prompts to extract the desired information from a model is usually left to trial and error by the user, with few exceptions. One tool that seeks to solve this problem is [46], a visual tool that uses the GPT-3 API to explore how a user-specified prompt template affects the behavior of GPT-3 on individual examples, in conversations, and with different generation parameters. However, its input space is limiting, allowing only one prompt template with limiting syntax to be tested at a time. PromptSource [2] increases the power of the templating language used to write general prompts. It also provides a platform for the community to create, evaluate, and explore new prompts. PromptIDE extends this work, providing a principled workflow to automate the time-intensive process of creating and evaluating many prompt templates on different models, datasets, and tasks.

1.2 Visualization for NLP

Visualization tools to interact with language models have grown increasingly popular alongside the rise of the models themselves. These tools can serve several functions: (1) to expose the internals of a particular architecture (e.g., Transformers, RNNs, LSTMs) to understand how it encodes knowledge of the language [11, 39, 1, 40, 34, 6, 19, 24, 12]; (2) to explore and compare the behavior of a model’s internal distributions during text generation [9, 35]; and (3) to understand how model behavior differs under controlled input or parameter changes [36, 26, 22]. PromptIDE is agnostic to the underlying language model and thus serves the functions of (2) and (3).

Other visualization tools that treat NLP models as black boxes focus on visualizations of output distributions and model performance with single custom input or static sets of inputs. For example, GLTR [9]

focuses on visualizing output probability distributions to support humans in detecting whether a provided text was generated by a model. LMdiff 

[35] extends these visualizations to support comparisons between different language models on user-provided inputs and static corpora. Neither of these provides an exploration of the input space. The Language Interpretability Tool (LIT) [36] is a comprehensive toolkit that enables rapid exploration and error analysis for a model on a larger input dataset. However, LIT’s comprehensive analyses are not conducive for rapid, iterative improvements of a prompt on general NLP tasks. NLIZE [22] serves as a debugging tool that evaluates how a language model outputs changes as a result of perturbations to its hidden state rather than its input. Unlike these existing tools, PromptIDE enables exploration and evaluation of the infinitely large space of possible input prompts.

2 Model: Prompting for NLP

Figure 1: A language model can generate text by ranking

all possible events (tokens) at each time step based on their probability. Often, only

top-k tokens are considered as generation output. In PromptIDE, only the answer ranks of defined answer choices are compared against each other.

Prompting is a paradigm for solving ad-hoc NLP tasks. We particularly focus on zero-shot prompting that assumes we do not have access to any training examples. Prompting is typically used in conjunction with large pre-trained language models [29, 3]. These models are powerful, but their size makes them difficult to train directly, which further encourages this style of zero-shot prompting.

Prompting assumes access to a large language model (LM) pre-trained on generic text. An LM is a probabilistic model over text. Given an input text , it gives the probability of output text

. The idea of the prompting technique has been facilitated by recent improvements in these models, primarily deriving from scaling Transformer neural networks 

[38]. Recently, researchers have trained LMs that are directly targeted for the end-use of prompting [31, 42]. These language models can all be queried in a standard way. In this work, we utilize three different LM queries (Figure 1):


- Sample a target output for a given input,

Ranking - Compare the rank score of different texts, where is a function based on [3], details in subsection 4.5),

Top-K - Find the highest probability outputs from the model,

We represent an NLP task as a table of examples, each associated with a fixed set of fields and a label. As a running example, we consider the task of document topic classification. For this task, there is one field, the article text, and one label, the topic of the article. The document text might consist of,

Authorities have halted oil export flows from the main pipeline in southern Iraq after intelligence showed a rebel militia could strike infrastructure, an oil official said on Saturday…

whereas the corresponding topic label would be World.

Contrast the standard ML approach for this task to prompting. A standard approach would collect supervised labels for the task and train a model on these examples. In zero-shot prompting, we do not have access to a training set. Prompting facilitates ad-hoc models by converting each test example directly to a form natural language input to which a large LM can respond. A user introduces a prompt template and a set of answer choices. The prompt template describes how to map the example fields to an input string , whereas the answer choices describe how to convert potentials outputs back to labels for the task.

We follow the work of PromptSource [2], where researchers introduced a format for describing prompt templates.

In which section of a newspaper would the text appear? {document}

with answer choices given as a dictionary of labels:

[World, Sports, Business, Science and Technology].

Utilizing the prompt template, we can construct an example prompt for model conditioning and answer choices . Each of the choices can then be ranked under the model to provide an evaluation score for the dataset.

We also support other task formats such as multiple-choice tasks that allow different answers for each example depending on specific fields in the data set.

{question} Choose between the following answers:
(A) { answerA } (B) { answerB } …

Here the answer choices could either be the corresponding letters or the answers themselves. We discuss this decision in Section 5.2.

Throughout, we assume a small set of labeled validation data for quantitative evaluation, which differs from the research on true zero-shot learning [27]. We note though that this is not enough data to attempt to automatically generate prompts directly. The main elements of prompting can be summarized as:

M1 - Prompt Template. A user writes a prompt template consisting of a task description in natural language that utilizes the fields from the task in a situated context. This leads to the construction of the input that is used for conditioning of the large LM.

M2 - Answer Choices. A user provides a dictionary of answer choices paired with the original labels for a given task that offer different possible output wordings to be considered by the model. The underlying model uses ranking to determine which of these answer choices to select. The original label paired with this answer choice is then the classification choice selected.

M3 - Evaluation. A user can evaluate the current version of the system under a known prompt for a set of validation data. This step will provide a proxy score for how well the given wording of the prompt is at capturing the underlying task.

3 Goals and Tasks

We held discussions with our NLP researchers on the team to determine the functionality for a minimal viable prototype. We imagined how our tool could enable our example personas journalist and analyst to work with prompting. In the following, we summarize the insights of these discussions.

While prompting is a promising approach, it is still too work-intensive for many use cases. The problem is that performance is highly dependent on the specific wording choices for templates (M1, M2), which is reflected in a high variance in accuracy (M3). For example, previous work has shown that different choices of prompts often lead to a more than 10-point spread in task accuracy between the best and worse choices at stage M1, even though both were approved by human editors [31].

Brute force approaches for prompt search require a user to write a large set of possible prompts and validate them empirically on a task. However, these approaches are both computationally expensive and slow. An interactive tool should provide an approach for using fewer resources and allow fast iterations for prompt engineering.

3.1 Goals

The high-level aim of PromptIDE is to provide a better approach for prompt development by domain experts in terms of four targeted goals:

G1 - Support a broad set of ad-hoc NLP tasks. It is essential that the system is generic, as the user may not know beforehand the nature of the end-user task. The tool should present a single interface for multiple different potential downstream tasks. In this way, a tool is not targeted specifically to model trainers but also to the end-user’s application goals.

G2 - Faster and more informed prompt writing through feedback from data. The process should let users target new language tasks that arrive during their projects. A tool should enable the user to develop prompts efficiently. It should also provide feedback on what effect prompt variations have. The goal is to make prompt search less automated by giving the user the human-in-the-loop ability to edit and construct prompts based on their domain knowledge.

G3 - Ground prompt choices in quantitative measures.

Prompt customization replaces training for ad-hoc systems, but it is still critical that choices be grounded in task evaluation metrics. A key element of interaction in the system is that task scores be directly available in the tool itself.

G4 - Ease deployment of models to end uses. The tool should provide a testbed for the wording and usage of the prompts, but for actual usage, the prompts must be able to be run and used on actual data cases. Our goal is for the user to be able to directly export the prompts written in the system to a full system for use on the final task itself.

3.2 Tasks

Given these goals, we identified a series of tasks that guided the development of PromptIDE:

Task 1 - Formulating and trying out prompts and prompt variations. To allow users to quickly explore and run different prompt templates for a specific NLP task, the tool provides an interactive interface for formulating and trying out many different prompts in a manner that provides feedback on a small set of real data examples. This interface is agnostic to the NLP task. [G1, G2]

Task 2 - Encoding prediction details of the model. Even after finding a good prompt template, it is critical to ensure that the answer choices provided ensure that the prompt is useful and leads to successful results on realistic NLP problems. The tool exposes the predictions of the model beyond ranking to allow for other choices that could be selected. [G2, G3]

Task 3 - Testing promising prompts on task performance. From Task 1 and Task 2, there are many combinations of prompt templates and answer choices to be considered, and each may be run on many different examples. Testing provides insight to the user on how choices for promising candidates lead to different resulting outcomes in a larger data regime and informs them which elements of their design have been successful. [G3]

Task 4 - Export prompt for concrete deployment. The end goal of the system is to provide packaged prompts that can be used in real tasks after the exploration phase has concluded. For this to be useful, the final system must collapse the exploration steps and provide a full prompt for deployment. [G4]

Figure 2: PromptIDE UI is organized like a notebook with foldable sections that follow the order of the main workflow but also allow quick iterations within a section or between neighboring sections.
Figure 3: Dataset navigation. Browsing through dataset samples and dataset schema lets the user get acquainted with the data. In this case, “label” refers to the index position within a list of potential labels.

4 Design

The visual interface of PromptIDE provides means to address the aforementioned tasks. With the text being the main carrier of information for the tasks, we designate the most screen real estate to text and interaction with text while introducing visual encodings for abstraction when useful.

At a high level, PromptIDE appears as a continuous notebook of four foldable sections (Figure 2) that lead to the following workflow: a dataset navigation section to select and browse data, a prompt variation section to broadly explore prompt variations, a prompt refinement section to help fine-tune a specific prompt, and a prompt testing section to explore results of testing on a larger scale.

4.1 The Four Sections of PromptIDE

The dataset navigation section (Figure 3) enables browsing and selection of a reference dataset needed for testing prompt templates against tasks T1–T3 and respective goals G1–G3. It provides access to many standard NLP data using the huggingface datasets[18]. In addition, a user can provide their own data as CSV or JSON files. While describing the dataset schema is syntactically sufficient, the tool enables browsing to help get a better understanding of the encoded semantics behind each data dimension. E.g., in Figure 3, the dataset glue/sst2 (Stanford Sentiment Treebank v2) is selected, and the schema indicates the presence of three fields: sentence, label, and idx. The naming of fields and the respective datatypes match the intuition of how a sentiment classification dataset could look like. But browsing shows that sentences can be very short and of low quality.

Figure 4: Prompt variation section. The user can quickly generate and progressively test variations of prompts to identify promising candidates. Each prompt variation is represented as a template card highlighting the values for template variables in unique colors and showing the number of correct answers vs. data tested.
Figure 5:

Prompt refinement section. To incorporate small optimizations for promising prompts, the user can test against a small subset of data frequently. For each data item, the evaluation chip indicates the prediction, the ground truth, if they match (green or gray), and the normalized distribution of probabilities as a bar chart on top.

The prompt variation section (Figure 4 and Interactive and Visual Prompt Engineering for Ad-hoc Task Adaptation with Large Language Models) allows formulating a prompt experiment for broad variations of prompt templates (M1 in section 2) using up to three template variables , and spanning their combinatorial space . The user can formulate a prompt template (M1) using dataset fields (e.g., {{text}}), template variables , and plain template text. The list of answer choices (M2) can be formulated as a static or dynamic list (e.g., World Sports Business). The correct answer (M3) can be dynamically retrieved from the dataset and the answer choices (e.g., answer_choices[label]).

After defining the space of all variations for , the model can now be asked to predict answers for all variations on a small set of data items. As soon as the experiment is started, each variation of is represented as a template card, highlighting the template variables by a preassigned color ( in red, in blue, in gold) and showing the sum of correctly evaluated samples as a bar chart. Then, progressively, the full set of variations is tested against the set of data items, adding results for one data item at a time. For each step in the progression, the order of template cards is updated, keeping the stack sorted by decreasing performance against the ad-hoc task. At any time, when the user has gathered sufficient evidence of what could be promising candidates, they can stop the progression and re-iterate before the experiment would have been finished. This procedure of iteratively formulating prompt variations and trying them out on a small dataset addresses task 1 (G1, G2).

The prompt refinement section (Figure 5) takes one of the template variations and enables quick iterations for fine-tuning this template. Using only one variant allows the user to try on one batch of data items at interactive rates (T1, G1, G2). For each iteration, a performance overview is shown in an evaluation chip that indicates if the task was evaluated successfully for a data item (green background) or not (grey). It shows the predicted answer in black and the correct answer in green. If they match, only one is shown. On top of each evaluation chip, a bar chart indicates the relative probability of all possible answers sorted by rank and normalized to maximum probability.

The green bar highlights the ground truth, and the leftmost bar indicates the current prediction. This encoding allows, e.g., insights if a wrongful prediction was close to being a coin flip (similar height for most left bar and green bar) or if a correct prediction was a good choice (significant difference between the leftmost bar in green and the second bar from the left), targeting task T2. Besides running quick iterations using a small data regime, the user can trigger an experiment of testing against a larger test set (T3).

Figure 6: Prompt testing section. Showing results of testing against a mid-sized subset of data to get quantitative measures to help answer questions: How well did my prompt perform against the task (a)? What did the model confuse using my prompt (b)? How could I potentially tweak the answer choices (c)?

The results of this testing are shown in the prompt testing section (Figure 6) of PromptIDE. After the model completes the testing against a larger dataset, the results are presented such that a user might be able to answer the questions: How well did my prompt perform against the task (T3, G3)? What did the model confuse using my prompt (T2, T3, G3)? How could I potentially tweak the answer choices (T3, G3)?

For all three questions, PromptIDE provides a visual encoding that can help find an answer. A stacked bar chart indicates the share of correct predictions vs. incorrect ones to indicate prompt performance at the highest abstraction level (Figure 6

(a)). If the answer choices allow, a confusion matrix shows across class scores ((

Figure 6(b)) and if the in-class performance was good (large values on the diagonal axis). If the answer choices are dynamic but still form groups, the tool shows the top ten most abundant ground truth groups in the confusion matrix. For datasets where the answer choices do not fall into groups, nothing is shown.

Finally, to help answer the question about potential answer choice tweaks, the tool records the top five ranked generation tokens for each data item independently. Then, for each group of answer choices, these tokens are accumulated, and the number of appearances in the top five is recorded. Additionally, the average rank they had per answer group is calculated (value between 1 and 5). The list of the most frequent and highest-ranked answer tokens is shown in Figure 6(c), with tokens sorted by decreasing appearance count. The green background indicates the best average rank in the group (which does not need to be the most appearing item). The use case in subsection 5.1 gives an example of the practical use of this feature. Similar to the confusion matrix, nothing is shown if the answer choices do not form groups.

4.2 Visual Encoding and Interactions

Across the foldable sections, a user can investigate more detail about the prompted data items using the detail stripes (Interactive and Visual Prompt Engineering for Ad-hoc Task Adaptation with Large Language Models). Each detail stripe consists of an upper part highlighting the answer options, the predicted answer (on top), and the ground truth answer (encoded in green). The panels below show details about the prompted text, the ground truth answer, and what the model has generated. Upon request, each ranked result can also show the probability of the respective answer option.

Detail stripes can be shown when tracking a specific template variation (Figure 4) in the prompt variation section. They can be unfolded beneath the performance chips in the prompt refinement section (Figure 5). They show detail about the respective subsets when the user clicks on the performance bars or the cells of the confusion matrix in the prompt testing section (Figure 6). Detail stripes target task T2 (G2).

Figure 7: Shopping carts allow users to collect their own prompt variants (left) or browse through prompts made available by the BigScience promptsource community (right).

To collect and store promising prompts, PromptIDE provides a shopping cart (Figure 7). From the cart, templates can be sent to the prompt variation section or the prompt refinement section. Templates can be exported from the cart to deploy them seamlessly with the tooling provided in the PromptIDE repository (T4, G4). To add prompts, a user can trigger buttons from the template cards in the prompt variation section (Figure 4) or the refinement section (Figure 5). If a prompt has been evaluated against the larger testing set, the performance result will be automatically added to the corresponding item in the cart for comparison. Furthermore, a read-only PromptSource shopping cart (Figure 5) shows templates that have been created for a specific dataset by the global community. These prompts could serve as inspiration for starting a user’s own prompt idea.

4.3 Example Interactive Workflow

A prototypical interaction workflow using PromptIDE starts with opening the data browser section to investigate the schema and concrete examples of own or globally available data. The user clicks on one item that serves as a good exemplar and is shown for reference in the prompt variations and prompt refinement sections. The user then opens the prompt variation section and writes down a prompt template using data and template variables. Alternatively, they open the prompt source cart and scan for examples that could serve as inspiration, copy them over, and add template variables. They run the experiment and stop it early because one prompt variation seems to perform very well. They use the shortcut to copy the specific prompt over to the refinement section, where, through multiple small edits and along multiple data portions, the prompt seems to be performing well. During this, the users observe the performance chips and occasionally the detail stripes. Later, the user triggers the larger-scale testing to see if they over-optimized the prompt for a local data range. After a short period of time, they note a high confusion between labels A and D. Inspired by the most common top five predictions, they iterate over their answer choices and run the test again, resulting in better performance. The prompt (and maybe some intermediate steps) are saved in the shopping cart. The best-performing prompt is exported to a JSON file. The user can now run the newly created ad-hoc model (original LM plus prompts in the JSON file) with a simple input-output interface or as a batch processing script on new data for their customers.

Figure 8: Early version of prompt variation testing. The interface is less powerful than the final version: only one template variable that can either be prefixed or appended to the prompt. Testing is only against one dataset item. Each answer option has its own color, which was not considered useful by early users.

4.4 Design Iterations and Rationale

While consulting with our co-authors, who are NLP domain experts, we went through multiple design iterations on different parts of the tool. In this section, we highlight some of them and provide a more in-depth design rationale for certain parts of PromptIDE.

The overall design as a notebook with foldable sections was inspired by the popular use of Jupyter notebooks in the NLP community. It allows occupying the screen with different views while keeping the views connected in a natural order. This allows the user to build a mental map that is established by scrolling up and down. If, instead, we had mapped the steps of the interactive workflow to independent views, the user might not be able to build this mental model due to the constant context switches. To assist with navigating to a specific subsection, the menu bar acts as a table of content that scroll-animates to the respective sections.

Figure 9: Use Case Document Classification for a news dataset. Details in subsection 5.1

Our use of progressive visualization methods [37] was driven by technical design considerations. We started thinking about what the modes of interaction during a progression would be. We decided to use progressive updates for the prompt variation testing because early stopping in this phase of the exploration has proven useful. For example, testing on data with just one label class may reveal local effects relatively quickly, in which case the user can stop the test. On the other hand, we did not use the early stopping interaction during prompt testing on a larger dataset. The goal for this part of the workflow is to observe more global effects by running the test to completion. Early stopping might increase the chance of observing a subset of the data that has local effects, negating the intention of larger-scale testing. Furthermore, the progression in the prompt variation section is controlled by the client and stops if the tool is closed. The larger-scale testing is controlled by the server and continues running even when PromptIDE is being closed on the client-side.

A fundamental technical design decision was which language to use for formulating the prompts with data and template variables (for T1). The three options for our decision were: (1) invent a new language, (2) use a templating language, or (3) use a general programming language. We quickly decided against inventing our own language that would have to be explained and maintained. The choice for using a templating language (Jinja) over a general programming language (Python) was based on the observation that PromptSource [31] used the same language. This made the PromptSource parsing work immediately available to us and also allowed us to build the read-only shopping cart with prompt examples gathered by the partner project. Another concern was that a general programming language could be a prime target for malicious server attacks once the tool is released to the public.

The detail stripes underwent several design iterations. An early version assigned a color to each answer choice (Figure 8). But our domain experts did not express interest in being able to track ranks of answer choices across data items and found it more distracting than useful. The switch to a simpler color encoding was also enforced by the required capability to allow dynamic answer options that change per data item and repeat a few times. This would have required an impractically large number of categorical colors.

4.5 Implementation

Our PromptIDE prototype consists of a backend in Python that communicates with a frontend written in Typescript and Svelte. We use the openly available pre-trained T0-3B large language model [31] provided through the huggingface platform 222 For long-running queries, the backend provides a custom-built queuing and execution system that keeps the memory footprint for the model low. In our implementation, answer options are ranked by the decreasing average log-likelihood:

with being the probability for the -th token of answer and being the answer’s token length.

The demo system is available at

. We will make the source code available upon acceptance of the paper. The open-source version allows easy use of custom data that is either provided as a CSV file or by the Huggingface dataset interface.

5 Use Cases

We illustrate how we can use PromptIDE to interactively prompt a diversity of NLP tasks, compare these prompts, save them and export them outside of PromptIDE. We experiment with a range of standard tasks in NLP, including document classification, reading comprehension, and natural language inference. The tool enables seamless development of prompts for a wide variety of tasks and formats [G1] while quickly providing feedback on prompts patterns that generalize to many instances [G2].

5.1 Document Classification

The most common end-user task for NLP [17] is document classification for an end-domain. The task of document classification is to determine the label of a document from a fixed set. It can be used in ad-hoc models for filtering a large set of documents or collecting statistics about a large collection. Domain expertise is critical in classification since the specific wording may lead to different results.

As an example use case we consider a prototypical version of this task with the goal of classifying the topic of a document. The AG News dataset 

[44] is often used for benchmarking this task and consists of text documents and labels that indicate which topic they are associated with (labels are canonically listed as “World, Sports, Business, Sci/Tech”). This task (introduced in section 2) is representative of an ad-hoc classification task that a user might consider for prompt development with a language model [G1].

We can first explore this dataset through the dataset navigation section and then initialize the process through the prompt variation section. This section allows the user to write several different prompt templates [T1] as well as answer templates for these prompts [T2]. Interactive and Visual Prompt Engineering for Ad-hoc Task Adaptation with Large Language Models (a) shows some examples of these templates and the way they use fields from the underlying data set and template variables q1 and q2. We select one of the better-performing prompts to investigate further in the prompt refinement section.

Figure 9(a) shows the output of the selected prompt for some data, and we observe that there is some confusion about the labels. Upon detailed inspection (see example in Figure 9(b)), we observe that the labels can be ambiguous even under human evaluation. So, we send the prompt to larger-scale testing [T3,G3].

The testing reveals the specific problem. The ground truth label “Technology” is not a great choice and gets confused with other phrases (Figure 9(c)). Upon inspection of the Top 5 Rank Predictions, we observe that the token “Science” is a very frequent wording for this ground truth, even more than the term “Technology” itself. We use this insight to refine the answer options in the refinement section by substituting “Technology” with “Science”. In this case, the wording of the answer changed, but other feedback would lead to changes in the wording of the prompt itself.

We run the testing again. After receiving the results, we see that the modification increased performance substantially (Figure 9(d)). We can now go to the shopping cart and export the new model [T4,G4].

5.2 Multiple-Choice Answering

The task of reading comprehension is to answer a question from a complex document. It can be used for ad-hoc models to find documents that provide answers to specific questions. The RACE dataset [14] is a multiple-choice reading comprehension dataset built from English examinations used for benchmarking this task. For a given text extract and a question about that text, the model has to choose the correct answer among four possibilities. Unlike document classification, the possibilities are different for each sample.

We explore prompts that introduce four answer choices, associating them with letters (A, B, C, and D): “Possible answers:”, “Choose between A, B, C and D:”, or nothing. Figure 10 shows that “Choose between A, B, C, and D:” consistently gives worse results than the two other variations, independently of how the input is introduced (q1). We discard this variation, and following [42], we try out prompts that explicitly state the instruction at the beginning of the prompt. We note that even though the performance remains the same for most of the prompt variations, some variations degrade the performance (for instance “Please select the correct answer among all of the options.”).

We select one of these prompts for refining and test it on 100 examples. From the confusion matrix and the most common top 5 predictions (Figure 11), we notice that the model is often predicting “E” even though it is not among the answer choices and that it tends to predict “C”, as it is always the second most frequent prediction when it is not the correct label. This observation hints at potential class imbalances in the training mixture of the underlying model and warrants more investigation. We find that the training mixture contains a variety of multiple-choice question answering datasets that contain from 3 (COS-E [30]) to 8 (QASC [13]) answer choices, which leads to the high frequency of labels A, B, and C.

Figure 10: Mapping dynamic answer choices to simple outputs such as letters can help the model by simplifying its output space.
Figure 11: The label “C” is systematically the second most frequent prediction of the model, which hints at a potential class imbalance in the training set of the underlying model.

The interactive nature of PromptIDE makes it easy to develop a lot of prompts, save them and export them to a different environment. Once the best prompt has been identified, the end-user can use them outside of PromptIDE, for instance, to deploy a prompted language model in production [G4]. The JSON export format makes the whole interactive development extremely versatile and compatible with most of the standard codebase.

5.3 Sentence Similarity

Figure 12:

Use case Natural Language Inference on RTE dataset 

[5]. Details in subsection 5.3
Figure 13: Using PromptIDE to study adversarial conditions by using misleading-extreme prompts from [41]. Details in subsection 5.3

The task of natural language inference is to determine the semantic relationship between two similar sentences. An ad-hoc domain expert would use a similar task to check whether documents agree with or contradict specific target statements. Recognizing Textual Entailment [5] is an inference dataset where a model is asked to assume that one piece of text (the “premise”) is true and to classify whether another piece of text (the “hypothesis”) must also be true, i.e., whether the premise “entails” the hypothesis. For example, if the premise is Steve Jobs was attacked by Sculley and other Apple executives for not delivering enough hot new products and resigned from the company a few weeks later. and the hypothesis states that Steve Jobs worked for Apple., then the data is labeled as “true” or “entailment”.

Past work [41] has used prompting for this task but found that it was difficult to find a prompt wording that performed robustly on different examples. Starting from the prompts in [41]

we can use PromptIDE for estimating the models’ robustness to different wordings in the templates. In particular, the prompt variable variants

allow us to easily control and test wording variations.

To start finding good prompts for RTE, we copy and paste the main instructive variations of [5] into q1. In previewing 20 examples, we see T0-3B’s performance across prompts ranges from a respectable 70% to 50% (Figure 12a). We add the best and the worst prompts to carts for further investigation under the “Refine Prompt” section. From the confusion matrix (Figure 12b), see that most mistakes are models under-predicting entailment for sentence pairs where the premises do, in fact, entail their hypotheses (i.e., ground truth is “Yes”).

Another noteworthy finding from the “Most Common Top 5 Predictions” is that, for the entailment class, the model generates “No” with a higher rank more frequently than “Yes” (Figure 12c), yet it also generates “yes” and “True” with higher rank more frequently than “no” and “False” (Figure 12d). Similar to subsection 5.1, we could simply change the desired target words for a performance boost. But we can also study the effect of explicitly providing the answer choices in the input sequence by appending “True or false?” to every template, which further improves performance for all templates. However, explicitly providing the answer choices in input sequences does not always control model behaviors in accord with human intuition. Prepending the answer choices at the beginning of input sequences does not confer any performance boost.

Finally, PromptIDE is also prime for studying adversarial conditions such as irrelevant and misleading prompts. For example, using the misleading-extreme prompts from [41], Figure 13a shows that “is this grammatically correct” outperforms just instructive prompts such as “does this imply that”. However, if we remove the question marks in the global template, then the misleading and irrelevant prompts consistently underperform the instructive ones (Figure 13

b), suggesting that models may be using question marks as some kind of heuristic feature.

6 Early Feedback and Lessons Learned

After internal deployment within the prompting expert group we received initial qualitative feedback from two colleagues. When being asked about most helpful PromptIDE features, they answered: “visualizing the best verbalizer in the confusion matrix form and the Most Common Top 5 Predictions format. Furthermore, it’s really helpful to preview the prompt variations and their performances and use refine to copy to the Prompt Refine section.”

Main criticism came from the limitation of prompt variant variables: “In agnews dataset, when I add q3 and q4, the Prompt variable variants field doesn’t have q4. …we could have some indicator of maximum prompts allowed.” Interestingly, there where also recommendations about UI improvements: “I am confused about the use of red underlines …I would prefer to use something other than red because red indicates something is wrong.” or simplifications like “I think using some ‘i’ popup icon for storing some information (with more descriptions) is much better than outright displaying technical texts like (frequency 5 avrg rank 5).”

The lesson we learned from working on this project are manifold with some of them are already indicated in subsection 4.4. The surprise to us was how much constructive feedback we got about the user interface during the development and after deployment, often started with an apology about being a non expert in UI. We recommend to establishing early on that all UI feedback is very welcome and not a overstepping in competences.

A constant fight for compromising between flexibility vs simplicity became a major task. E.g., determining how many prompt variation variables we should support (q1, q2, q3) vs how many can we handle with combinatorical explosion of variants.

Another lesson learned was that the more complex charts tend to be ignored whilst the simple charts with complex algorithmic ideas got much faster appreciated and accepted. E.g., the most common “Aha!” moment was caused by top 5 tokens idea in the prompt testing section.

7 Conclusions and Future Work

We present PromptIDE, a system for domain experts to customize models for ad-hoc tasks without requiring training expertise. The system adopts the prompting framework that has recently been developed for NLP tasks while developing an interactive visualization environment for customizing prompts for new tasks. The approach extends beyond brute-force prompt trial-and-error to facilitate the exploration of prompt language and the development of new prompt templates and answer choices. The system is open-source and can work with any available language model backend.

As the methodology of prompting develops, there are many areas of extension for PromptIDE. Currently, the tool supports tasks with a known set of choices, but there are many NLP tasks where the correct label may be a free-form response. These could be incorporated through support for new metrics such as BLEU in the prompt refinement section. PromptIDE also assumes that the user is able to deduce how to update prompts based on task metrics. Ideally, the system could provide direct advice on how to change prompts or highlight text spans that would lead to better results through methods like gradient saliency. These methods are currently too computationally expensive to run on large promptable models, but will likely improve with more research.

8 Acknowledgements

We would like to thank the anonymous reviewers for their constructive feedback and helpful comments. This work was partially funded through NSF grant IIS-1901030 and the MIT-IBM Watson AI Lab.


  • [1] J. Alammar. Interfaces for explaining transformer language models., 2020.
  • [2] S. H. Bach, V. Sanh, Z.-X. Yong, A. Webson, C. Raffel, N. V. Nayak, A. Sharma, T. Kim, M. S. Bari, T. Fevry, et al. Promptsource: An integrated development environment and repository for natural language prompts. arXiv preprint arXiv:2202.01279, 2022.
  • [3] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei. Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, eds., Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
  • [4] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
  • [5] I. Dagan, O. Glickman, and B. Magnini. The pascal recognising textual entailment challenge. In Machine Learning Challenges Workshop, pp. 177–190. Springer, 2006.
  • [6] J. F. DeRose, J. Wang, and M. Berger. Attention flows: Analyzing and comparing attention mechanisms in language models. IEEE Transactions on Visualization and Computer Graphics, 27(2):1160–1170, 2020.
  • [7] I. El Maarouf, Y. Mansar, V. Mouilleron, and D. Valsamou-Stanislawski. The finsim 2020 shared task: Learning semantic representations for the financial domain. In Proceedings of the Second Workshop on Financial Technology and Natural Language Processing, pp. 81–86, 2020.
  • [8] T. Gao, A. Fisch, and D. Chen. Making pre-trained language models better few-shot learners. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 3816–3830. Association for Computational Linguistics, Online, Aug. 2021. doi: 10 . 18653/v1/2021 . acl-long . 295
  • [9] S. Gehrmann, H. Strobelt, and A. Rush. GLTR: Statistical detection and visualization of generated text. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 111–116. Association for Computational Linguistics, Florence, Italy, July 2019. doi: 10 . 18653/v1/P19-3019
  • [10] Github. Github copilot · your ai pair programmer.
  • [11] B. Hoover, H. Strobelt, and S. Gehrmann. exBERT: A Visual Analysis Tool to Explore Learned Representations in Transformer Models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 187–196. Association for Computational Linguistics, Online, July 2020. doi: 10 . 18653/v1/2020 . acl-demos . 22
  • [12] T. Jaunet, C. Kervadec, R. Vuillemot, G. Antipov, M. Baccouche, and C. Wolf. Visqa: X-raying vision and language reasoning in transformers. IEEE Transactions on Visualization and Computer Graphics, 28(1):976–986, 2021.
  • [13] T. Khot, P. Clark, M. Guerquin, P. Jansen, and A. Sabharwal. Qasc: A dataset for question answering via sentence composition. arXiv:1910.11473v2, 2020.
  • [14] G. Lai, Q. Xie, H. Liu, Y. Yang, and E. Hovy. RACE: Large-scale ReAding comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 785–794. Association for Computational Linguistics, Copenhagen, Denmark, Sept. 2017. doi: 10 . 18653/v1/D17-1082
  • [15] T. Le Scao and A. Rush. How many data points is a prompt worth? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2627–2636. Association for Computational Linguistics, Online, June 2021. doi: 10 . 18653/v1/2021 . naacl-main . 208
  • [16] B. Lester, R. Al-Rfou, and N. Constant. The power of scale for parameter-efficient prompt tuning. In M. Moens, X. Huang, L. Specia, and S. W. Yih, eds., Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pp. 3045–3059. Association for Computational Linguistics, 2021. doi: 10 . 18653/v1/2021 . emnlp-main . 243
  • [17] Q. Lhoest, A. V. del Moral, Y. Jernite, A. Thakur, P. von Platen, S. Patil, J. Chaumond, M. Drame, J. Plu, L. Tunstall, J. Davison, M. Sasko, G. Chhablani, B. Malik, S. Brandeis, T. L. Scao, V. Sanh, C. Xu, N. Patry, A. McMillan-Major, P. Schmid, S. Gugger, C. Delangue, T. Matussière, L. Debut, S. Bekman, P. Cistac, T. Goehringer, V. Mustar, F. Lagunas, A. M. Rush, and T. Wolf. Datasets: A community library for natural language processing. In H. Adel and S. Shi, eds., Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, EMNLP 2021, Online and Punta Cana, Dominican Republic, 7-11 November, 2021, pp. 175–184. Association for Computational Linguistics, 2021. doi: 10 . 18653/v1/2021 . emnlp-demo . 21
  • [18] Q. Lhoest, A. Villanova del Moral, Y. Jernite, A. Thakur, P. von Platen, S. Patil, J. Chaumond, M. Drame, J. Plu, L. Tunstall, J. Davison, M. Šaško, G. Chhablani, B. Malik, S. Brandeis, T. Le Scao, V. Sanh, C. Xu, N. Patry, A. McMillan-Major, P. Schmid, S. Gugger, C. Delangue, T. Matussière, L. Debut, S. Bekman, P. Cistac, T. Goehringer, V. Mustar, F. Lagunas, A. Rush, and T. Wolf. Datasets: A community library for natural language processing. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 175–184. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, Nov. 2021. doi: 10 . 18653/v1/2021 . emnlp-demo . 21
  • [19] R. Li, W. Xiao, L. Wang, H. Jang, and G. Carenini. T3-vis: visual analytic for training and fine-tuning transformers in NLP. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 220–230. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, Nov. 2021. doi: 10 . 18653/v1/2021 . emnlp-demo . 26
  • [20] X. L. Li and P. Liang. Prefix-tuning: Optimizing continuous prompts for generation. In C. Zong, F. Xia, W. Li, and R. Navigli, eds., Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pp. 4582–4597. Association for Computational Linguistics, 2021. doi: 10 . 18653/v1/2021 . acl-long . 353
  • [21] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. CoRR, abs/2107.13586, 2021.
  • [22] S. Liu, Z. Li, T. Li, V. Srikumar, V. Pascucci, and P.-T. Bremer. Nlize: A perturbation-driven visual interrogation tool for analyzing and interpreting natural language inference models. IEEE transactions on visualization and computer graphics, 25(1):651–660, 2018.
  • [23] A. Mayne. Building games and apps entirely through natural language using openai’s code-davinci model. Andrew Mayne Blog, Mar 2022.
  • [24] Y. Ming, S. Cao, R. Zhang, Z. Li, Y. Chen, Y. Song, and H. Qu.

    Understanding hidden memories of recurrent neural networks.

    In 2017 IEEE Conference on Visual Analytics Science and Technology (VAST), pp. 13–24. IEEE, 2017.
  • [25] Open AI. OpenAI playground.
  • [26] A. Pearce. What have language models learned? VISxAI Workshop 2021,, Jul 2021.
  • [27] E. Perez, D. Kiela, and K. Cho. True few-shot learning with language models. NeurIPS, 2021.
  • [28] O. Popescu and C. Strapparava. Natural language processing meets journalism. In Proceedings of the 2017 EMNLP Workshop. Copenhagen, Denmark: Association for Computational Linguistics.–2017, 2017.
  • [29] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  • [30] N. F. Rajani, B. McCann, C. Xiong, and R. Socher. Explain yourself! leveraging language models for commonsense reasoning. In Proceedings of the 2019 Conference of the Association for Computational Linguistics (ACL2019), 2019.
  • [31] V. Sanh, A. Webson, C. Raffel, S. H. Bach, L. Sutawika, Z. Alyafeai, A. Chaffin, A. Stiegler, T. L. Scao, A. Raja, M. Dey, M. S. Bari, C. Xu, U. Thakker, S. S. Sharma, E. Szczechla, T. Kim, G. Chhablani, N. Nayak, D. Datta, J. Chang, M. T.-J. Jiang, H. Wang, M. Manica, S. Shen, Z. X. Yong, H. Pandey, R. Bawden, T. Wang, T. Neeraj, J. Rozen, A. Sharma, A. Santilli, T. Fevry, J. A. Fries, R. Teehan, S. Biderman, L. Gao, T. Bers, T. Wolf, and A. M. Rush. Multitask prompted training enables zero-shot task generalization, 2021.
  • [32] T. Schick and H. Schütze. It’s not just size that matters: Small language models are also few-shot learners. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2339–2352. Association for Computational Linguistics, Online, June 2021. doi: 10 . 18653/v1/2021 . naacl-main . 185
  • [33] T. Shin, Y. Razeghi, R. L. L. IV, E. Wallace, and S. Singh. Autoprompt: Eliciting knowledge from language models with automatically generated prompts. In B. Webber, T. Cohn, Y. He, and Y. Liu, eds., Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pp. 4222–4235. Association for Computational Linguistics, 2020. doi: 10 . 18653/v1/2020 . emnlp-main . 346
  • [34] H. Strobelt, S. Gehrmann, H. Pfister, and A. M. Rush.

    Lstmvis: A tool for visual analysis of hidden state dynamics in recurrent neural networks.

    IEEE transactions on visualization and computer graphics, 24(1):667–676, 2017.
  • [35] H. Strobelt, B. Hoover, A. Satyanaryan, and S. Gehrmann. LMdiff: A visual diff tool to compare language models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 96–105. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, Nov. 2021. doi: 10 . 18653/v1/2021 . emnlp-demo . 12
  • [36] I. Tenney, J. Wexler, J. Bastings, T. Bolukbasi, A. Coenen, S. Gehrmann, E. Jiang, M. Pushkarna, C. Radebaugh, E. Reif, and A. Yuan. The language interpretability tool: Extensible, interactive visualizations and analysis for NLP models. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 107–118, 2020.
  • [37] C. Turkay, N. Pezzotti, C. Binnig, H. Strobelt, B. Hammer, D. A. Keim, J.-D. Fekete, T. Palpanas, Y. Wang, and F. Rusu.

    Progressive data science: Potential and challenges, 2018. doi: 10 . 48550/ARXIV . 1812 . 08032

  • [38] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • [39] J. Vig. A multiscale visualization of attention in the transformer model. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 37–42. Association for Computational Linguistics, Florence, Italy, July 2019. doi: 10 . 18653/v1/P19-3007
  • [40] Z. J. Wang, R. Turko, and D. H. Chau. Dodrio: Exploring transformer models with interactive visualization. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations, pp. 132–141. Association for Computational Linguistics, Online, Aug. 2021. doi: 10 . 18653/v1/2021 . acl-demo . 16
  • [41] A. Webson and E. Pavlick. Do prompt-based models really understand the meaning of their prompts? ArXiv, abs/2109.01247, 2021.
  • [42] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le. Finetuned language models are zero-shot learners. CoRR, abs/2109.01652, 2021.
  • [43] J. Wei, X. Wang, D. Schuurmans, M. Bosma, E. Chi, Q. Le, and D. Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022.
  • [44] X. Zhang, J. J. Zhao, and Y. LeCun. Character-level convolutional networks for text classification. In NIPS, 2015.
  • [45] T. Z. Zhao, E. Wallace, S. Feng, D. Klein, and S. Singh. Calibrate before use: Improving few-shot performance of language models. CoRR, abs/2102.09690, 2021.
  • [46] S. Zhidkov. Advanced GPT-3 playground.