Teaching Models new APIs: Domain-Agnostic Simulators for Task Oriented Dialogue

by Moya Chen, et al.

We demonstrate that large language models are able to simulate Task Oriented Dialogues in novel domains, provided only with an API implementation and a list of goals. We show these simulations can formulate online, automatic metrics that correlate well with human evaluations. Furthermore, by checking whether the User's goals are met, we can use simulation to repeatedly generate training data and improve the quality of the simulations themselves. With no human intervention or domain-specific training data, our simulations bootstrap end-to-end models which achieve a 37% error reduction in previously unseen domains. By including as few as 32 domain-specific conversations, bootstrapped models can match the performance of a fully-supervised model with 10× more data. To our knowledge, this is the first time simulations have been shown to be effective at bootstrapping models without explicitly requiring any domain-specific training data, rule-engineering, or humans-in-the-loop.




1 Introduction

Virtual Assistants have become ubiquitous in modern life Acharya et al. (2021). However, building these Task Oriented Dialogue (TOD) systems is laborious, requiring significant data collection and engineering resources to add support for a novel domain. As such, methods which can generalize, learn from limited examples, and require fewer engineering resources are highly desirable Shi et al. (2019); Shah et al. (2018); Acharya et al. (2021).

Figure 1: Illustration of our Simulation system. Initially, the User model is given a Goal API call and the Assistant is optionally provided an API Schema as grounding. Both models engage in dialogue until the User terminates the conversation. A dialogue is successful if the Assistant generates the correct API call by the end of the dialogue. In-arrows designate inputs to an entity; out-arrows designate what it generates.

To this end, prior work has identified User Simulators, in which a model emulates a human user in place of a real one, as a means of addressing these problems. User Simulators have been used to evaluate Assistant models Walker et al. (1997, 2000); Schatzmann et al. (2005) and to improve them by providing additional training data Shah et al. (2018); Acharya et al. (2021) and reward signals for Reinforcement Learning methods Fazel-Zarandi et al. (2017); Su et al. (2018); Shi et al. (2019). Typically, these User Simulators are either limited to enhancing existing domains Fazel-Zarandi et al. (2017) or utilize specialized and manually engineered rules or templates for novel domains Shah et al. (2018); Shi et al. (2019). User Simulators have often required post-hoc human intervention to ensure quality Shah et al. (2018).

In the first part of this work, we show that modern Large Language Models Radford et al. (2018, 2019); Lewis et al. (2020) are capable of reasonably generating dialogues when equipped with an API implementation and a desired User goal. We observe the quality of these dialogues increases with the quality of the models. Furthermore, we observe that simulation success is a strong discriminator of Assistant performance and dialogue quality.

In the second part of this work, we describe a method for bootstrapping User and Assistant models for previously unseen dialogue domains. We use Task Success, which can be automatically measured in fully synthetic dialogues, to discriminate between high- and low-quality dialogues. By adding successful dialogues back into the training set, retraining the model, and repeating this procedure, we bootstrap an Assistant model without the use of any domain-specific training data, hand-engineered rules, Natural Language templates, or humans-in-the-loop. Our methodology shows improvements in both zero-shot and full-shot settings.

Furthermore, we show that Task Success can be used to automatically identify the weakest skills of our model, and employ Active Learning Tur et al. (2003); Olsson (2009) to enhance performance. By additionally including as few as 32 domain-specific training examples, we can match the performance of a fully-supervised baseline provided with 10× more data.

We open source our simulation infrastructure as part of the ParlAI framework Miller et al. (2017). (Footnote 1: https://parl.ai/projects/tod_simulator/)

2 End-to-End TOD Simulators

A high-level illustration of our simulation system is shown in Figure 1. Our simulation system consists of three main components: a User model, an Assistant model, and an API Implementation. A conversation consists of repeated turns between the User, the Assistant, and the API Implementation. Although traditional TOD systems model conversations with a combination of intent detection, belief state tracking, and policy Jurafsky and Martin (2009), we employ a more modern setup Rastogi et al. (2019) where the Assistant must both make API calls at the right time and translate returned API responses into Natural Language utterances for the User. This formulation is particularly amenable to modern End-to-End (E2E) approaches based on pretrained Language Models Ham et al. (2020); Peng et al. (2020); Hosseini-Asl et al. (2020).

In order to guide the conversation, the User is given a Goal as its first turn. This Goal consists of a complete API call (e.g. intent, slot names, and slot values) serialized as a string. This is explicitly not shown to the Assistant model, to prevent the Assistant from seeing which API call it must make. In each following turn, the User provides a natural language utterance to the Assistant. The Assistant optionally generates a serialized API call string and sends it to the API Implementation. If a successful API call is made, the API Implementation returns a serialized API response to the Assistant; a sentinel value is returned for any failed calls. The Assistant generates a natural language utterance back to User after receiving an API response.

A conversation continues in this repeated fashion, with turns of User utterance → Assistant API call → API Implementation response → Assistant utterance, until the User generates a ‘[DONE]’ token or a maximum number of turns is exceeded. A conversation is said to be successful if the Assistant generates the correct API call (i.e. one equivalent to the Goal given to the User) before the end of the conversation. In order to have a successful conversation, we expect the User to ground its generations on the Goal, and the Assistant to ground its generations on the API response. We will later show that a simulation’s Task Success Rate (TSR), or its success averaged over a large number of goals, acts as a strong proxy for the quality of dialogues generated Schatzmann et al. (2005).
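The turn structure described above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the ParlAI implementation: the function names, the `[FAILED]` sentinel, the goal serialization format, and the use of exact string equality as a stand-in for "equivalent to the Goal" are all hypothetical.

```python
from typing import Callable, List, Optional, Tuple

DONE = "[DONE]"            # token the User emits to end the conversation
API_FAILURE = "[FAILED]"   # hypothetical sentinel for a failed API call


def simulate(
    goal: str,
    user: Callable[[List[str]], str],
    assistant_call: Callable[[List[str]], Optional[str]],
    assistant_reply: Callable[[List[str], str], str],
    api: Callable[[str], str],
    max_turns: int = 10,
) -> Tuple[bool, List[str]]:
    """Run one simulated dialogue; return (task success, visible transcript)."""
    user_view = [goal]                 # the Goal is shown to the User only
    assistant_view: List[str] = []
    transcript: List[str] = []
    success = False
    for _ in range(max_turns):
        utterance = user(user_view)
        if DONE in utterance:
            break
        for view in (user_view, assistant_view, transcript):
            view.append(utterance)
        call = assistant_call(assistant_view)   # None means no API call this turn
        response = ""
        if call is not None:
            response = api(call)
            # Simplification: equivalence to the Goal is exact string equality.
            success = success or call == goal
        reply = assistant_reply(assistant_view, response)
        for view in (user_view, assistant_view, transcript):
            view.append(reply)
    return success, transcript


# Toy scripted agents standing in for trained models.
GOAL = "api_name = FindBus ; from = Boston ; to = NYC"


def toy_user(view: List[str]) -> str:
    return "I need a bus from Boston to NYC." if len(view) == 1 else DONE


def toy_assistant_call(view: List[str]) -> Optional[str]:
    return GOAL if "bus" in view[-1] else None


def toy_assistant_reply(view: List[str], response: str) -> str:
    if response in ("", API_FAILURE):
        return "Sorry, I could not complete that."
    return f"I found: {response}"


def toy_api(call: str) -> str:
    return "departure = 9am" if call == GOAL else API_FAILURE


success, transcript = simulate(
    GOAL, toy_user, toy_assistant_call, toy_assistant_reply, toy_api
)
```

Keeping separate views for the User and the Assistant makes it explicit that the Goal never enters the Assistant's context.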

In addition to Goals being shown to the User agent, our system optionally allows the Assistant to be shown an API Schema on the first turn. An API Schema consists of the signature of the Goal (e.g. intent and slot names) without any realized values. As we will later see, Schemas (combined with Task Success) enable us to bootstrap models in unseen domains, but Schemas are not used during any evaluations: doing so would leak the intent of the User and reduce the difficulty of the task.
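As a rough illustration of the difference between a Goal and an API Schema, the sketch below serializes both from the same call. The `api_name = ... ; slot = value` string format and both function names are assumptions made for this example; the source does not specify the serialization.

```python
from typing import Dict, List


def serialize_goal(intent: str, slots: Dict[str, str]) -> str:
    """Serialize a complete API call: intent, slot names, and slot values."""
    parts = [f"api_name = {intent}"]
    parts += [f"{name} = {value}" for name, value in sorted(slots.items())]
    return " ; ".join(parts)


def serialize_schema(intent: str, slot_names: List[str]) -> str:
    """Serialize only the call signature: intent and slot names, no values."""
    return " ; ".join([f"api_name = {intent}"] + sorted(slot_names))


slots = {"to": "NYC", "from": "Boston"}
goal = serialize_goal("FindBus", slots)          # shown to the User
schema = serialize_schema("FindBus", list(slots))  # optionally shown to the Assistant
```

The Schema reveals which intent and slots are in play but none of the values, which is why it still leaks the User's intent.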

3 Related Work

User Simulators have a long and successful history, having been utilized widely for both evaluation Walker et al. (1997); Schatzmann et al. (2005); Ai and Weng (2008); Jung et al. (2009); Crook and Marin (2017), and Assistant improvements Li et al. (2016); Su et al. (2018); Shi et al. (2019). Their formulations have varied, including Rule-based Shi et al. (2019), Agenda-based Schatzmann et al. (2007); Shah et al. (2018), and End-to-End Asri et al. (2016); Crook and Marin (2017); Shi et al. (2019) approaches.

A wide range of works have explored using User Simulators as a proxy for Assistant evaluation Schatzmann et al. (2005) or for predicting real user satisfaction Walker et al. (2000); Ai and Weng (2008); Jung et al. (2009); Li et al. (2016); Crook and Marin (2017). Performance of simulators is often measured either via Task Success Rate Gür et al. (2018); Kreyssig et al. (2018) or by cross-examining them against a wide variety of Assistant systems Schatzmann et al. (2005). Previously, many have explored the use of Language Models (LMs) as User Simulators, but observed that models had poor adherence to goals Georgila et al. (2006); Crook and Marin (2017). Accordingly, most simulators have been Agenda-based simulators, which follow templates and agendas to fulfill their goals Schatzmann et al. (2007); Shah et al. (2018). In contrast, we find that modern pretrained Language Models are able to ground on and follow goals very well (Section 4).

Engaging User Simulators and Assistant systems in synthetically generated dialogues leads to a natural reward function, and many works have used simulators in order to optimize a Reinforcement Learning policy Schatzmann et al. (2007); Fazel-Zarandi et al. (2017); Peng et al. (2017); Su et al. (2018); Gür et al. (2018); Kreyssig et al. (2018). Such approaches are particularly used for optimizing the policy component of pipeline-based systems Fazel-Zarandi et al. (2017), and frequently rely on the use of Natural Language Generation (NLG) templates over dialogue acts Fazel-Zarandi et al. (2017); Shah et al. (2018); Shi et al. (2019); Kreyssig et al. (2018); Acharya et al. (2021). Our work instead utilizes fully lexicalized, E2E models for both the User and the Assistant models, without the need for agendas, dialogue acts, or NLG templates.

Most closely related to our work is that of Shah et al. (2018) and Acharya et al. (2021). Similar to our approach, both use User Simulators and Schemas to generate synthetic data and add them to a training dataset. However, Shah et al. (2018) use an additional stage of human annotation in order to identify failed dialogues and paraphrases, while we show that Task Success can be a sufficient metric without human involvement; we also utilize sampling to achieve diversity Holtzman et al. (2020). Furthermore, Shah et al. (2018) perform only a single iteration of simulation, while we demonstrate improved performance from multiple iterations. Acharya et al. (2021) generalize in a few-shot setting using a loop similar to ours, but rely on templates for NLG and human paraphrases. In contrast, we show that we are able to gain performance improvements without dialogue acts or human paraphrasing, and we also show significant improvements via few-shot annotations using Active Learning.

End-to-End systems for TOD have had a surge in interest recently Asri et al. (2016); Bordes et al. (2016); Liu and Lane (2018); Rastogi et al. (2019); Ham et al. (2020); Hosseini-Asl et al. (2020); Peng et al. (2020); Lin et al. (2020), owing to the success of utilizing pretrained models Devlin et al. (2019); Radford et al. (2019); Lewis et al. (2020). E2E models promise to lower the cost of annotation by replacing traditional pipeline models with text-in-text-out and adjacent external API calls Rastogi et al. (2019); Byrne et al. (2021). Compared to these works, we leverage pretrained models for User models in order to produce simulations, rather than only modeling Assistants. We also leverage Schema-based grounding techniques Rastogi et al. (2019); Balaraman and Magnini (2021), which may be viewed as a form of in-context prompting methods Brown et al. (2020); Schick and Schütze (2020); Wang et al. (2021); Wei et al. (2021).

4 Evaluating Synthetic Dialogue Generation and Related Metrics

We experiment to validate the appropriateness of our E2E TOD Simulator setup. We generate synthetic conversations with a variety of different model architectures and validate that traditional automated offline metrics, our TSR metric, and human evaluation correlate positively and in line with general expectations. Note that our objective is to verify that modern end-to-end models are able to produce reasonable synthetic conversations using only goals, and not to obtain State-of-the-Art performance. While we expect more ‘powerful’ model architectures to perform better than weaker ones, we primarily aim to validate the directional correlation of metrics across models.

4.1 Experimental Setup


We focus our efforts on the Google Schema Guided Dialogue (Google SGD) dataset Rastogi et al. (2019). Google SGD is a large TOD dataset with emphasis on zero-shot incorporation of new skills: models may make Service Calls (API Calls) which return responses. As part of the data, models may have access to API Schemas, which contain call signatures of each of the possible services and intents. The held-out test set contains several services and intents which are not shown in the training data. In this section, we do not employ the API Schemas. Instead, we train Assistant models that must memorize the underlying Schemas directly from the training data.


We experiment with four model architectures: LSTM Hochreiter and Schmidhuber (1997), LSTM with Attention Bahdanau et al. (2014), GPT2 Radford et al. (2019), and BART Lewis et al. (2020) with R3F Aghajanyan et al. (2021). We expect models later in this list to be more powerful. Our GPT2 implementation closely resembles the setup of SimpleTOD Hosseini-Asl et al. (2020), while our BART model roughly mirrors the implementation of MinTL Lin et al. (2020).

We fine-tune on Google SGD, using the original splits from Rastogi et al. (2019). For each model architecture type, we fine-tune separate User and Assistant models. We generate synthetic conversations with the setup described in Section 2. We extract goals from the validation dataset, leaving out conversations with multiple goals; each conversation is seeded with a single API call. We mock API Implementations via a lookup table with fully realized API calls as keys and corresponding API responses as values; this lookup table is populated directly from the dataset, and we return a sentinel failure value if the Assistant requests any API call not in the dataset.
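A lookup-table mock of this kind might look like the following sketch. The class name, the `[FAILED]` sentinel string, and the serialized call format are hypothetical; the source only states that calls are keys, responses are values, and unknown calls return a sentinel.

```python
from typing import Dict, Iterable, Tuple

API_FAILURE = "[FAILED]"  # hypothetical sentinel; the source does not name its value


class LookupApi:
    """Mock API Implementation backed by (call, response) pairs from a dataset."""

    def __init__(self, records: Iterable[Tuple[str, str]]) -> None:
        # Fully realized API calls are keys; API responses are values.
        self._table: Dict[str, str] = dict(records)

    def __call__(self, call: str) -> str:
        # Any call that never appears in the dataset counts as a failure.
        return self._table.get(call, API_FAILURE)


api = LookupApi(
    [("api_name = FindBus ; from = Boston ; to = NYC", "departure = 9am ; fare = 30")]
)
known = api("api_name = FindBus ; from = Boston ; to = NYC")
unknown = api("api_name = FindBus ; from = Boston ; to = LA")
```

Because lookup is by exact key, an Assistant that gets any slot value wrong receives the failure sentinel rather than a partial result.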

Evaluation Metrics

We report two metrics for the Assistant: Joint Goal Accuracy (JGA) and BLEU score. (Footnote 2: We deviate from the original Google SGD evaluation here and report JGA on the API calls rather than the belief state. An example receives a JGA score of 1 iff the model generates the exact API call. Note that the majority of turns do not have API calls, so the majority baseline is about 0.71.) We additionally evaluate our simulation quality by Task Success Rate over the goals of the Valid and Test sets. In Google SGD, the Test set contains Out-of-Domain (OOD) examples that do not appear in the Train or Valid sets. As such, considering both Valid and Test gives us some estimate of OOD performance. While we aim to ensure that the models are well trained, e.g. comparable to results in the literature, our main objective is to examine the correlation between Task Success measured on synthetic data and established offline metrics.
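To make the two headline metrics concrete, here is a small sketch of JGA computed over API calls and of TSR as an average of per-goal success. The function names and the toy data are illustrative assumptions; `None` marks a turn with no API call.

```python
from typing import Optional, Sequence


def api_call_jga(
    predicted: Sequence[Optional[str]], gold: Sequence[Optional[str]]
) -> float:
    """Fraction of turns whose predicted API call exactly matches the gold call."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must cover the same turns")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def task_success_rate(successes: Sequence[bool]) -> float:
    """Average simulation success over a set of goals."""
    return sum(successes) / len(successes)


# Toy data: 5 of 7 turns have no API call, so always predicting "no call"
# already scores 5/7, which is why the majority baseline is high.
gold = [None, None, "call_a", None, None, "call_b", None]
no_call_baseline = [None] * len(gold)
baseline_jga = api_call_jga(no_call_baseline, gold)
tsr = task_success_rate([True, False, True, True])
```

The toy baseline illustrates why JGA compresses differences between models: most of the score comes from turns where no call is expected.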

Human evaluation

We use ACUTE-Eval Li et al. (2019) for human evaluation. ACUTE-Eval is a pairwise evaluation in which an annotator is shown two dialogues and asked a question about which they prefer. We ask annotators the question “Which Assistant would you rather use yourself?” As recommended by Li et al. (2019), we use a manually-curated control pair comparing an artificially repetitive dialogue with a gold dialogue from Google SGD; annotators who failed to identify the gold dialogue were removed. We select a random subset of 400 goals derived from the Valid set of Google SGD and use the same model architecture for both the User and Assistant to generate synthetic conversations. We only present User and Assistant utterance turns to annotators (hiding any API calls). We collect pairwise annotations between each model architecture described above, as well as the gold dialogues from the original dataset (Human). Annotators were presented with conversations with the same goal when comparing model-generated conversations. We measure the fraction of times each model architecture (or Human) was preferred in its pairwise match up, and compare all possible pairs of model architectures. An image of the annotator UI is included in Appendix A.1.
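The pairwise preference fractions, and a significance check on them, can be computed as in the sketch below. The exact two-sided binomial test against a 50% null is one standard choice and is an assumption here, as are the function names and the annotation counts.

```python
from math import comb


def win_rate(wins: int, total: int) -> float:
    """Fraction of pairings in which a system was preferred."""
    return wins / total


def two_sided_binomial_p(wins: int, total: int) -> float:
    """Exact two-sided binomial p-value under the null of no preference (p = 0.5)."""
    extreme = max(wins, total - wins)
    tail = sum(comb(total, i) for i in range(extreme, total + 1)) / 2 ** total
    # The null distribution is symmetric, so double the upper tail and cap at 1.
    return min(1.0, 2.0 * tail)


# Hypothetical counts: one system preferred in 30 of 40 pairings.
rate = win_rate(30, 40)
p_value = two_sided_binomial_p(30, 40)
```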

4.2 Results

Automatic evaluations

Model      User BLEU  Assistant JGA  Assistant BLEU  Simul. TSR (Valid)  Simul. TSR (Test)
LSTM       .058       .777           .123            .042                .042
LSTM+Attn  .078       .833           .183            .302                .169
GPT2       .093       .869           .223            .474                .307
BART       .116       .897           .252            .583                .352
Table 1: Fully automatic metrics of different modeling approaches tested on the original Google SGD split. Simulation Task Success Rate (TSR) increases along with offline metrics, but shows greater discrimination in magnitude than offline metrics.


Results are shown in Table 1. We find that performance on all metrics improves monotonically in the direction we expect: more modern models with better pre-training and regularization outperform others. We find that the magnitude of improvements from better modeling is more visible using either measure of TSR than using more traditional offline metrics.

Human Evaluation

Lose % (rows) vs. Win % (columns)
        LSTM  Attn  GPT2  BART  Human
LSTM    -     .48   .55   .57   .80
Attn    .52   -     .60   .63   .83
GPT2    .45   .40   -     .58   .81
BART    .43   .37   .42   -     .75
Human   .20   .17   .19   .25   -
Figure 2: Pairwise Human Evaluations of simulations by the different models, along with Human conversations. Entries marked in blue agree with automatic metrics, while those in red italics disagree with the automatic metrics. Bold numbers indicate statistical significance (, binomial test). LSTM models hallucinate realistic dialogues that ignore their goal.

The results of our human evaluation are shown in Figure 2. We label the scores depending on whether they agree or disagree with our expectations from automatic metrics. We find only the LSTM v. LSTM-Attn pair disagrees with our expectations, while all other pairwise evaluations agree with our expectations. We also see that Humans are greatly preferred over all the simulations, indicating none of our simulations are at human-level performance. However, we additionally note that preference for gold data roughly decreases as the quality of the system improves.

We manually inspect training logs and reasons for human preferences. In the UI annotators were asked to provide commentary on their selections. As would be expected, successfully helping the User was often a provided reason for preferring one Assistant over another; conciseness, naturalness, and brevity were also mentioned. Of particular note, we observe that the LSTM model generated only a few unique dialogues, and generally completely ignored its goal; though it had very low TSR, LSTM utterances would declare successful task completion. As a result, the LSTM essentially ‘hallucinates its way to success’ since human annotators do not see the goals. On the other hand, the GPT2 and BART models ground strongly on the goals, and generate plausible dialogues for a given goal. In particular, we observe that due to redundancy in slot names across domains, stronger models are able to successfully perform some simulations of completely unseen domains. This ability to generalize is tested rigorously in the next section.

5 Bootstrapping Novel Domains

In the previous section, we demonstrated that User models can be adequately grounded on unseen goals to guide our simulations at generating plausible synthetic dialogues, possibly even on unseen domains. In this section, we consider whether this synthetic data can be used to bootstrap models on completely novel domains.

At a high level, our approach depends on generating synthetic data, filtering the synthetic dialogues using Task Success, and re-training the simulation models while incorporating the synthetic dialogue. This process may be repeated for multiple iterations to form a feedback loop.

Pretraining Data
Google SGD Rastogi et al. (2019)
MultiWoz Budzianowski et al. (2018)
MSR-E2E Li et al. (2018)
MetaLWoZ Lee et al. (2019)
Taskmaster-1 Byrne et al. (2019)
Taskmaster-2 Byrne et al. (2020)
TicketTalk Byrne et al. (2021)
MultiDoGo Peskov et al. (2019)
Table 2: Pretraining datasets. For Google SGD, we use new splits described in Section 5.1.
Fold                 No. Diag.  No. Domains
Pretraining (train)  119,677    29
In-Domain train      13,888     16
In-Domain valid      1,966      16
In-Domain test       3,132      16
Out-of-domain train  2,303      4
Out-of-domain valid  768        4
Out-of-domain test   768        4
Table 3: Dataset statistics. Datasets used for simulator pre-training and evaluation. In-Domain and Out-of-Domain refer to the new splits described in Section 5.1. Pretraining statistics include those of In-Domain Google SGD. The Out-of-Domain train fold is used to sample goals and train baselines, and not for pretraining.

5.1 Experimental Setup

Pretraining & Data setup

Following the work of Soloist Peng et al. (2020) and TOD-BERT Wu et al. (2020), we pretrain a BART model on a large number of open source Task Oriented Dialogue datasets. In early experimentation, we found pretraining improved zero-shot Out-of-Domain JGA performance by about 6 absolute points, and initialize with this pretrained model for all baselines and proposed models. A complete list of the datasets used is shown in Table 2. For brevity, full descriptions of each dataset and relevant preprocessing may be found in Appendix A.2.

To make sure that our out-of-domain bootstrapping of Google SGD is truly out of domain relative to pretraining, we build two custom splits for Google SGD. We analyze domains present across all of our datasets and select four holdout domains that are unique to Google SGD. We assign all conversations that use any of these holdout domains to the Out-of-Domain split and all remaining conversations to the In-Domain split. Note that as Google SGD is a dataset which has multiple domains in a given conversation, some non-holdout domains appear as part of conversations in the Out-of-Domain split. Only In-Domain Google SGD is used for pretraining. All domains are listed in the Appendix. Final statistics of pretraining and evaluation data are provided in Table 3.
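The split rule above amounts to a set-intersection check per conversation. The sketch below assumes each conversation carries a list of its domains; the field name `domains` and the placeholder domain names are hypothetical, since the actual holdout domains are listed only in the Appendix.

```python
from typing import Dict, List, Set, Tuple

Conversation = Dict[str, object]


def split_by_holdout(
    conversations: List[Conversation], holdout: Set[str]
) -> Tuple[List[Conversation], List[Conversation]]:
    """Send a conversation to Out-of-Domain if it touches ANY holdout domain."""
    in_domain: List[Conversation] = []
    out_of_domain: List[Conversation] = []
    for conv in conversations:
        touches_holdout = bool(set(conv["domains"]) & holdout)
        (out_of_domain if touches_holdout else in_domain).append(conv)
    return in_domain, out_of_domain


# Placeholder domain names, not the paper's actual holdout domains.
HOLDOUT = {"HoldoutA", "HoldoutB"}
conversations = [
    {"id": 1, "domains": ["Hotels"]},
    {"id": 2, "domains": ["Hotels", "HoldoutA"]},  # mixed: still Out-of-Domain
    {"id": 3, "domains": ["HoldoutB"]},
]
in_domain, out_of_domain = split_by_holdout(conversations, HOLDOUT)
```

Conversation 2 shows why non-holdout domains can still appear inside the Out-of-Domain split.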

Bootstrapping Procedure

We use Schema-Aware models to generate synthetic training data, as we find that using schema-aware models is necessary for zero-shot and few-shot domain generalization. We use goals from the Train split of Out-of-Domain Google SGD to generate grounding for synthetic training data. In order to increase data diversity and prevent overfitting, we use Nucleus generation (Holtzman et al., 2020). We generate 20 synthetic conversations for a given goal and retain only successful conversations. These successful conversations make up the synthetic data that we use for fine-tuning, with 10% withheld and used for model selection.
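For readers unfamiliar with Nucleus (top-p) sampling, the sketch below shows the core idea on a single next-token distribution: keep the smallest set of highest-probability tokens whose cumulative mass reaches `top_p`, renormalize, and sample from that set. The threshold of 0.9 and the toy distribution are arbitrary choices for illustration; the source does not state the value used.

```python
import random
from typing import Dict


def nucleus_filter(probs: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Keep the most probable tokens until their cumulative mass reaches top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept = []
    cumulative = 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}  # renormalized


def nucleus_sample(probs: Dict[str, float], top_p: float, rng: random.Random) -> str:
    """Sample one token from the renormalized nucleus."""
    filtered = nucleus_filter(probs, top_p)
    return rng.choices(list(filtered), weights=list(filtered.values()), k=1)[0]


toy_probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
nucleus = nucleus_filter(toy_probs, top_p=0.9)
token = nucleus_sample(toy_probs, top_p=0.9, rng=random.Random(0))
```

Sampling rather than greedy decoding is what lets 20 generations for the same goal differ from one another.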

We fine-tune both Schema-Aware (for data generation) and Schema-Agnostic (for evaluation) versions of our models on this synthetic data. Recall that Schema-Aware models have the User intent leaked to them, and therefore only Schema-Agnostic models may be used for evaluation. Models are fine-tuned incrementally – the best model from the previous iteration acts as initialization for the next iteration – and synthetic data is accumulated across iterations. During early experimentation, we found that fine-tuning a single model on both User and Assistant roles generally performed better than fine-tuning separate models and use this multitask setup for all of our experiments. Additionally, we find that multitasking on In-Domain data alongside synthetic data helps prevent overfitting, and include it in all experimental conditions.
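The generate, filter, and retrain loop can be sketched as follows. This is a schematic under stated assumptions, not the authors' training code: `make_simulator` and `fine_tune` are hypothetical callables, the toy "model" is just an integer skill level, and the sketch omits the separate Schema-Aware and Schema-Agnostic models and the In-Domain multitasking described above.

```python
import random
from typing import Callable, List, Tuple

Dialogue = List[str]
SimulateFn = Callable[[str], Tuple[bool, Dialogue]]


def collect_successful(
    goals: List[str], simulate_fn: SimulateFn, samples_per_goal: int = 20
) -> List[Dialogue]:
    """Generate several dialogues per goal and keep only the successful ones."""
    kept: List[Dialogue] = []
    for goal in goals:
        for _ in range(samples_per_goal):
            success, dialogue = simulate_fn(goal)
            if success:
                kept.append(dialogue)
    return kept


def bootstrap(model, goals, make_simulator, fine_tune, iterations, rng):
    """Accumulate filtered synthetic data and fine-tune incrementally."""
    train: List[Dialogue] = []
    valid: List[Dialogue] = []
    for _ in range(iterations):
        new = collect_successful(goals, make_simulator(model))
        rng.shuffle(new)
        n_valid = len(new) // 10       # withhold 10% for model selection
        valid.extend(new[:n_valid])
        train.extend(new[n_valid:])    # synthetic data accumulates across iterations
        model = fine_tune(model, train, valid)  # starts from the previous model
    return model, train, valid


# Toy stand-ins: a goal succeeds once the model's skill reaches its difficulty.
DIFFICULTY = {"goal_easy": 2, "goal_medium": 4, "goal_hard": 6}


def toy_make_simulator(skill: int) -> SimulateFn:
    return lambda goal: (skill >= DIFFICULTY[goal], [f"dialogue for {goal}"])


def toy_fine_tune(skill: int, train: List[Dialogue], valid: List[Dialogue]) -> int:
    return skill + 2 if train else skill


final_skill, train, valid = bootstrap(
    3, list(DIFFICULTY), toy_make_simulator, toy_fine_tune,
    iterations=3, rng=random.Random(0),
)
```

In the toy run the model initially succeeds only on the easiest goal, and each round of fine-tuning unlocks a harder one, which mirrors the feedback loop the procedure relies on.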

Experimental Comparisons

We perform multiple experimental comparisons for models with synthetic data.

In our first experiment, we compare the performance of Schema-Aware and Schema-Agnostic models. This is to demonstrate that providing Schemas boosts performance and raises the Task Success Rate, enabling generation feedback loops.

In our second experiment, we consider how simulations may be used to bootstrap models in a Zero-shot setting. In these experiments, we only provide Out-of-Domain information via goals and completely synthetic data. To show the improvement provided by simulations, we compare primarily against the Base pretrained model, which has never seen any Out-of-Domain information. To contextualize the result, we also provide results for a Fully-Supervised model upper baseline, which was fine-tuned directly on all available Out-of-Domain data.

As models ‘in the wild’ would not have access to Schemas (i.e. they must detect user intent), we only evaluate the Schema-agnostic versions, and report offline metrics of Out-of-Domain JGA and Assistant BLEU-4, as well as the online metric TSR. We also report offline In-Domain JGA to ensure the use of synthetic data does not harm existing knowledge in the model.

In our third experiment, we consider how simulations enable a form of Active Learning. In these setups, we intentionally add domain-specific training data. At each iteration, we evaluate performance of the model across all the goals, and identify the 8 Schemas with the lowest overall performance. We select 8 conversations from the Out-of-Domain training set matching these goals, and add them into the training data in the next iteration (in addition to the synthetic data). We also select 8 conversations from the validation set and add them to the validation data. This method can be seen as a form of Active Learning, where model performance by goals is used to guide data collection schemes at a lower total cost. As a comparison point, we evaluate models trained with an equal number of randomly chosen samples, which demonstrate the performance of few-shot modeling without the use of simulation methodologies. To contextualize performance, we also compare to a model which uses more few-shot samples, and the fully supervised model.
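The selection step in this Active Learning setup can be sketched as ranking Schemas by simulated success and pulling matching conversations from the annotated pool. Using per-Schema Task Success as the performance measure and taking one conversation per weak Schema are assumptions of this sketch, as are the function and field names.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def weakest_schemas(results: Iterable[Tuple[str, bool]], k: int = 8) -> List[str]:
    """Return the k Schemas with the lowest success rate (ties broken by name)."""
    counts: Dict[str, List[int]] = defaultdict(lambda: [0, 0])
    for schema, success in results:
        counts[schema][0] += int(success)
        counts[schema][1] += 1
    rates = {s: wins / total for s, (wins, total) in counts.items()}
    return sorted(rates, key=lambda s: (rates[s], s))[:k]


def pick_conversations(pool: List[dict], schemas: List[str]) -> List[dict]:
    """Pick the first pooled conversation matching each weak Schema."""
    chosen = []
    for schema in schemas:
        match = next((c for c in pool if c["schema"] == schema), None)
        if match is not None:
            chosen.append(match)
    return chosen


# Toy simulation outcomes and annotation pool (hypothetical Schema names).
results = [
    ("FindBus", True), ("FindBus", True),
    ("BookHotel", False), ("BookHotel", True),
    ("GetWeather", False), ("GetWeather", False),
]
pool = [
    {"id": 1, "schema": "FindBus"},
    {"id": 2, "schema": "GetWeather"},
    {"id": 3, "schema": "BookHotel"},
    {"id": 4, "schema": "GetWeather"},
]
weak = weakest_schemas(results, k=2)
selected = pick_conversations(pool, weak)
```

With `k=8` per iteration, four iterations of this selection would yield the 32 domain-specific conversations mentioned in the abstract.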

We analyze these different model conditions in more granularity in Section A.5.

In our final experiment, we ensure we have not inadvertently overfit on the validation goals through accidental leakage. We test all the above models on the held-out test set, which contains entirely unseen goals and conversations. Additionally, we evaluate whether our bootstrap procedure can be used in data-rich environments by applying it on top of a fully-supervised model.

Figure 3: Bootstrapped Performance with no Domain-specific data on validation data. Out-of-Domain JGA and synthetic TSR both improve dramatically through only the use of synthetically generated data. Assistant BLEU and In-Domain JGA are unaffected by the synthetic data.

5.2 Results

Use of Schemas

Results of our first experiment are shown in Table 4. Across both In-Domain and Out-of-Domain, we see a substantial rise in performance in Schema-Aware models. This is unsurprising, as providing Schemas essentially cheats, allowing the model to bypass Intent detection. However, providing Schemas has a dramatic effect on Out-of-Domain performance, boosting JGA well above baseline performance, and enabling a non-zero Task Success Rate. Such successful conversations form the basis of our synthetic data during bootstrapping, and are thus critical to our methodology.

                 In-Domain     Out-of-Dom
                 JGA    TSR    JGA    TSR
Schema-Agnostic  .878   .292   .777   .000
Schema-Aware     .960   .839   .880   .369
Table 4: Comparing BART models with and without access to Schemas on Valid metrics. Schemas help guide the Assistant model to making the correct API calls in novel domains, as shown in TSR.


Results for our Zero-shot experiments are shown in Figure 3, with additional metrics provided in Appendix A.4. Overall, we find that Out-of-Domain JGA performance goes up by a total of 8.3 absolute points, or about a 37% reduction in total errors, despite having no access to domain-specific data. However, JGA plateaus after just one iteration of simulation training, and further iterations provide marginal negative value.
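The relationship between the 8.3-point gain and the 37% error reduction can be checked with a short calculation. The baseline of .777 is the Schema-Agnostic Out-of-Domain JGA reported in Table 4; the helper name is hypothetical.

```python
def relative_error_reduction(base_accuracy: float, new_accuracy: float) -> float:
    """Share of the baseline's errors that the new model eliminates."""
    base_error = 1.0 - base_accuracy
    new_error = 1.0 - new_accuracy
    return (base_error - new_error) / base_error


base_jga = 0.777             # Schema-Agnostic Out-of-Domain JGA (Table 4)
new_jga = base_jga + 0.083   # gain of 8.3 absolute points
reduction = relative_error_reduction(base_jga, new_jga)
```

The error rate falls from about .223 to about .140, and .083 / .223 is roughly 0.37.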

On the other hand, TSR continues to improve for 3 iterations before eventually plateauing at approximately the level of the Fully-Supervised model; this suggests that JGA and TSR are no longer coupled, and that our bootstrapping procedure primarily optimizes its selection criterion: Task Success. To ensure the model was not simply making random API guesses to maximize TSR, we counted the number of API calls in simulation, and found the model converged on approximately 1 call per dialogue, matching the desired distribution.

We also find that Assistant BLEU does not change significantly compared to baseline, suggesting that improvements to Task Success do not translate to improvements in Natural Language Generation. Finally, we see that In-Domain JGA remains unchanged relative to the baseline, demonstrating that the addition of synthetic data does not come at the cost of performance in other domains.

Figure 4: Bootstrapped Validation Performance with Active Learning. Active Learning vastly outperforms a model with an equivalent number of few-shot samples. JGA and TSR match performance of a baseline with more domain-specific samples.

Active Learning

Results for our Active Learning experiments are shown in Figure 4. In contrast to the Zero-shot models, we see that JGA performance consistently improves for 3 iterations, finishing with a total of 13.4 points over the Zero-shot baseline and matching the performance of a Few-shot model with 320 dialogues (in green), 10× the number available to the Active Learning model. The Active Learning model also shows a large gain over the sample-equivalent Few-shot model (dashed orange), and the gain increases with the number of samples. These results demonstrate that our use of Task Success strongly improves our sample-efficiency. TSR performance continues to improve, and eventually exceeds that of the fully-supervised model that was trained with more data.

Although not shown, we find that Assistant BLEU, as in our Zero-shot experiments, does not significantly improve. This indicates that TSR is more strongly correlated with JGA, and its optimization primarily benefits NLU. In-Domain JGA also remains flat, confirming that synthetic data does not lower existing performance. Additional metrics are provided in Appendix A.4.

Test Set Results

To ensure that our methodology did not inadvertently overfit via the leaking of goals, we report final Test Set performance for a fully held-out set of Out-of-Domain data; for Simulation-based models, we evaluated models after 4 iterations. Results are shown in Table 5. We find that results are consistent with our earlier analysis. Zero-shot JGA improves 9 points over the baseline, and Active Learning gains 14 points over the baseline. Task Success Rate shows larger improvements, and matches the Fully-supervised baseline. Both models outperform the Few-shot only baseline. Finally, we see that our simulation procedure remains useful even in data-rich environments: adding simulations to a fully supervised model improves JGA by 0.4 absolute points (15% error reduction).

In-Domain Out-of-Dom
Base model .829 .394 .770 .000
Simulation .838 .454 .860 .779
Few-shot only () .835 .459 .852 .140
Active Learning () .830 .362 .911 .799
Fully Supervised .895 .551 .973 .769
Fully Sup. + Simulation .895 .555 .977 .847
Table 5: Test set results. Final performance on the held out Test-set for both In-Domain and Out-of-Domain.
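The error-reduction figure quoted for the fully supervised setting can be checked with a few lines of arithmetic. The sketch below treats 1 − JGA as the error rate and assumes, following the numbers quoted in the text, that the third numeric column of Table 5 is Out-of-Domain JGA; the function name is illustrative only.

```python
def relative_error_reduction(before: float, after: float) -> float:
    """Fraction of the remaining error (1 - accuracy) that is removed."""
    error_before = 1.0 - before
    error_after = 1.0 - after
    return (error_before - error_after) / error_before


# Out-of-Domain JGA values read from Table 5 (assumed to be the third column).
fully_supervised = 0.973
fully_supervised_plus_sim = 0.977

reduction = relative_error_reduction(fully_supervised, fully_supervised_plus_sim)
print(f"Relative error reduction: {reduction:.1%}")
```

A gain of 0.4 absolute points on an error of 2.7 points removes roughly 15% of the remaining error, which is why a small absolute change still corresponds to a sizeable relative one.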

Human Evaluation

We perform a final human evaluation using each of the models from our experimental conditions. We repeat the ACUTE-Evals described in Section 4.2, using synthetic dialogues from each condition. Results are shown in Figure 5.

We find that all human judgements are roughly consistent with our expectations from offline evaluation, with clear wins for Active Learning over the Baseline and Few-shot models. In all cases, the Human gold data significantly outperforms all of our models, indicating further avenues for improvement. Nonetheless, the Human win rate is lower against our simulation-based models than against the Baseline and Few-shot models.

In analysis of annotator preferences, we find that annotator selection is generally well-correlated with TSR: Annotators preferred unsuccessful conversations over successful conversations in only about 10% of pairings. Simplicity and clarity were oftentimes given as rationale for preference in these pairings: annotators generally preferred Assistants that had clear communication. Many of the models learned to ask confirmation questions; this was generally liked by the Annotators as long as there was not too much back and forth between the User and Assistant models in doing so. Some Annotators even preferred generated conversations with confirmations over the gold, human conversations. While some Annotators preferred “friendlier” or “more conversational” Assistants, this was not a consistent preference; some Annotators found similar conversations to be “weirdly informal” or to “take too long to get to the point.”

Lose % (rows) vs. Win % (columns)

                 Base Model  Few-shot only  Zero-Shot Sim.  Active Learning  Human
Base Model       -           .75            .74             .74              .91
Few-shot only    .25         -              .75             .67              .71
Zero-Shot Sim.   .26         .25            -               .46              .58
Active Learning  .26         .33            .54             -                .69
Human            .09         .29            .42             .31              -
Figure 5: Pairwise Human Evaluations of bootstrapped models. Entries marked in blue agree with automatic metrics, while those in red italics disagree with the automatic metrics. Bold numbers indicate statistical significance (, binomial test).
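As a rough illustration of the significance test named in the caption, the sketch below computes an exact two-sided binomial p-value against a 50% win rate using only the standard library. The number of pairings per comparison is not stated in this section, so the value of 100 used here is purely hypothetical, as is the function name.

```python
from math import comb


def binomial_two_sided_p(wins: int, trials: int) -> float:
    """Exact two-sided binomial test of `wins` out of `trials` against p = 0.5.

    Under p = 0.5 the distribution is symmetric, so the two-sided p-value is
    twice the tail mass at or beyond the more extreme of wins and losses.
    """
    k = max(wins, trials - wins)
    tail = sum(comb(trials, i) for i in range(k, trials + 1)) / 2 ** trials
    return min(1.0, 2 * tail)


# Hypothetical: 100 annotated pairings per comparison (not given in the text).
pairings = 100
p_clear_win = binomial_two_sided_p(75, pairings)   # e.g. a .75 win rate
p_near_tie = binomial_two_sided_p(54, pairings)    # e.g. a .54 win rate
print(p_clear_win, p_near_tie)
```

Under this assumed sample size, a .75 win rate is far from chance while a .54 win rate is not distinguishable from a coin flip; the actual thresholds depend on the number of pairings collected.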

6 Limitations

Figure 6: Example Synthetic Conversation generated by our final Active Learning model, using Greedy generation. Although the conversation is successful, the linguistic variation is low.

While our bootstrapping and Active Learning procedures do significantly improve the robustness of models, they do surprisingly little to affect linguistic diversity, especially when using greedy generation. As a representative example, see Figure 6. In it, we observe that the conversation devolves into a simple slot-filling questionnaire, with the User beginning many utterances with “I need…” A manual review of simulated conversations found that while hallucination is very low in greedy-generated dialogues, most dialogues roughly follow this slot-filling questionnaire pattern. While the synthetic conversations are plausible, their linguistic diversity is extremely low, which explains why our Task Success Rates reach near-perfect levels: the User learns to specify things as simply as possible. Furthermore, we find that one of our domains (Make payments) has multiple instances of infinite loops being generated, a known and common issue in Neural Language Models (Holtzman et al., 2020; Welleck et al., 2019).

Figure 7: Example Synthetic Conversation using Nucleus generation. Linguistic diversity is improved, but the Assistant hallucinates information (in orange).

We also show an excerpt from a similar dialogue, generated with Nucleus sampling instead, in Figure 7. This excerpt from the Rental Cars domain is mostly representative of the synthetic data generated for our training runs. We see increased linguistic variation, but we also observe that the Assistant model hallucinates: the API response does not say how many cars are available (it only provides one), and rental car services are unlikely to pick up their clients (a behavior likely learned from the Ride Share domain). Upon further inspection, we discovered that this problem was ubiquitous for the Base model but present only in limited quantities in bootstrapped models, suggesting that retraining actually improves this behavior.
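The contrast between the two decoding strategies can be sketched on a toy next-utterance distribution. This is a minimal illustration of greedy selection versus nucleus (top-p) sampling in the sense of Holtzman et al. (2020), not the generation code used in the experiments; the candidate strings, probabilities, and function names are all made up for the example.

```python
import random
from typing import Dict


def greedy_pick(probs: Dict[str, float]) -> str:
    """Always return the single most probable candidate."""
    return max(probs, key=probs.get)


def nucleus_pick(probs: Dict[str, float], top_p: float, rng: random.Random) -> str:
    """Sample from the smallest high-probability set whose mass reaches top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, mass = [], 0.0
    for candidate, p in ranked:
        nucleus.append((candidate, p))
        mass += p
        if mass >= top_p:
            break
    candidates = [c for c, _ in nucleus]
    weights = [p for _, p in nucleus]  # random.choices renormalizes the weights
    return rng.choices(candidates, weights=weights, k=1)[0]


# Toy distribution over how a User utterance might begin (invented values).
next_token_probs = {"I need": 0.6, "Could you": 0.25, "Hi, I'd like": 0.1, "zebra": 0.05}

rng = random.Random(0)
print(greedy_pick(next_token_probs))
print([nucleus_pick(next_token_probs, 0.9, rng) for _ in range(5)])
```

Greedy selection returns the same opening every time, which mirrors the repetitive “I need…” pattern, while nucleus sampling varies the opening but can also surface less likely continuations, which is where hallucinated content may enter.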

These examples, along with the lack of improvements in NLG metrics (BLEU scores), show that optimizing for Task Success is likely to reduce the linguistic diversity of Assistants or Users, unless generation methods prone to hallucination are used. In the future, identifying automatic filtering techniques for NLG utterances, similar to Task Success and slot filling, could help with this problem. Other generation methods, like Diverse Beam Search (Vijayakumar et al., 2016), may also achieve a better trade-off between hallucination and diversity. Alternatively, the model could be trained only on the API Call turns of synthetic data. Nonetheless, the offline metrics on the Test set demonstrate that our methodology does improve the robustness of the Natural Language Understanding components of our model, as indicated by the Out-of-Domain JGA metrics.

7 Conclusion

We explored the use of pretrained Language Models as User Simulators in order to generate synthetic dialogues, and filtered these dialogues for quality using Task Success. We demonstrated that our methodology can be used to improve models in zero-shot, few-shot, and full-shot settings. By incorporating Active Learning, we additionally showed that our models are able to bootstrap NLU performance to that of a model with more training data. We encourage future work to look for improved generation methods which improve diversity without hallucination, and to find methods for automatically grading the quality of generations. Other improvements, such as the use of Schema Descriptions (Rastogi et al., 2019; Lin et al., 2021), may provide further generalization on unseen domains.


Acknowledgments

Thank you to members of the Facebook dialogue teams for their helpful feedback and suggestions on this project. We are particularly grateful to Justin Cho, Jianguo Zhang, Shahin Shayandeh, Ahmad Beirami, and Arthur Szlam for their detailed discussions. We also thank Weiyan Shi and Jesse Thomason for their feedback on early drafts of this paper.


References

  • A. Acharya, S. Adhikari, S. Agarwal, V. Auvray, N. Belgamwar, A. Biswas, S. Chandra, T. Chung, M. Fazel-Zarandi, R. Gabriel, S. Gao, R. Goel, D. Hakkani-Tur, J. Jezabek, A. Jha, J. Kao, P. Krishnan, P. Ku, A. Goyal, C. Lin, Q. Liu, A. Mandal, A. Metallinou, V. Naik, Y. Pan, S. Paul, V. Perera, A. Sethi, M. Shen, N. Strom, and E. Wang (2021) Alexa conversations: an extensible data-driven approach for building task-oriented dialogue systems. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Demonstrations, Online, pp. 125–132.
  • A. Aghajanyan, A. Shrivastava, A. Gupta, N. Goyal, L. Zettlemoyer, and S. Gupta (2021) Better fine-tuning by reducing representational collapse. ICLR.
  • H. Ai and F. Weng (2008) User simulation as testing for spoken dialog systems. In Proceedings of the 9th SIGdial Workshop on Discourse and Dialogue, pp. 164–171.
  • L. E. Asri, J. He, and K. Suleman (2016) A sequence-to-sequence model for user simulation in spoken dialogue systems. arXiv preprint arXiv:1607.00070.
  • D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
  • V. Balaraman and B. Magnini (2021) Domain-aware dialogue state tracker for multi-domain dialogue systems. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, pp. 866–873.
  • A. Bordes, Y. Boureau, and J. Weston (2016) Learning end-to-end goal-oriented dialog. arXiv preprint arXiv:1605.07683.
  • T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. (2020) Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
  • P. Budzianowski, T. Wen, B. Tseng, I. Casanueva, S. Ultes, O. Ramadan, and M. Gašić (2018) MultiWOZ - a large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 5016–5026.
  • B. Byrne, K. Krishnamoorthi, S. Ganesh, A. Dubey, K. Kim, and A. Cedilnik (2020) Taskmaster-2. Note: https://research.google/tools/datasets/taskmaster-2/ Accessed: 2021-01-25.
  • B. Byrne, K. Krishnamoorthi, S. Ganesh, and M. Kale (2021) TicketTalk: toward human-level performance with end-to-end, transaction-based dialog systems. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, pp. 671–680.
  • B. Byrne, K. Krishnamoorthi, C. Sankar, A. Neelakantan, B. Goodrich, D. Duckworth, S. Yavuz, A. Dubey, K. Kim, and A. Cedilnik (2019) Taskmaster-1: toward a realistic and diverse dialog dataset. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 4516–4525.
  • P. A. Crook and A. Marin (2017) Sequence to sequence modeling for user simulation in dialog systems. In INTERSPEECH, pp. 1706–1710.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186.
  • M. Fazel-Zarandi, S. Li, J. Cao, J. Casale, P. Henderson, D. Whitney, and A. Geramifard (2017) Learning robust dialog policies in noisy environments. arXiv preprint arXiv:1712.04034.
  • K. Georgila, J. Henderson, and O. Lemon (2006) User simulation for spoken dialogue systems: learning and evaluation. In INTERSPEECH.
  • I. Gür, D. Hakkani-Tür, G. Tür, and P. Shah (2018) User modeling for task oriented dialogues. In 2018 IEEE Spoken Language Technology Workshop (SLT), pp. 900–906.
  • D. Ham, J. Lee, Y. Jang, and K. Kim (2020) End-to-end neural pipeline for goal-oriented dialogue systems using GPT-2. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 583–592.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780.
  • A. Holtzman, J. Buys, M. Forbes, and Y. Choi (2020) The curious case of neural text degeneration. ArXiv abs/1904.09751.
  • E. Hosseini-Asl, B. McCann, C. Wu, S. Yavuz, and R. Socher (2020) A simple language model for task-oriented dialogue. arXiv preprint arXiv:2005.00796.
  • S. Jung, C. Lee, K. Kim, M. Jeong, and G. G. Lee (2009) Data-driven user simulation for automated evaluation of spoken dialog systems. Comput. Speech Lang. 23 (4), pp. 479–509. ISSN 0885-2308.
  • D. Jurafsky and J. H. Martin (2009) Speech and language processing: an introduction to natural language processing, computational linguistics, and speech recognition. Pearson Prentice Hall, Upper Saddle River, N.J. ISBN 9780131873216 0131873210.
  • F. Kreyssig, I. Casanueva, P. Budzianowski, and M. Gasic (2018) Neural user simulation for corpus-based policy optimisation of spoken dialogue systems. ArXiv abs/1805.06966.
  • S. Lee, H. Schulz, A. Atkinson, J. Gao, K. Suleman, L. El Asri, M. Adada, M. Huang, S. Sharma, W. Tay, et al. (2019) Multi-domain task-completion dialog challenge. Dialog system technology challenges 8, pp. 9.
  • M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer (2020) BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 7871–7880.
  • M. Li, J. Weston, and S. Roller (2019) ACUTE-EVAL: improved dialogue evaluation with optimized questions and multi-turn comparisons. In NeurIPS workshop on Conversational AI.
  • X. Li, Z. C. Lipton, B. Dhingra, L. Li, J. Gao, and Y. Chen (2016) A user simulator for task-completion dialogues. arXiv preprint arXiv:1612.05688.
  • X. Li, S. Panda, J. Liu, and J. Gao (2018) Microsoft dialogue challenge: building end-to-end task-completion dialogue systems. arXiv preprint arXiv:1807.11125.
  • Z. Lin, B. Liu, A. Madotto, S. Moon, P. Crook, Z. Zhou, Z. Wang, Z. Yu, E. Cho, R. Subba, et al. (2021) Zero-shot dialogue state tracking via cross-task transfer. arXiv preprint arXiv:2109.04655.
  • Z. Lin, A. Madotto, G. I. Winata, and P. Fung (2020) MinTL: minimalist transfer learning for task-oriented dialogue systems. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 3391–3405.
  • B. Liu and I. Lane (2018) End-to-end learning of task-oriented dialogs. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, New Orleans, Louisiana, USA, pp. 67–73.
  • A. H. Miller, W. Feng, A. Fisch, J. Lu, D. Batra, A. Bordes, D. Parikh, and J. Weston (2017) Parlai: a dialog research software platform. arXiv preprint arXiv:1705.06476.
  • F. Olsson (2009) A literature survey of active machine learning in the context of natural language processing. Technical report, Swedish Institute of Computer Science.
  • B. Peng, C. Li, J. Li, S. Shayandeh, L. Liden, and J. Gao (2020) Soloist: few-shot task-oriented dialog with a single pre-trained auto-regressive model. arXiv preprint arXiv:2005.05298.
  • B. Peng, X. Li, L. Li, J. Gao, A. Celikyilmaz, S. Lee, and K. Wong (2017) Composite task-completion dialogue policy learning via hierarchical deep reinforcement learning. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 2231–2240.
  • D. Peskov, N. Clarke, J. Krone, B. Fodor, Y. Zhang, A. Youssef, and M. Diab (2019) Multi-domain goal-oriented dialogues (MultiDoGO): strategies toward curating and annotating large scale dialogue data. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 4526–4536.
  • A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. (2019) Language models are unsupervised multitask learners. OpenAI blog 1 (8), pp. 9.
  • A. Rastogi, X. Zang, S. Sunkara, R. Gupta, and P. Khaitan (2019) Towards scalable multi-domain conversational agents: the schema-guided dialogue dataset. arXiv preprint arXiv:1909.05855.
  • J. Schatzmann, K. Georgila, and S. Young (2005) Quantitative evaluation of user simulation techniques for spoken dialogue systems. In Proceedings of the 6th SIGdial Workshop on Discourse and Dialogue, Lisbon, Portugal, pp. 45–54.
  • J. Schatzmann, B. Thomson, K. Weilhammer, H. Ye, and S. Young (2007) Agenda-based user simulation for bootstrapping a POMDP dialogue system. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers, Rochester, New York, pp. 149–152.
  • T. Schick and H. Schütze (2020) It’s not just size that matters: small language models are also few-shot learners. arXiv preprint arXiv:2009.07118.
  • P. Shah, D. Hakkani-Tür, B. Liu, and G. Tür (2018) Bootstrapping a neural conversational agent with dialogue self-play, crowdsourcing and on-line reinforcement learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers), New Orleans - Louisiana, pp. 41–51.
  • W. Shi, K. Qian, X. Wang, and Z. Yu (2019) How to build user simulators to train RL-based dialog systems. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 1990–2000.
  • S. Su, X. Li, J. Gao, J. Liu, and Y. Chen (2018) Discriminative deep dyna-q: robust planning for dialogue policy learning. arXiv preprint arXiv:1808.09442.
  • G. Tur, R. E. Schapire, and D. Hakkani-Tur (2003) Active learning for spoken language understanding. In 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings (ICASSP’03), Vol. 1, pp. I–I.
  • A. K. Vijayakumar, M. Cogswell, R. R. Selvaraju, Q. Sun, S. Lee, D. Crandall, and D. Batra (2016) Diverse beam search: decoding diverse solutions from neural sequence models. arXiv preprint arXiv:1610.02424.
  • M. A. Walker, D. J. Litman, C. A. Kamm, and A. Abella (1997) PARADISE: a framework for evaluating spoken dialogue agents. In 35th Annual Meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics, Madrid, Spain, pp. 271–280.
  • M. Walker, A. Kamm, and D. Litman (2000) Towards developing general models of usability with paradise. Natural Language Engineering 6.
  • Z. Wang, A. W. Yu, O. Firat, and Y. Cao (2021) Towards zero-label language learning. arXiv preprint arXiv:2109.09193.
  • J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le (2021) Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652.
  • S. Welleck, I. Kulikov, S. Roller, E. Dinan, K. Cho, and J. Weston (2019) Neural text generation with unlikelihood training. arXiv preprint arXiv:1908.04319.
  • C. Wu, S. C.H. Hoi, R. Socher, and C. Xiong (2020) TOD-BERT: pre-trained natural language understanding for task-oriented dialogue. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 917–929.

Appendix A Appendix

A.1 Screenshot of Annotator UI

Figure 8: Screenshot of Annotator UI. Annotators are asked to evaluate "Which Conversational Assistant System is better" by pressing radio buttons corresponding to two presented conversations.

A.2 Pretraining Datasets

Dataset Dial. Dom. Overlap with Google SGD
GoogleSGD In-Domain 18,986 16 Alarm, Banks, Buses, Calendar, Events, Flights, Hotels, Media, Movies, Music, Restaurants, Rideshare, Services, Travel, Trains, Weather
GoogleSGD Out-of-Domain 3,839 4 Home Search, Messaging, Payment, Rental Cars
MetaLWoZ 37,884 47 Banks, Buses, Events, Movies, Music, Restaurants
MSR-E2E 10,087 3 Movies, Restaurants, Taxis
MultiDoGo 19,522 6 Calendar, Flights, Media, Weather
MultiWoz 10,438 7 Attractions, Hospitals, Hotels, Restaurants, Taxis, Train
Taskmaster-1 13,215 6 Restaurants, Rideshare
Taskmaster-2 17,289 7 Flights, Hotels, Movies, Music, Restaurants
TicketTalk 23,789 1 Movies
Table 6: Detailed statistics of datasets used in our work. All datasets except for Google SGD Out-of-Domain were used in pretraining.

We describe the different datasets that we use for pretraining. In general, we attempt to structure these datasets as similarly as possible to the conversation format described in Sec 2, and we generate data for separate User and Assistant models once formatted. For datasets with no API Call or Response labels, we imitate these values by accumulating dialogue state across user and assistant responses, respectively, and presenting these on the appropriate turns. Other exceptions are described inline.
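One way to picture the imitation of missing API Call and Response labels is sketched below. It is a minimal reading of the description above, assuming each turn carries a speaker and a dictionary of annotated slots; the turn schema, key names, and function name are hypothetical and do not come from any of the datasets.

```python
from typing import Dict, List


def imitate_api_labels(turns: List[dict]) -> List[dict]:
    """Attach pseudo API Call / API Response labels built from running state.

    User-side slots accumulate into a pseudo API Call; assistant-side slots
    accumulate into a pseudo API Response. Each label is a snapshot of the
    state at that turn.
    """
    call_state: Dict[str, str] = {}      # accumulated from user turns
    response_state: Dict[str, str] = {}  # accumulated from assistant turns
    labelled = []
    for turn in turns:
        slots = turn.get("slots", {})
        if turn["speaker"] == "user":
            call_state.update(slots)
            labelled.append({**turn, "api_call": dict(call_state)})
        else:
            response_state.update(slots)
            labelled.append({**turn, "api_response": dict(response_state)})
    return labelled


# Hypothetical three-turn dialogue with slot annotations only.
dialogue = [
    {"speaker": "user", "text": "I need a table in Oakland.", "slots": {"city": "Oakland"}},
    {"speaker": "assistant", "text": "How about Luna at 7pm?", "slots": {"name": "Luna", "time": "7pm"}},
    {"speaker": "user", "text": "Make it for four people.", "slots": {"party_size": "4"}},
]
labelled_dialogue = imitate_api_labels(dialogue)
print(labelled_dialogue[-1]["api_call"])
```

Copying the state at each turn keeps earlier labels from being mutated by later turns, so every turn sees only the information available up to that point.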

See Table 6 for dataset statistics.

Google SGD

We describe Google SGD in Section 4.1. We describe our method for splitting Google SGD into In-Domain and Out-of-Domain splits in Section 5.1.

For our Out-of-Domain split, we used Home Search, Messaging, Payment, and Rental Cars as 4 holdout domains that did not have analogues in any of the other datasets. Though the "Services" and "Travel" domains do not occur explicitly in the other datasets, we do not include them in our holdout since they include semantically similar information to the "Hospital" and "Attractions" domains of MultiWoz, respectively. For our In-Domain split, we include solely the 16 other domains of the dataset.

As also mentioned in Section 5.1, since Google SGD is a dataset that contains both single-goal and multi-goal conversations, some domains of the In-Domain split are present as goals in multi-goal conversations of the Out-of-Domain split.


MetaLWoZ

MetaLWoZ is a dataset constructed in a Wizard of Oz fashion across 227 tasks and 47 domains. Given a domain and a task, conversing pairs were asked to chat for 10 turns to satisfy the user’s queries.

As this dataset does not include any annotations about API Calls, API Responses, or belief state, we pretrain on this dataset as-is and do not attempt to transform it into the format described in Sec 2. We do, however, split this dataset into separate User and Assistant versions.


MultiDoGo

MultiDoGo is a large task-oriented dataset collected in a Wizard of Oz fashion, using both crowd and expert annotators with annotations at varying levels of granularity. We use only the data available publicly on this dataset’s open-source repository (about 20k dialogues total).


MultiWoz

MultiWoz is a dataset of single and multi-goal human-human conversations collected in a Wizard of Oz fashion. Validation and test sets contain only successful conversations, while the train set includes some that are incomplete. Data in the original dataset is labelled with belief states.


MSR-E2E

MSR-E2E is a dataset of human-human conversations in which one human plays the role of an Agent and the other plays the role of a User. Data was collected on Amazon Mechanical Turk.

Taskmaster 1

Conversations in Taskmaster 1 were collected in one of two ways: spoken Wizard of Oz conversations between humans (transcribed to text) as well as written conversations from a single human in a self-dialog method. Similar to our conversations format, rather than being labelled with intents and dialog acts, conversations of this dataset are labelled with simple API arguments.

Taskmaster 2

Taskmaster 2 is a dataset of entirely spoken two-person dialogues collected in a Wizard of Oz manner, where Assistant utterances were typed by a human and then "spoken" using a text-to-speech service. Dialogues in this dataset include those that are search- and recommendation-oriented, rather than purely task execution.


TicketTalk

TicketTalk (or Taskmaster 3) is a dataset of movie ticket dialogues collected in a self-chat manner. To induce conversational variety, crowd workers were asked to generate conversations given dozens of different instructions at different levels of specificity, some purposefully including conversational errors.

A.3 Hyperparameter Tables

We include hyperparameter tables for each of the models used in this paper. All models were trained using the Adam optimizer. All models were optimized using Token Exact Match (examples with perfect greedy decoding) as the early stopping criterion.
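To make the stopping criterion concrete, the sketch below computes Token Exact Match as the fraction of validation examples whose greedy-decoded token sequence equals the label exactly, and pairs it with a simple patience-based stopping rule. The patience value and both function names are assumptions for illustration; the text does not specify how many validation rounds without improvement were tolerated.

```python
from typing import List, Sequence


def token_exact_match(predictions: Sequence[Sequence[int]],
                      labels: Sequence[Sequence[int]]) -> float:
    """Fraction of examples whose decoded tokens match the label token-for-token."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return 0.0
    matches = sum(1 for pred, gold in zip(predictions, labels) if list(pred) == list(gold))
    return matches / len(labels)


def should_stop(history: List[float], patience: int) -> bool:
    """Stop once the metric has not improved for `patience` validation rounds."""
    if not history:
        return False
    best_round = history.index(max(history))  # first round reaching the best score
    return len(history) - 1 - best_round >= patience


# Toy greedy decodes (token ids) against their labels.
greedy_decodes = [[5, 9, 2], [7, 7], [1, 4, 4, 2]]
gold_labels = [[5, 9, 2], [7, 8], [1, 4, 4, 2]]
score = token_exact_match(greedy_decodes, gold_labels)
print(score)
print(should_stop([0.10, 0.30, 0.20, 0.25], patience=2))
```

Because a single wrong token makes the whole example count as a miss, this metric is stricter than token-level accuracy and rewards fully correct API calls and responses.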

Evaluating Synthetic Dialog Generation and its Metrics

Note that as described in Sec 4, we aim to look for reasonable correlations between metrics and model architectures rather than absolute performance.

LSTM & LSTM with Attention:

(1 GPU per run)

Hyperparameter Swept Values
Learning Rate 1e{-3, -4}
Number of Layers {1, 2, 4}
Embedding size {256, 384}
Hidden size {1024, 2048}
Batch size 64
Embedding Init FastText


(8 GPUs per run)

Hyperparameter Swept Values
Learning Rate 1e{-5, -6}
LR Scheduler {Reduce on Plateau, Invsqrt}
Model Size 124M
Text Truncate 512
Warm-up Updates 100
Batch size 4
Update Frequency 2
Gradient Clip 1


(8 GPUs per run)

Hyperparameter Swept Values
Learning Rate 1e{-5, -6}
LR Scheduler {Reduce on Plateau, Invsqrt}
Model Size 400M
Text Truncate 512
Warm-up Updates 100
Batch size 4
Update Frequency 2
Gradient Clip 1

Bootstrapping Novel Domains

Once we early stopped models on Token Exact Match, we used TSR on validation goals of Google SGD to select the best model out of a given hyperparameter sweep. However, a post hoc analysis suggests that Token Exact Match would have worked approximately as well for the goal of improving JGA.


(8 GPUs per run)

Hyperparameter Swept Values
Learning Rate {1e-4, 5e-5, 1e-5, 5e-6}
Model Size 400M
Batch size 4
Update frequency 8
LR Scheduler Invsqrt
Warm-up updates 1000
Text truncate 512
Label truncate 512
Gradient Clip 0.1
Multitask Weights 1
Validation Steps 100

A.4 Additional Results

We report additional metrics on each of our models, for both offline (static, held-out data) and online (during simulation) settings.

Figure 9: Offline Bootstrapping Results. Results of Bootstrapping on a static (held-out) offline dataset.
Figure 10: Online Bootstrapping Results. Results of Bootstrapping models during simulations.
Figure 11: Offline Active Learning Results. Results of Active Learning on a static (held-out) offline dataset.
Figure 12: Online Active Learning Results. Results of Active Learning model during simulations.

A.5 Holdout API Analysis of Bootstrap Procedure

We take the models generated in Sec 5.1 and evaluate them for JGA over each of the holdout domains, limited to turns that include API calls. Results are shown in Figure 13. We observe that all holdout domains have a JGA value of zero for the Base model; this validates our selection of holdout domains. We also observe that, compared to the other models, Active Learning performs much better across all domains.

Model Find Homes Payment Rental Cars Messaging
Base Model .000 .000 .000 .000
Few-shot only .603 .022 .411 .768
Zero-Shot Sim. .876 .022 .266 .929
Active Learning .882 .870 .623 .946
Figure 13: JGA of individual Holdout Domains (limited to API Call Turns only).
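The per-domain evaluation can be sketched as follows, taking JGA in its usual sense: a turn counts as correct only if the full set of predicted slot values matches the gold set. Turns without a gold API call are skipped, in line with the restriction described above. The turn schema, key names, and example values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def jga_on_api_turns(turns: List[dict]) -> Dict[str, float]:
    """Per-domain joint goal accuracy, restricted to turns with a gold API call."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for turn in turns:
        gold = turn.get("gold_api_call")
        if not gold:  # skip turns that do not include an API call
            continue
        domain = turn["domain"]
        total[domain] += 1
        # Joint match: every slot and value must agree, with no extras.
        if turn.get("predicted_api_call") == gold:
            correct[domain] += 1
    return {domain: correct[domain] / total[domain] for domain in total}


# Hypothetical evaluation turns from two holdout domains.
eval_turns = [
    {"domain": "Payment", "gold_api_call": {"amount": "20", "receiver": "Sam"},
     "predicted_api_call": {"amount": "20", "receiver": "Sam"}},
    {"domain": "Payment", "gold_api_call": {"amount": "35", "receiver": "Ana"},
     "predicted_api_call": {"amount": "35"}},
    {"domain": "Messaging", "gold_api_call": {"contact": "Lee"},
     "predicted_api_call": {"contact": "Lee"}},
    {"domain": "Messaging", "gold_api_call": None, "predicted_api_call": None},
]
per_domain_jga = jga_on_api_turns(eval_turns)
print(per_domain_jga)
```

Restricting the metric to API call turns removes the many turns where an empty prediction is trivially correct, which is consistent with the Base model scoring zero on every holdout domain.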