SOLOIST: Few-shot Task-Oriented Dialog with A Single Pre-trained Auto-regressive Model

by   Baolin Peng, et al.

This paper presents a new method, SOLOIST, which uses transfer learning to efficiently build task-oriented dialog systems at scale. We parameterize a dialog system using a Transformer-based auto-regressive language model, which subsumes different dialog modules (e.g., state tracker, dialog policy, response generator) into a single neural model. We pre-train, on large heterogeneous dialog corpora, a large-scale Transformer model which can generate dialog responses grounded in user goals and real-world knowledge for task completion. The pre-trained model can be efficiently adapted to accomplish a new dialog task with a handful of task-specific dialogs via machine teaching. Our experiments demonstrate that (i) SOLOIST creates new state-of-the-art results on two well-known benchmarks, CamRest and MultiWOZ, (ii) in the few-shot learning setting, the dialog systems developed by SOLOIST significantly outperform those by existing methods, and (iii) the use of machine teaching substantially reduces the labeling cost. We will release our code and pre-trained models for reproducible research.



1 Introduction

The increasing use of personal assistants and messaging applications has spurred interest in building task-oriented dialog systems that can communicate with users through natural language to accomplish a wide range of tasks, such as restaurant booking, weather query, flight booking, IT helpdesk, etc.  The wide variety of tasks and domains has created the need for a flexible task-oriented dialog development platform that can support many different use cases, while remaining straightforward for developers to use and maintain.

A typical task-oriented dialog system uses a modularized pipeline of four modules that execute sequentially, as shown in Figure 1(a). A natural language understanding (NLU) module identifies user intents and extracts associated information such as slots and their values from the user’s input. A dialog state tracker (DST) infers the belief state (or user goal) from the dialog history. The belief state is often used to query a task-specific database (DB) to obtain the DB state, such as the number of entities that match the user goal. The dialog state and DB state are then passed to a dialog policy (POL) to select the next system action. A natural language generation (NLG) module converts the action to a natural language response.

Most popular commercial tools for dialog development employ the modular systems above, including Google’s Dialog Flow, Microsoft’s LUIS and Bot Framework, Facebook’s Wit.ai, Amazon’s Lex, and IBM’s Watson Assistant. They are designed mainly to help develop systems manually, i.e., writing code, crafting rules and templates. Unfortunately, even with these tools, building dialog systems remains a label-intensive, time-consuming task, requiring rich domain knowledge, reasonable coding skill, and expert experience. The cost of building dialog systems at scale (i.e., hundreds of bots for different tasks) can be prohibitively expensive.

Due to the recent advances in neural approaches to conversational AI Gao et al. (2019), researchers are developing data-driven methods and neural models for either individual dialog modules or end-to-end systems. For example, recent attempts such as RASA Bocklisch et al. (2017), ConvLab Lee et al. (2019); Zhu et al. (2020), and Conversation Learner Shukla et al. (2020) have been made to allow the use of data-driven approaches based on machine learning or machine teaching for the development of dialog modules. End-to-end trainable dialog systems have been studied in Wen et al. (2016); Zhao and Eskenazi (2016); Li et al. (2017); Williams et al. (2017); Lei et al. (2018); Gao et al. (2019); Zhang et al. (2019a). Although these methods have achieved promising results, they require large amounts of task-specific labeled data for training, which are rarely available for new tasks.

In this paper we propose a novel task-oriented bot building paradigm, SOLOIST, which significantly eases the workflow of training and deploying dialog systems for new tasks, compared to existing tools and methods. Our approach is inspired by the recent success of applying transfer learning to natural language processing (NLP) tasks through the use of pre-trained models (e.g., BERT Devlin et al. (2019), RoBERTa Liu et al. (2019) and UniLM Dong et al. (2019)): it has been shown that a large-scale language model pre-trained on raw text can be effectively fine-tuned to a wide range of NLP tasks with few in-domain labels. The proposed SOLOIST is based on a similar pre-training-and-fine-tuning framework. We parameterize a dialog system using a Transformer-based auto-regressive language model Vaswani et al. (2017), which subsumes different dialog modules (NLU, DST, POL and NLG) into a single neural model. Task-oriented bot building proceeds in two stages. In the pre-training stage, we initialize the model with GPT-2 and train a large-scale Transformer model on large heterogeneous dialog corpora to generate dialog responses that are grounded in dialog history, user goals and real-world knowledge for task completion. In the fine-tuning stage, we fine-tune the pre-trained model to complete a new task using a handful of task-specific dialogs via machine teaching Zhu (2015); Shukla et al. (2020).

We show through a comprehensive empirical study that SOLOIST is an effective solution to building task-oriented bots at scale. It successfully transfers two capabilities from the pre-trained model to a new task-specific bot: the capability of understanding and generating natural language sentences, learned by GPT-2 on raw text, and the capability of grounding text responses in user goals and real-world knowledge for task completion, learned from out-of-domain dialog corpora.

SOLOIST achieves new state-of-the-art results on two standard benchmarks, lifting the combined score by 10 points. In the few-shot settings, SOLOIST adapts to new domains much more effectively than competing methods, achieving a reasonable success rate using fewer than 50 dialogs. The promising results demonstrate the potential of the new paradigm for developing task-oriented dialog bots. Instead of collecting and labeling data and building a bot per task, we can pre-train a universal, grounded language generation model, and adapt it to new tasks via transfer learning and machine teaching.


2.1 An Auto-regressive Model for Dialog

The modular dialog system in Figure 1 constitutes a data processing procedure that naturally produces a sequence, by concatenating the input and output of each module along the generation process. Each consecutive pair in this sequence plays the role of annotated data for the corresponding module. Ideally, when the entire sequence is available, the data generation process of a dialog system (NLU, DST, POL, NLG) can be formulated as a single auto-regressive model, and the full sequence can be learned in a self-supervised manner.

GPT-2 Radford et al. (2019) is one of the best-known auto-regressive language models. GPT-2 learns the language granularity from large amounts of open Web text data, and after being fine-tuned on conversational data Zhang et al. (2019b) can respond to users with realistic and coherent continuations about a topic of their choosing. However, the generated responses are not useful for completing any specific task due to the lack of grounding. SOLOIST inherits GPT-2’s capability of producing human-like responses. But, unlike GPT-2, SOLOIST learns to ground the generation process in user goals and real-world knowledge so that the generated responses are useful for completing tasks. Note that SOLOIST is a general framework for grounded language generation, where prescribed control codes in wildly different but related domains are used to pre-train a generic guided language model, which can quickly adapt to a new domain through fine-tuning with a few task-specific examples. Specifically for task-oriented dialog systems, we pre-train on conversational data with grounding information, i.e., belief states and DB states. More specifically, each dialog turn in our training data is represented as:


$$x = (s, b, c, r) \tag{1}$$

where $s$ is the entire dialog history up to the current dialog turn, $b$ is the belief state acquired from human annotation, $c$ is the DB state automatically retrieved from a database using $b$, and $r$ is the delexicalized dialog response, from which the system response in natural language can be easily obtained with some automatic post-processing. Each item in $x$ is by itself a sequence of tokens, as illustrated by examples in Figure 1(b). Thus, it is natural to treat the concatenation of them as a long sequence for model training as shown in Figure 1(c). We pre-train the model using heterogeneous dialog corpora with labels of belief states and DB states, which are publicly available. The pre-trained model can be fine-tuned to any new task to generate responses grounded in task-specific user goals and database.

2.2 Task-Grounded Pre-training

Given training data of $N$ samples $\mathcal{D} = \{x_n\}_{n=1}^{N}$, our goal is to build a statistical model parameterized by $\theta$ to characterize $p_\theta(x)$. We consider a multi-task objective for learning $\theta$, where each task is a self-supervised learning task.

To leverage the sequential structure of a task-oriented dialog system, the joint probability $p(x)$ over the dialog history $s$, belief state $b$, DB state $c$ and response $r$ can be factorized in the auto-regressive manner as:

$$\begin{aligned} p(x) &= p(r, c, b, s) && (2) \\ &= p(r \mid c, b, s)\, p(b \mid s)\, p(s) && (3) \end{aligned}$$

where the factorization from (2) to (3) is based on the fact that $p(c \mid b, s) = p(c \mid b) = 1$, because the DB state $c$ is obtained using a deterministic database-lookup process given a belief state $b$ (e.g., via an API call). Note that (3) decomposes the joint distribution modeling problem into two sub-problems: belief state prediction $p(b \mid s)$ and grounded response generation $p(r \mid c, b, s)$. Since $b$ and $r$ are sequences as well, we may further factorize them in the left-to-right auto-regressive manner, respectively.
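The step that drops the DB-state factor can be spelled out with the chain rule. The block below is a short worked derivation; the symbols ($s$ for dialog history, $b$ for belief state, $c$ for DB state, $r$ for response) are notation assumed here for illustration.

```latex
\begin{aligned}
p(r, c, b, s)
  &= p(r \mid c, b, s)\, p(c \mid b, s)\, p(b \mid s)\, p(s)
     && \text{(chain rule)} \\
  &= p(r \mid c, b, s)\, p(b \mid s)\, p(s)
     && \text{(deterministic lookup: } p(c \mid b, s) = p(c \mid b) = 1\text{)}
\end{aligned}
```

Because $p(s)$ does not depend on the model's predictions for the current turn, only the belief-prediction and response-generation factors need to be learned.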

Task 1: Belief Prediction.

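For the belief state sequence $b$ of length $T_b$, given dialog history $s$, we define the objective of predicting the belief tokens as:

$$\mathcal{L}_B = \log p(b \mid s) = \sum_{t=1}^{T_b} \log p_\theta(b_t \mid b_{<t}, s) \tag{4}$$

where $b_{<t}$ indicates all tokens before $t$.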

Task 2: Grounded Response Generation.

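To generate a response of length $T_r$, $r = [r_1, \dots, r_{T_r}]$, our model generates every token $r_t$ grounded in the previous word tokens and the task-oriented information $[c, b, s]$ (DB state, belief state and dialog history):

$$\mathcal{L}_R = \log p(r \mid c, b, s) = \sum_{t=1}^{T_r} \log p_\theta(r_t \mid r_{<t}, c, b, s) \tag{5}$$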
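The belief-prediction and response-generation objectives are both sums of token log-probabilities restricted to one sub-sequence of the concatenated turn. The following is a minimal sketch of that masking idea in plain Python; the per-token probabilities and the mask layout are made-up illustrative values, not outputs of the actual model.

```python
import math

def span_log_likelihood(token_log_probs, span_mask):
    """Sum the log-probabilities at positions where span_mask is 1."""
    return sum(lp for lp, m in zip(token_log_probs, span_mask) if m)

# Hypothetical per-token probabilities for a sequence laid out as
# [history, history, belief, belief, db, response, response].
token_log_probs = [math.log(p) for p in [0.9, 0.8, 0.5, 0.25, 1.0, 0.5, 0.5]]
belief_mask = [0, 0, 1, 1, 0, 0, 0]
response_mask = [0, 0, 0, 0, 0, 1, 1]

# Log-likelihood of the belief span and of the response span.
loss_belief = span_log_likelihood(token_log_probs, belief_mask)
loss_response = span_log_likelihood(token_log_probs, response_mask)
```

Training would maximize these two quantities (or minimize their negatives); history and DB tokens only serve as conditioning context in this sketch.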


Task 3: Contrastive Objective.

We consider a contrastive objective to promote the matched items (positive samples $x$), while driving down mismatches (negative samples $x'$). Specifically, we sample a set of negative samples $x'$ from sequence $x$ by replacing, with probability 50%, some items in $x$ with different items randomly sampled from the dataset $\mathcal{D}$. Since the special token [EOS] attends to all tokens in the sequence, the output feature on [EOS] is the fused representation of all items. We apply a binary classifier on top of this feature to predict whether the items of the sequence are matched ($y = 1$) or mismatched ($y = 0$):

$$\mathcal{L}_C = y \log p_\theta(x) + (1 - y) \log\bigl(1 - p_\theta(x')\bigr) \tag{6}$$

We consider three types of negative samples $x'$, each of which is chosen with probability 1/3: (i) negative belief, where only the belief item is replaced; (ii) negative response, where only the response item is replaced; and (iii) negative belief + response, where both the belief and response items are replaced.
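The negative-sampling scheme described above can be sketched as follows. This is an illustrative reading, not the authors' code: the dictionary layout of a turn, the function name, and the choice of drawing the replacement from a single other turn are all assumptions.

```python
import random

def make_contrastive_example(turn, dataset, rng):
    """Return (items, label): label 1 = matched, 0 = mismatched.

    `turn` and each element of `dataset` are dicts with the keys
    'history', 'belief', 'db', 'response' (a hypothetical layout).
    """
    if rng.random() < 0.5:
        return dict(turn), 1  # keep the positive sample unchanged
    # Draw replacement items from a different turn in the dataset.
    other = rng.choice([d for d in dataset if d is not turn])
    # Each of the three negative types is chosen with probability 1/3.
    kind = rng.choice(["belief", "response", "belief+response"])
    negative = dict(turn)
    if "belief" in kind:
        negative["belief"] = other["belief"]
    if "response" in kind:
        negative["response"] = other["response"]
    return negative, 0

dataset = [
    {"history": "h1", "belief": "b1", "db": "c1", "response": "r1"},
    {"history": "h2", "belief": "b2", "db": "c2", "response": "r2"},
]
rng = random.Random(0)
examples = [make_contrastive_example(dataset[0], dataset, rng) for _ in range(200)]
```

A fuller version would also check that the sampled replacement actually differs from the original item, since two turns in a real corpus can share the same belief state.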

Full Pre-training Objective.

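Learning is performed via maximizing the log-likelihood (MLE) over the entire training dataset $\mathcal{D}$, using a joint objective that combines (4), (5) and (6):

$$\mathcal{L}_\theta(\mathcal{D}) = \sum_{n=1}^{|\mathcal{D}|} \bigl( \mathcal{L}_B(x_n) + \mathcal{L}_R(x_n) + \mathcal{L}_C(x_n) \bigr) \tag{7}$$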


In Figure 1(c), we illustrate the model architecture and learning objectives. The model is fully auto-regressive in a left-to-right manner, each objective appearing on a given sub-sequence and a special segment token.

Implementation Details

We process each dialog turn in training data into a sequence of tokens. For instance, the processed sequence of the examples shown in Figure 1(b) is as follows, where different items are rendered in different colors. User : I would like to find an expensive restaurant that serves Chinese food. System : sure, which area do you prefer ? User : How about in the north part of town . Belief State : Restaurant { pricerange = expensive, food = Chinese, area = north } DB : Restaurant 1 match The [restaurant_name] is a great [value_food] restaurant. Would you like to book a table there ?
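To make the flattening step concrete, here is a minimal sketch that rebuilds the example sequence above from structured inputs. The function name and argument layout are hypothetical, and the real implementation may use dedicated special tokens as delimiters rather than the plain-text markers shown in the example.

```python
def serialize_turn(history, belief, db_state, response):
    """Flatten one dialog turn into a single training string.

    history:  list of (speaker, utterance) pairs
    belief:   dict mapping domain -> {slot: value}
    db_state: short string describing the database lookup result
    response: delexicalized system response
    """
    history_text = " ".join(f"{speaker} : {utt}" for speaker, utt in history)
    belief_text = " ".join(
        f"{domain} {{ " + ", ".join(f"{s} = {v}" for s, v in slots.items()) + " }"
        for domain, slots in belief.items()
    )
    return f"{history_text} Belief State : {belief_text} DB : {db_state} {response}"

history = [
    ("User", "I would like to find an expensive restaurant that serves Chinese food."),
    ("System", "sure, which area do you prefer ?"),
    ("User", "How about in the north part of town ."),
]
belief = {"Restaurant": {"pricerange": "expensive", "food": "Chinese", "area": "north"}}
sequence = serialize_turn(
    history,
    belief,
    "Restaurant 1 match",
    "The [restaurant_name] is a great [value_food] restaurant. Would you like to book a table there ?",
)
```

The resulting string would then be tokenized and fed to the language model as one sequence.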

| Name | #Dialog | #Utterance | Avg. Turn | #Domain |
|---|---|---|---|---|
| MetaLWOZ | 37,884 | 432,036 | 11.4 | 47 |
| Schema | 22,825 | 463,284 | 20.3 | 17 |
| Taskmaster | 13,215 | 303,066 | 22.9 | 6 |
| MultiWOZ | 10,420 | 71,410 | 6.9 | 7 |
| MSR-E2E | 10,087 | 74,686 | 7.4 | 3 |
| SMD | 3,031 | 15,928 | 5.3 | 3 |
| Frames | 1,369 | 19,986 | 14.6 | 3 |
| WOZ | 1,200 | 5,012 | 4.2 | 1 |
| CamRest676 | 676 | 2,744 | 4.1 | 1 |

Table 1: Statistics of dialog corpora Wu et al. (2020a)
Figure 2: Illustration of the machine teaching process using Conversation Learner. The human-bot conversation log in (a) can be edited via correcting its belief state in (b), and selecting/inserting a more appropriate response in (c).

This sequence can be directly fed into an auto-regressive model for training, as shown in Figure 1(c). Our implementation is based on Huggingface Pytorch Transformer Wolf et al. (2019a). We use GPT-2 with 117M parameters as the initial model for pre-training, and byte pair encodings (Sennrich et al., 2015) for the tokenization. We pre-train the grounded response generation model on dialog corpora Kim et al. (2019); Rastogi et al. (2019); Byrne et al. (2019); Eric and Manning (2017); Mrkšić et al. (2016); Asri et al. (2017) as shown in Table 1. (Footnote: The results reported in this paper are based on pre-trained models on Schema dataset. We are pre-training a larger model using all the corpora.) We use Adam Kingma and Ba (2014) with weight decay to pre-train the model for 100k steps.

2.3 Few-shot Fine-tuning

When deploying SOLOIST to a new task, we collect task-specific data $x$ in the same format as in the pre-training stage, i.e., as in (1). When annotated log data is available, the conventional fine-tuning procedure is used: we apply the same multi-task objective (7) to update $\theta$, adapting the model to complete the new task using labeled task-specific dialogs.

In real applications, annotated log data is often unavailable, or noisy/incomplete beforehand. One may deploy the model, and acquire high-quality task-specific labels (e.g., belief state and system response) for each dialog turn using machine teaching (Simard et al., 2017; Zhu, 2015; Williams and Liden, 2017; Shukla et al., 2020). Machine teaching is an active learning paradigm that focuses on leveraging the knowledge and expertise of domain experts as “teachers”. This paradigm puts a strong emphasis on tools and techniques that enable teachers, particularly non-data scientists and non-machine-learning experts, to visualize data, find potential problems, and provide corrections or additional training inputs in order to improve the system’s performance.

We carry out fine-tuning using Conversation Learner Shukla et al. (2020), a machine teaching tool, in the following steps: (i) Dialog authors deploy the pre-trained model for a specific task. (ii) Users (or human subjects recruited for system fine-tuning) interact with the system and generate human-bot dialog logs. (iii) Dialog authors revise a dozen or so training samples by selecting representative failed dialogs from the logs and correcting their belief states and/or responses so that the system can complete these dialogs successfully. We illustrate the dialog editing process using Conversation Learner in Figure 2. Readers may refer to Shukla et al. (2020) for details. The corrected task-specific dialog turns are used to fine-tune the model. It is shown that machine teaching is a more effective approach to improving deployed dialog systems by providing on-the-spot corrections.

Implementation Details

Instead of using machine teaching from scratch, we assume that a few task-specific data are available for fine-tuning. Details are presented in Sec. 3.3. Training examples are truncated to ensure max length 512. The model is trained with a mini-batch of 6 on 8 Nvidia V100 until observing no significant progress on validation loss or up to 10 epochs. Nucleus sampling Holtzman et al. (2019) is used for decoding, where the sampling top-p ranges from 0.2 to 0.5 for all our models. The best setup of hyper-parameters is selected through grid-search on the validation set.
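For readers unfamiliar with nucleus (top-p) sampling, the sketch below shows the core filtering step on a toy distribution using only the Python standard library. It illustrates the general technique rather than the decoding code used in the experiments; the function names and the toy probabilities are assumptions.

```python
import random

def nucleus_filter(probs, top_p):
    """Keep the smallest set of most-probable tokens whose mass reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalize the surviving probabilities so they sum to 1.
    return {i: probs[i] / mass for i in kept}

def sample_token(probs, top_p, rng):
    """Sample one token id from the renormalized nucleus."""
    filtered = nucleus_filter(probs, top_p)
    ids = list(filtered)
    return rng.choices(ids, weights=[filtered[i] for i in ids], k=1)[0]

probs = [0.5, 0.25, 0.125, 0.125]  # toy next-token distribution
narrow = nucleus_filter(probs, top_p=0.5)
wider = nucleus_filter(probs, top_p=0.6)
token = sample_token(probs, top_p=0.5, rng=random.Random(0))
```

A small top-p, such as the 0.2 to 0.5 range mentioned above, keeps decoding close to greedy, which tends to suit task-oriented responses where factual slots matter more than diversity.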

| Model | Inform | Success | BLEU | Combined |
|---|---|---|---|---|
| Sequicity (Lei et al., 2018) | 92.30 | 85.30 | 21.40 | 110.20 |
| Sequicity (w/o RL) | 94.00 | 83.40 | 23.40 | 112.10 |
| GPT-2 finetuned | - | 86.20 | 19.20 | - |
| ARDM (Wu et al., 2019) | - | 87.10 | 25.20 | - |
| SOLOIST | 94.70 | 87.10 | 25.50 | 116.40 |

Table 2: End-to-End Evaluation on CamRest676.
| Model | Inform | Success | BLEU | Combined |
|---|---|---|---|---|
| Baseline Budzianowski et al. (2018) | 71.29 | 60.94 | 18.80 | 84.93 |
| TokenMoE (Pei et al., 2019) | 75.30 | 59.70 | 16.81 | 84.31 |
| GPT-2 (Budzianowski and Vulić, 2019) | 70.96 | 61.36 | 19.05 | 85.21 |
| HDSA (Chen et al., 2019) | 82.90 | 68.90 | 23.60 | 99.50 |
| Structured Fusion (Mehri et al., 2019) | 82.70 | 72.10 | 16.34 | 93.74 |
| LaRL (Zhao et al., 2019) | 82.80 | 79.20 | 12.80 | 93.80 |
| ARDM (Wu et al., 2019) | 87.40 | 72.80 | 20.60 | 100.70 |
| DAMD (Zhang et al., 2019a) | 89.20 | 77.90 | 18.60 | 102.15 |
| SOLOIST | 89.60 | 79.30 | 18.03 | 102.49 |

Table 3: Context-to-response evaluation on MultiWOZ.
| Model | Inform | Success | BLEU | Combined |
|---|---|---|---|---|
| Sequicity (Lei et al., 2018) | 66.41 | 45.32 | 15.54 | 71.41 |
| HRED-TS (Peng et al., 2019) | 70.00 | 58.00 | 17.50 | 81.50 |
| Structured Fusion (Mehri et al., 2019) | 73.80 | 58.60 | 16.90 | 83.10 |
| DSTC8 Track 1 Winner* (Ham et al., 2020) | 73.00 | 62.40 | 16.00 | 83.50 |
| DAMD (Zhang et al., 2019a) | 76.40 | 60.40 | 16.60 | 85.00 |
| SOLOIST | 85.50 | 72.90 | 16.54 | 95.74 |

* The result of DSTC8 Track 1 Winner is produced by adapting their code to our current setting.

Table 4: End-to-end evaluation on MultiWOZ.

3 Experiments

In this section, we evaluate the proposed SOLOIST to answer three research questions. Q1: How does SOLOIST perform on standard benchmarks compared to SoTA? Q2: Does SOLOIST meet the goal of effectively generalizing to new domains in the few-shot learning setting? Q3: Is machine teaching a more efficient approach to fine-tuning when applied? Note that we employed the conventional fine-tuning scheme without machine teaching for fair comparison when studying Q1 and Q2.

3.1 Experimental Setup

Datasets for Fine-tuning

We validate the proposed SOLOIST on two public datasets. CamRest676 is a single-domain task-oriented dialog corpus collected by Wen et al. (2016). It contains 408/136/136 dialogs for training/validation/testing, respectively. Following Lei et al. (2018), we delexicalize each token that occurs in the ontology, replacing it with its slot name, such as restaurant name, phone number, and postcode. The MultiWOZ dataset Budzianowski et al. (2018) is a large-scale human-human multi-domain task-oriented dialog dataset. It contains 8438/1000/1000 dialogs for training/validation/testing, respectively. Each dialog session in the corpus contains 1 to 3 domains, including Attraction, Hotel, Hospital, Police, Restaurant, Train, and Taxi. MultiWOZ is inherently challenging due to its multi-domain setting and diverse language styles.
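Delexicalization as described above can be sketched with a simple ontology lookup. The function, the ontology layout, and the example values below are hypothetical and only illustrate the idea; the actual preprocessing follows Lei et al. (2018) and may differ in its matching rules.

```python
import re

def delexicalize(utterance, ontology):
    """Replace ontology values in `utterance` with [slot] placeholders.

    `ontology` maps slot name -> list of surface values (hypothetical layout).
    Longer values are replaced first so shorter ones cannot split them.
    """
    pairs = [(value, slot) for slot, values in ontology.items() for value in values]
    for value, slot in sorted(pairs, key=lambda p: len(p[0]), reverse=True):
        # Match the value only when it is not embedded in a longer word.
        pattern = r"(?<!\w)" + re.escape(value) + r"(?!\w)"
        utterance = re.sub(pattern, f"[{slot}]", utterance, flags=re.IGNORECASE)
    return utterance

# Made-up ontology entries and utterance, for illustration only.
ontology = {
    "restaurant_name": ["golden wok"],
    "phone": ["01223 350688"],
    "postcode": ["cb43hl"],
}
text = "Golden Wok is at postcode cb43hl, and the phone number is 01223 350688."
delexicalized = delexicalize(text, ontology)
```

At inference time the placeholders are filled back in from the database result, which is the "automatic post-processing" step that turns a delexicalized response into natural language.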

Automatic Evaluation Metrics.

Following Budzianowski et al. (2018), Inform, Success, and BLEU scores are reported. Inform measures whether the system provides a correct entity (inform rate). Success measures whether the system answers all the requested information (success rate) and whether the answered information matches the user’s goal. BLEU evaluates how natural the generated utterance is compared with human-written references. A combined score (Combined) is also reported, computed as Combined = (Inform + Success) × 0.5 + BLEU, as an overall quality measure, as suggested in Budzianowski et al. (2018).
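As a quick check of how the overall measure is formed, the snippet below computes the combined score as the mean of Inform and Success plus BLEU, which reproduces the Combined column of the result tables. The function name is chosen here for illustration; the input values are the Sequicity row of Table 2.

```python
def combined_score(inform, success, bleu):
    """Overall quality measure: mean of Inform and Success, plus BLEU."""
    return (inform + success) * 0.5 + bleu

# Sequicity (Lei et al., 2018) on CamRest676, values taken from Table 2.
score = combined_score(inform=92.30, success=85.30, bleu=21.40)
```

Up to floating-point rounding, this gives the 110.20 reported for that row.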


We compare SOLOIST with several strong baseline methods, which hold state-of-the-art results on the CamRest676 or MultiWOZ datasets: (i) Multi-Action Data Augmentation (DAMD) Zhang et al. (2019a) is a state-of-the-art modular system, where each dialog module is implemented using a neural network, and the whole system is trained in an end-to-end manner. (ii) Sequicity (Lei et al., 2018) is similar to DAMD except that it does not use multi-action data augmentation. (iii) A GPT-2 model fine-tuned on dialog data; the model is not grounded, and needs to work with a separate dialog state tracker for task completion. (iv) ARDM (Wu et al., 2019) utilizes GPT-2 as the pre-trained model to learn to generate role-aware responses given dialog context; the model has to work with a separate dialog state tracker for task completion. (v) HDSA Chen et al. (2019) is a modular dialog system which generates responses using a BERT-based dialog policy and graph-structured dialog act representations.

3.2 Comparing to SOTA systems


Table 2 shows the results using generated belief states on CamRest676. The annotations utilized by the models are also listed. SOLOIST achieves the best scores over all the metrics. ARDM performs similarly to SOLOIST in terms of Success and BLEU score. However, ARDM cannot track dialog states and requires a separately trained state tracker to accomplish tasks. GPT-2 fine-tuned with task-specific data works reasonably well but lags behind SOLOIST by a large margin. Sequicity (Lei et al., 2018) is a jointly trained model with belief state and policy annotation, and under-performs SOLOIST. This result suggests that in simple tasks like CamRest676, SOLOIST is able to achieve user goals with only belief state annotations and maintains good fluency due to the benefit from task-grounded pre-training.

MultiWOZ Context-to-Response

We first consider the context-to-response generation task (Wen et al., 2016), where the ground truth belief states and database search results are given, based on which responses are predicted. The results are shown in Table 3. The proposed SOLOIST achieves the best performance in terms of Inform and Success scores but performs slightly worse in terms of BLEU score. The overall combined score of SOLOIST is comparable with the current SoTA method DAMD Zhang et al. (2019a). However, DAMD leverages the labels of dialog acts on both the user and system sides, which demands significantly higher labeling effort than SOLOIST. HDSA achieves the best number on BLEU. Compared to HDSA, SOLOIST is much simpler and performs better in terms of combined score. SOLOIST also performs better than ARDM on combined score. It is worth mentioning that ARDM does not consider dialog state tracking and thus requires an extra dialog state tracker to accomplish a certain task. These results reveal that SOLOIST is able to learn dialog policy accurately and generate natural language responses in the multi-domain scenario.

MultiWOZ End-to-End

We now consider a more pragmatic evaluation setting of studying a model’s end-to-end learning performance, where the generated belief states are used for database search and response generation. The results are shown in Table 4. SOLOIST achieves the best performance in terms of inform and success rates, and combined score, lifting the previous SOTA by DAMD by a significant margin (e.g., about 10 points improvement on the combined score). Our method also outperforms the method of Ham et al. (2020), where GPT-2 is fine-tuned and applied to end-to-end dialog. Compared with the classical modular dialog systems or the jointly trained model DAMD, it is worth noting that SOLOIST has a much simpler architecture and requires much lower labeling effort. For example, SOLOIST requires only the belief states, while DAMD requires additional annotations for task definition (i.e., defining the intents, slots, and the corresponding value ranges) and dialog acts.

| Domain | Attra. | Train | Hotel | Rest. | CamRest |
|---|---|---|---|---|---|
| #Train | 50 | 50 | 50 | 50 | 20 |
| #Valid | 50 | 50 | 50 | 50 | 136 |
| #Test | 100 | 200 | 200 | 200 | 136 |

Table 5: Data statistics for domains used in few-shot evaluation. Attra. denotes attraction domain and Rest. means restaurant domain.
| Model | Inform | Success | BLEU |
|---|---|---|---|
| Sequicity (Lei et al., 2018) | 60.61 | 66.11 | 11.15 |
| SOLOIST w/o pre-training | 73.88 | 72.22 | 13.11 |
| SOLOIST | 85.82 | 84.22 | 19.18 |

Table 6: End-to-end evaluation on CamRest in a few-shot learning setting.
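| Model | Attraction Inform | Attraction Success | Attraction BLEU | Train Inform | Train Success | Train BLEU | Hotel Inform | Hotel Success | Hotel BLEU | Restaurant Inform | Restaurant Success | Restaurant BLEU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DAMD (Zhang et al., 2019a) | 70.00 | 15.00 | 6.90 | 75.00 | 39.50 | 6.20 | 62.50 | 20.50 | 7.60 | 68.00 | 19.50 | 10.50 |
| SOLOIST w/o pre-training | 65.66 | 46.97 | 5.85 | 59.00 | 44.00 | 7.07 | 62.50 | 40.00 | 7.70 | 75.50 | 44.50 | 11.00 |
| SOLOIST | 86.00 | 65.00 | 12.90 | 80.81 | 64.65 | 9.96 | 74.50 | 43.50 | 8.12 | 81.00 | 55.50 | 12.80 |

Table 7: End-to-end evaluation on MultiWOZ in a few-shot learning setting.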
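| Model | 1% Inform | 1% Success | 1% BLEU | 5% Inform | 5% Success | 5% BLEU | 10% Inform | 10% Success | 10% BLEU | 20% Inform | 20% Success | 20% BLEU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DAMD (Zhang et al., 2019a) | 34.40 | 9.10 | 8.10 | 52.50 | 31.80 | 11.60 | 55.30 | 30.30 | 13.00 | 62.60 | 44.10 | 14.90 |
| SOLOIST w/o pre-training | 46.10 | 24.40 | 10.39 | 63.40 | 38.70 | 11.19 | 64.90 | 44.50 | 13.57 | 70.10 | 52.20 | 14.72 |
| SOLOIST | 58.40 | 35.30 | 10.58 | 69.30 | 52.30 | 11.80 | 69.90 | 51.90 | 14.60 | 74.00 | 60.10 | 15.24 |

Table 8: End-to-end Evaluation on MultiWOZ with varying sizes of training data.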

3.3 Few-shot Evaluation

It is desirable for dialog systems to generalize effectively from a few training examples. We argue that the few-shot learning setting is a more realistic scenario for dialog modeling. Unfortunately, existing corpora typically contain hundreds to thousands of dialogs depending on the complexity of dialog tasks. As such, we re-organize CamRest676 and MultiWOZ to simulate the few-shot learning setting for end-to-end dialog modeling. (We will release the re-organized datasets.) We sample from MultiWOZ the dialogs that contain only one domain. Attraction, Train, Hotel, and Restaurant domains are used. We ignore Police, Taxi, and Hospital, as these tasks do not require explicit state tracking to accomplish the task. For each task (or domain), we randomly sample 50 dialog sessions each for training and validation and 200 dialog sessions for testing, except the Attraction domain, which only has 100 sessions for testing. For CamRest, we randomly sample only 20 training sessions from the original CamRest676 since this dataset is relatively small. Details are shown in Table 5.

Tables 6 and 7 report the end-to-end performance in the few-shot learning settings on CamRest and MultiWOZ, respectively. In all the tasks, SOLOIST shows substantially better performance on all the metrics. Removing pre-training on dialog corpora degrades the performance of SOLOIST, but the model still consistently outperforms DAMD in all the domains. Without pre-training, SOLOIST is conceptually similar to Ham et al. (2020), but is architecturally simpler and needs fewer annotations. This also verifies the importance of grounded pre-training on annotated dialog corpora, allowing SOLOIST to learn how to track dialog and database states to accomplish a task.

We conduct experiments to fine-tune SOLOIST by varying percentages of training data ranging from 1% (80 examples) to 20% (1600 examples) on the MultiWOZ dataset. As shown in Table 8, SOLOIST consistently outperforms DAMD for a wide range of dataset sizes, and the improvement is more substantial when smaller numbers of in-domain labels are used for fine-tuning.

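| Model | Inform | Success | BLEU | Combined |
|---|---|---|---|---|
| SOLOIST | 72.09 | 44.19 | 9.30 | 67.44 |
| + Extra | 79.07 | 45.35 | 10.00 | 72.01 |
| + Teach | 77.91 | 58.14 | 12.00 | 79.67 |

Table 9: Machine teaching results.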

3.4 Machine Teaching Results

We leverage the user interface (UI) of Conversation Learner (Shukla et al., 2020) for dialog authors (human teachers) to correct wrong or inadequate responses, and evaluate the potential of SOLOIST being continually improved after deployment. Table 9 shows the results. We first use 20 dialog sessions to perform the few-shot fine-tuning step of SOLOIST. Its evaluation result is listed in the first row of Table 9. We then deploy it to interact with users. The row of + Teach shows the result of machine teaching, where a human teacher manually corrects 5 failed dialog sessions, which are then used to continually fine-tune the deployed system. We observe that + Teach improves the combined score by 10% compared to that without human teaching. The results demonstrate the effectiveness of our two-step fine-tuning scheme to deploy SOLOIST for a new task. + Extra is used as an ablation baseline, where 5 randomly selected dialog sessions are added as extra examples to fine-tune the model. It shows lower performance than machine teaching. Assume that one slot-value pair of belief state correction counts as one edit and a response correction counts as ten edits. The total numbers of edits for + Teach and + Extra are 61 and 396, respectively, suggesting that machine teaching reduces the labeling cost by 6x.
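The edit-counting convention above (one edit per corrected slot-value pair, ten per corrected response) is easy to write down explicitly. The helper below is a hypothetical illustration of that weighting; only the two totals, 61 and 396, come from the text, and the per-turn counts in the usage line are invented.

```python
def edit_cost(slot_value_edits, response_edits, response_weight=10):
    """Total labeling cost: 1 per slot-value correction, 10 per response correction."""
    return slot_value_edits + response_weight * response_edits

# Invented example: a turn with 3 corrected slot-value pairs and 1 corrected response.
example_cost = edit_cost(slot_value_edits=3, response_edits=1)

# Totals reported in the text for + Teach and + Extra.
teach_total, extra_total = 61, 396
cost_ratio = extra_total / teach_total
```

The ratio of the two reported totals is about 6.5, in line with the roughly 6x reduction stated above.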

3.5 Ablation Analysis

To study the effect of different schemes for constructing negative samples in the pre-training objectives, we conduct an ablation analysis on MultiWOZ. The results are reported in Table 10. We see that removing the belief-only negative samples substantially degrades the inform and success scores. We further remove the response-only negative samples, and use only the mismatched belief state and response pairs to construct negative samples. This decreases the success rate and BLEU score. The results show the importance of constructing various types of negative samples in designing an effective contrastive objective.

Interactive Example.

Figure 3 depicts an interactive example using SOLOIST as the agent to communicate with a user, who wants to first get the phone number and address of an attraction in the center of town, then book a table for two people on Thursday at 18:00 at restaurant la raza. The belief state at the final turn is shown in the last row. We see that SOLOIST accurately tracks dialog states, converses appropriately with the user, and successfully accomplishes the task. For example, at the 13th turn, when the user asks again for information about the attraction domain, SOLOIST switches to the attraction domain smoothly and responds appropriately.

| Model | Inform | Success | BLEU | Combined |
|---|---|---|---|---|
| Full objective | 85.50 | 72.90 | 16.54 | 95.74 |
| - w/o belief | 81.50 | 69.30 | 16.82 | 92.22 |
| - w/o belief & response | 82.50 | 67.30 | 16.28 | 91.18 |

Table 10: Ablation study on different negative samples in the contrastive objective on MultiWOZ in the end-to-end evaluation setup. The 2nd and 3rd rows indicate removing individual belief only and individual belief & response, respectively.
Figure 3: An interactive example.

4 Related Work

Pre-trained Language Models.

Recent advances in self-supervised learning have led to a surge of large-scale pre-trained language models Devlin et al. (2019); Radford et al. (2019), which have achieved state-of-the-art performance on a variety of language understanding and generation tasks. The closest line of research to ours is GPT-2 Radford et al. (2019) and its variants that ground language generation on prescribed control codes, such as CTRL Keskar et al. (2019) and Grover Zellers et al. (2019), or on latent variables, such as Optimus Li et al. (2020).

Specifically in dialog domains, several recent works have adapted pre-trained models to task-oriented and chit-chat dialog systems. For chit-chat dialog systems, DialoGPT Zhang et al. (2019b); Wolf et al. (2019b) and CGRG Wu et al. (2020b) extended GPT-2 Radford et al. (2019) to dialog response generation settings. Plato Bao et al. (2019) is a pre-trained discrete latent variable model for response generation. Meena Adiwardana et al. (2020) and BST Roller et al. (2020) pre-train extremely large models and have demonstrated impressive results on social chit-chat conversation. For task-oriented dialogs, BERT-ToD Wu et al. (2020a) adapts the pre-trained BERT Devlin et al. (2019) to achieve superior performance on four dialog subtasks. SC-GPT Peng et al. (2020) is a pre-trained model for the NLG module that converts a dialog act into a response in natural language. The proposed SOLOIST generalizes the idea to the entire dialog pipeline.

End-to-end Trainable Dialog Systems.

End-to-end trainable networks for dialog systems have been studied in Wen et al. (2016); Lei et al. (2018). Though these methods have achieved promising results, they were usually designed for a specific domain, which makes it difficult to generalize them to multi-domain settings such as the recent MultiWOZ dataset (Budzianowski et al., 2018) and ConvLab Lee et al. (2019). To tackle this, several models were proposed to handle multi-domain dialog response generation Pei et al. (2019); Budzianowski and Vulić (2019); Mehri et al. (2019); Zhao et al. (2019); Wu et al. (2019); Zhang et al. (2019a). However, these works need a significant number of in-domain training examples to achieve good performance, and therefore face challenges in few-shot learning settings. In contrast, SOLOIST can easily generalize to multiple new domains with a few labelled examples.

To the best of our knowledge, the work most related to ours is Ham et al. (2020), which was the first attempt to fine-tune GPT-2 on a new task-oriented dialog task. (Footnote 10: We are aware of a concurrent work by Hosseini-Asl et al. (2020) following this line of research.) However, our work differs from Ham et al. (2020) in two major aspects. (i) We first pre-train our model on a large number of out-of-domain task-oriented dialog turns to endow the model with task-grounded language generation ability, and then fine-tune it on new domains. In contrast, Ham et al. (2020) directly fine-tuned GPT-2 on new domains, which yields inferior performance to SOLOIST. (ii) The model in Ham et al. (2020) requires more expensive annotation and is not truly end-to-end trainable. It needs heuristic rules to handle different database search conditions. Further, it formulates POL and NLG separately, which requires annotations of dialog acts. In contrast, our model requires annotation of only the belief state, and thus has a lower annotation cost than existing methods. It is fully trainable thanks to its simplified but effective input representations.
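To make the annotation argument concrete, the following is a minimal sketch of how a single dialog turn could be flattened into one training sequence for an auto-regressive model when only the belief state is annotated. The field layout, the ` => ` separators, and the names `Turn` and `serialize_turn` are assumptions made for illustration; they are not the exact input format used by SOLOIST.

```python
# Minimal sketch (hypothetical format): one dialog turn as a single text sequence.
# Only the belief state needs human annotation; no dialog-act labels appear.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Turn:
    history: List[str]       # utterances so far, each prefixed with its speaker
    belief: Dict[str, str]   # annotated belief state: slot -> value
    db_matches: int          # number of database entries matching the belief state
    response: str            # delexicalized system response


def serialize_turn(turn: Turn) -> str:
    """Concatenate history, belief state, DB result and response into one string."""
    history = " ".join(turn.history)
    # Sort slots so that the same belief state always yields the same string.
    belief = " ; ".join(f"{slot} = {value}" for slot, value in sorted(turn.belief.items()))
    return (
        f"{history} => belief : {belief} "
        f"=> db : {turn.db_matches} match "
        f"=> response : {turn.response}"
    )


example = Turn(
    history=["user : i need a table for two at la raza on thursday at 18:00"],
    belief={"restaurant-name": "la raza", "restaurant-people": "2"},
    db_matches=1,
    response="your table is booked . the reference number is [reference] .",
)

print(serialize_turn(example))
```

Because every segment lives in the same token sequence, a single language-modeling loss over that sequence can in principle cover state tracking and response generation together, which is the sense in which such a model is trainable end to end.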

5 Conclusion

In this paper, we have presented SOLOIST. Unlike GPT-2, SOLOIST grounds response generation in user goals and knowledge for task completion. Machine teaching is used to boost the fine-tuning performance. Experimental results on two benchmarks demonstrate that SOLOIST achieves new state-of-the-art performance. When only a few labelled examples are available in new domains, SOLOIST outperforms existing methods by a large margin.

We hope that SOLOIST can inspire the community to comprehensively explore the new paradigm for building task-oriented dialog systems: formulating task-oriented dialog as a single auto-regressive model, pre-training a grounded response generation model on heterogeneous dialog corpora, and adapting the pre-trained model to new tasks through fine-tuning using a handful of task-specific examples via machine teaching.


Acknowledgments

We are grateful to the entire Philly Team inside Microsoft for providing our computing platform.


References

  • D. Adiwardana, M. Luong, D. R. So, J. Hall, N. Fiedel, R. Thoppilan, Z. Yang, A. Kulshreshtha, G. Nemade, Y. Lu, et al. (2020) Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977. Cited by: §4.
  • L. E. Asri, H. Schulz, S. Sharma, J. Zumer, J. Harris, E. Fine, R. Mehrotra, and K. Suleman (2017) Frames: a corpus for adding memory to goal-oriented dialogue systems. arXiv preprint arXiv:1704.00057. Cited by: §2.2.
  • S. Bao, H. He, F. Wang, and H. Wu (2019) PLATO: pre-trained dialogue generation model with discrete latent variable. arXiv preprint arXiv:1910.07931. Cited by: §4.
  • T. Bocklisch, J. Faulkner, N. Pawlowski, and A. Nichol (2017) Rasa: open source language understanding and dialogue management. CoRR abs/1712.05181. Cited by: §1.
  • P. Budzianowski and I. Vulić (2019) Hello, it’s gpt-2–how can i help you? towards the use of pretrained language models for task-oriented dialogue systems. arXiv preprint arXiv:1907.05774. Cited by: Table 3, §4.
  • P. Budzianowski, T. Wen, B. Tseng, I. Casanueva, S. Ultes, O. Ramadan, and M. Gašić (2018) Multiwoz - a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. arXiv preprint arXiv:1810.00278. Cited by: Table 3, §3.1, §3.1, §4.
  • B. Byrne, K. Krishnamoorthi, C. Sankar, A. Neelakantan, D. Duckworth, S. Yavuz, B. Goodrich, A. Dubey, A. Cedilnik, and K. Kim (2019) Taskmaster-1: toward a realistic and diverse dialog dataset. arXiv preprint arXiv:1909.05358. Cited by: §2.2.
  • W. Chen, J. Chen, P. Qin, X. Yan, and W. Y. Wang (2019) Semantically conditioned dialog response generation via hierarchical disentangled self-attention. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 3696–3709. Cited by: Table 3, §3.1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. NAACL. Cited by: §1, §4, §4.
  • L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y. Wang, J. Gao, M. Zhou, and H. Hon (2019) Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, pp. 13042–13054. Cited by: §1.
  • M. Eric and C. D. Manning (2017) Key-value retrieval networks for task-oriented dialogue. arXiv preprint arXiv:1705.05414. Cited by: §2.2.
  • J. Gao, M. Galley, and L. Li (2019) Neural approaches to conversational ai. Foundations and Trends® in Information Retrieval 13 (2-3), pp. 127–298. Cited by: §1.
  • D. Ham, J. Lee, Y. Jang, and K. Kim (2020) End-to-end neural pipeline for goal-oriented dialogue system using gpt-2. ACL. Cited by: Table 4, §3.2, §3.3, §4.
  • A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi (2019) The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751. Cited by: §2.3.
  • E. Hosseini-Asl, B. McCann, C. Wu, S. Yavuz, and R. Socher (2020) A simple language model for task-oriented dialogue. arXiv preprint arXiv:2005.00796. Cited by: footnote 10.
  • N. S. Keskar, B. McCann, L. R. Varshney, C. Xiong, and R. Socher (2019) Ctrl: a conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858. Cited by: §4.
  • S. Kim, M. Galley, C. Gunasekara, S. Lee, A. Atkinson, B. Peng, H. Schulz, J. Gao, J. Li, M. Adada, et al. (2019) The eighth dialog system technology challenge. arXiv preprint arXiv:1911.06394. Cited by: §2.2.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §2.2.
  • S. Lee, Q. Zhu, R. Takanobu, X. Li, Y. Zhang, Z. Zhang, J. Li, B. Peng, X. Li, M. Huang, and J. Gao (2019) ConvLab: multi-domain end-to-end dialog system platform. CoRR abs/1904.08637. Cited by: §1, §4.
  • W. Lei, X. Jin, M. Kan, Z. Ren, X. He, and D. Yin (2018) Sequicity: simplifying task-oriented dialogue systems with single sequence-to-sequence architectures. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Cited by: §1, Table 2, Table 4, §3.1, §3.1, §3.2, Table 6, §4.
  • C. Li, X. Gao, Y. Li, X. Li, B. Peng, Y. Zhang, and J. Gao (2020) Optimus: organizing sentences via pre-trained modeling of a latent space. arXiv preprint arXiv:2004.04092. Cited by: §4.
  • X. Li, Y. Chen, L. Li, J. Gao, and A. Celikyilmaz (2017) End-to-end task-completion neural dialogue systems. arXiv preprint arXiv:1703.01008. Cited by: §1.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) Roberta: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §1.
  • S. Mehri, T. Srinivasan, and M. Eskenazi (2019) Structured fusion networks for dialog. arXiv preprint arXiv:1907.10016. Cited by: Table 3, Table 4, §4.
  • N. Mrkšić, D. O. Séaghdha, T. Wen, B. Thomson, and S. Young (2016) Neural belief tracker: data-driven dialogue state tracking. arXiv preprint arXiv:1606.03777. Cited by: §2.2.
  • J. Pei, P. Ren, and M. de Rijke (2019) A modular task-oriented dialogue system using a neural mixture-of-experts. arXiv preprint arXiv:1907.05346. Cited by: Table 3, §4.
  • B. Peng, C. Zhu, C. Li, X. Li, J. Li, M. Zeng, and J. Gao (2020) Few-shot natural language generation for task-oriented dialog. arXiv preprint arXiv:2002.12328. Cited by: §4.
  • S. Peng, X. Huang, Z. Lin, F. Ji, H. Chen, and Y. Zhang (2019) Teacher-student framework enhanced multi-domain dialogue generation. arXiv preprint arXiv:1908.07137. Cited by: Table 4.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. Cited by: §2.1, §4, §4.
  • A. Rastogi, X. Zang, S. Sunkara, R. Gupta, and P. Khaitan (2019) Towards scalable multi-domain conversational agents: the schema-guided dialogue dataset. arXiv preprint arXiv:1909.05855. Cited by: §2.2.
  • S. Roller, E. Dinan, N. Goyal, D. Ju, M. Williamson, Y. Liu, J. Xu, M. Ott, K. Shuster, E. M. Smith, et al. (2020) Recipes for building an open-domain chatbot. arXiv preprint arXiv:2004.13637. Cited by: §4.
  • R. Sennrich, B. Haddow, and A. Birch (2015) Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909. Cited by: §2.2.
  • S. Shukla, L. Liden, S. Shayandeh, E. Kamal, J. Li, M. Mazzola, T. Park, B. Peng, and J. Gao (2020) Conversation learner–a machine teaching tool for building dialog managers for task-oriented dialog systems. arXiv preprint arXiv:2004.04305. Cited by: §1, §1, §2.3, §2.3, §3.4.
  • P. Y. Simard, S. Amershi, D. M. Chickering, A. E. Pelton, S. Ghorashi, C. Meek, G. Ramos, J. Suh, J. Verwey, M. Wang, and J. Wernsing (2017) Machine teaching: A new paradigm for building machine learning systems. CoRR abs/1707.06742. Cited by: §2.3.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §1.
  • T. Wen, D. Vandyke, N. Mrksic, M. Gasic, L. M. Rojas-Barahona, P. Su, S. Ultes, and S. Young (2016) A network-based end-to-end trainable task-oriented dialogue system. arXiv preprint arXiv:1604.04562. Cited by: §1, §3.1, §3.2, §4.
  • J. D. Williams, K. Asadi, and G. Zweig (2017) Hybrid code networks: practical and efficient end-to-end dialog control with supervised and reinforcement learning. arXiv preprint arXiv:1702.03274. Cited by: §1.
  • J. D. Williams and L. Liden (2017) Demonstration of interactive teaching for end-to-end dialog control with hybrid code networks. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pp. 82–85. Cited by: §2.3.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, and J. Brew (2019a) HuggingFace’s transformers: state-of-the-art natural language processing. ArXiv abs/1910.03771. Cited by: §2.2.
  • T. Wolf, V. Sanh, J. Chaumond, and C. Delangue (2019b) TransferTransfo: A transfer learning approach for neural network based conversational agents. CoRR abs/1901.08149. Cited by: §4.
  • C. Wu, S. Hoi, R. Socher, and C. Xiong (2020a) ToD-BERT: pre-trained natural language understanding for task-oriented dialogues. Cited by: Table 1, §4.
  • Q. Wu, Y. Zhang, Y. Li, and Z. Yu (2019) Alternating recurrent dialog model with large-scale pre-trained language models. arXiv preprint arXiv:1910.03756. Cited by: Table 2, Table 3, §3.1, §4.
  • Z. Wu, M. Galley, C. Brockett, Y. Zhang, X. Gao, C. Quirk, R. Koncel-Kedziorski, J. Gao, H. Hajishirzi, M. Ostendorf, and B. Dolan (2020b) A controllable model of grounded response generation. arXiv preprint arXiv:2005.00613. Cited by: §4.
  • R. Zellers, A. Holtzman, H. Rashkin, Y. Bisk, A. Farhadi, F. Roesner, and Y. Choi (2019) Defending against neural fake news. In Advances in Neural Information Processing Systems. Cited by: §4.
  • Y. Zhang, Z. Ou, and Z. Yu (2019a) Task-oriented dialog systems that consider multiple appropriate responses under the same context. arXiv preprint arXiv:1911.10484. Cited by: §1, Table 3, Table 4, §3.1, §3.2, Table 7, Table 8, §4.
  • Y. Zhang, S. Sun, M. Galley, Y. Chen, C. Brockett, X. Gao, J. Gao, J. Liu, and B. Dolan (2019b) DialoGPT: large-scale generative pre-training for conversational response generation. arXiv preprint arXiv:1911.00536. Cited by: §2.1, §4.
  • T. Zhao and M. Eskenazi (2016) Towards end-to-end learning for dialog state tracking and management using deep reinforcement learning. arXiv preprint arXiv:1606.02560. Cited by: §1.
  • T. Zhao, K. Xie, and M. Eskenazi (2019) Rethinking action spaces for reinforcement learning in end-to-end dialog agents with latent variable models. arXiv preprint arXiv:1902.08858. Cited by: Table 3, §4.
  • Q. Zhu, Z. Zhang, Y. Fang, X. Li, R. Takanobu, J. Li, B. Peng, J. Gao, X. Zhu, and M. Huang (2020) ConvLab-2: an open-source toolkit for building, evaluating, and diagnosing dialogue systems. CoRR abs/2002.04793. Cited by: §1.
  • X. Zhu (2015) Machine teaching: an inverse problem to machine learning and an approach toward optimal education. In Twenty-Ninth AAAI Conference on Artificial Intelligence. Cited by: §1, §2.3.