1 Introduction and Related Work
Current approaches to building task-oriented dialog systems still require a substantial amount of annotation and are therefore labor-intensive. Meanwhile, large-scale pre-trained language models such as BERT Devlin et al. (2018) and GPT Radford et al. (2019) have achieved great success on various NLP tasks, demonstrating the effectiveness of pre-training. There have been several attempts to apply these language models directly to dialog systems. For example, TransferTransfo Wolf et al. (2019) fine-tuned GPT on the Persona-Chat dataset (Zhang et al., 2018b) and achieved state-of-the-art performance on chitchat dialog generation. Budzianowski and Vulic (2019) adopted the TransferTransfo structure, further pre-trained GPT-2 on a collection of task-oriented dialogs, and obtained good results on downstream tasks. DialoGPT Zhang et al. (2019) further pre-trains GPT-2 Radford et al. (2019) on a large Reddit corpus. All of these studies point to a promising direction: building dialog systems with large-scale language models and less supervision.
However, these language models still have limitations when applied to dialog systems. First, further pre-training language models for dialog systems requires a huge amount of training data, but a diverse collection of high-quality dialog datasets is hard to obtain. Second, dialogs involve multiple parties, each with its own language style, yet most previous dialog systems use a single language model to generate utterances for all parties. Third, dialogs have variable lengths, so GPT's fixed-length position embedding yields sub-optimal results. Additionally, dialogs involve a large amount of commonsense knowledge, which can be missing in small language models. Finally, natural dialogs require a good understanding of the context, yet contextual information is hard to preserve in language models.
To tackle these issues, we propose the Pre-trained Role Alternating Language model (PRAL), a language model specifically designed for dialog generation. To begin with, we collect and process 13 dialog datasets, ranging from TV transcripts to pizza-ordering dialogs, to enrich the pre-training data with high-quality dialog corpora. Second, we adopt ARDM, proposed in Wu et al. (2019), and use two separate GPT-2 models for the two speakers in a dialog. Next, we apply Start Position Randomization (SPR) to cope with the variable lengths of dialogs, which also prevents the language model from binding position indices to specific text. Additionally, we use the original large-scale GPT-2 as a teacher for knowledge distillation, incorporating commonsense knowledge into dialog generation. Finally, we re-weight each utterance with discount factors that emphasize the later part of a dialog, to better incorporate contextual information. We evaluate PRAL on three task-oriented datasets (CamRest676, MultiWOZ, and PersuasionForGood) and achieve state-of-the-art results without using any annotation.
In summary, we process and present a collection of high-quality dialog datasets suitable for pre-training large-scale language models for dialog systems. We also propose PRAL and design several effective techniques to improve dialog model pre-training. Our pre-trained model increases the success rate on the CamRest676 and MultiWOZ datasets, and improves the coherence and diversity scores by 50% on PersuasionForGood.
2 PretrainDial Dataset for Pretraining
Clean dialog datasets large enough to pre-train language models for dialog systems are difficult to find. Therefore, we propose PretrainDial, a large-scale multi-domain dialog corpus suitable for pre-training. We carefully selected 13 existing dialog corpora, listed in Appendix A.2, ranging from chitchat such as TV transcripts to task-oriented dialogs, and processed them into a unified format. Table 1 shows the statistics of PretrainDial.
3 PRAL
We adopt the architecture of the "Alternating Roles Dialog Model" (ARDM) Wu et al. (2019), which uses two separate language models for the user and the system. Each language model is initialized with GPT-2 small Radford et al. (2019). In this section, we briefly introduce ARDM and describe our approaches to improving existing language models. Figure 1 shows the main structure of PRAL.
3.1 Alternating Roles Dialog Model
We first briefly review the Alternating Roles Dialog Model (ARDM) Wu et al. (2019). The basic idea behind ARDM is to model the user and the system simultaneously with two separate GPT-2 models, capturing their different language styles. A dialog can be considered a sequence of utterances $d = (u_1, u_2, \ldots, u_T)$, where $T$ is the total number of turns. We use $p_{\mathrm{user}}(u_t \mid u_{<t})$ and $p_{\mathrm{sys}}(u_t \mid u_{<t})$ to represent the probability of a user utterance and a system utterance given the dialog history. The entire dialog distribution is defined as:

$$p(d) = \prod_{t\ \mathrm{odd}} p_{\mathrm{user}}(u_t \mid u_{<t}) \prod_{t\ \mathrm{even}} p_{\mathrm{sys}}(u_t \mid u_{<t}) \qquad (1)$$
By maximizing the likelihood in Equation (1), ARDM successfully models the user and system at the same time. However, ARDM did not employ additional pre-training on the dialog corpus. In contrast, we further pre-train ARDM on our collected dialog corpus. In addition, we propose three effective techniques to help pre-training.
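The alternating factorization above can be sketched as follows. This is a minimal illustration, not ARDM itself: the two "models" are stand-in scoring functions (real ARDM uses two GPT-2s), and the odd/even routing follows Equation (1).

```python
# Toy sketch of ARDM's alternating factorization: utterances at odd turn
# numbers (1, 3, 5, ...) are scored by the user model, even turns by the
# system model. The scorers below are placeholders, not real GPT-2s.

def score_user(utterance, history):
    # placeholder log-probability from the "user" language model
    return -0.5 * len(utterance.split())

def score_system(utterance, history):
    # placeholder log-probability from the "system" language model
    return -0.4 * len(utterance.split())

def dialog_log_prob(utterances):
    """log p(d): sum per-utterance log-probs, routing each turn to its role."""
    logp = 0.0
    for t, u in enumerate(utterances):
        history = utterances[:t]
        if t % 2 == 0:   # 0-based index 0, 2, ... = turns 1, 3, ... -> user
            logp += score_user(u, history)
        else:            # turns 2, 4, ... -> system
            logp += score_system(u, history)
    return logp
```

Maximizing this joint likelihood trains both role models at once, since each utterance contributes to exactly one model's loss.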
3.2 Start Position Randomization
We use GPT-2 as the language model in PRAL. GPT-2 uses a position embedding to encode the location of each token. It supports a maximum of 1024 positions, and the position index always starts from 0. However, since most dialogs contain fewer than 1024 tokens, most vectors in the position embedding would not be updated during pre-training. Moreover, since the position embedding only provides location information, fixing the start position to 0 binds certain text to certain position indices. For example, "hi" is always bound to index 1, as "hi" usually appears at the beginning. The model is therefore likely to overfit on the position embeddings near the start.
To address these issues, we propose Start Position Randomization (SPR). Denoting the total number of tokens in a dialog as $N$, the maximum start position index is $1024 - N$. We randomize the start position to be any integer between $0$ and $1024 - N$. This disentangles positional information from textual meaning and forces the model to update all of the position embeddings.
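A minimal sketch of SPR, assuming GPT-2's limit of 1024 positions: instead of always numbering tokens from 0, we draw the start index uniformly and number the dialog's tokens from there.

```python
import random

MAX_POSITIONS = 1024  # GPT-2's position embedding table size

def randomized_position_ids(num_tokens, rng=random):
    """Position ids for a dialog of num_tokens tokens, with a random start."""
    if num_tokens > MAX_POSITIONS:
        raise ValueError("dialog longer than the position embedding table")
    # start anywhere in [0, 1024 - N] so the dialog still fits
    start = rng.randint(0, MAX_POSITIONS - num_tokens)
    return list(range(start, start + num_tokens))
```

In a real training loop these ids would be passed to the model in place of the default 0-based positions, so that over many batches every row of the embedding table receives gradient updates.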
3.3 Teacher GPT
All neural networks suffer from the catastrophic forgetting problem (Kirkpatrick et al., 2016). Since we further train GPT-2 on the new dialog corpus to obtain a new language model, the new model is at risk of forgetting the prior knowledge in the original GPT-2. Therefore, we apply a simple continual learning approach Parisi et al. (2018) to mitigate the problem. Specifically, we use another, fixed GPT-2 as a teacher network to preserve that knowledge. To do so, we use the distillation loss Hinton et al. (2015), the KL divergence between the fixed GPT-2's distribution and our model's:

$$\mathcal{L}_{\mathrm{KL}} = \mathrm{KL}\big(p_{\mathrm{teacher}} \,\|\, p_{\mathrm{model}}\big)$$
In our best model, we use GPT-2 large as the teacher language model to distill more knowledge. Because a larger GPT-2 requires more computational resources, we also run an ablation using GPT-2 small as the teacher. The results suggest that, regardless of the teacher's size, our method helps the dialog model pre-training process.
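The distillation term can be sketched as a per-position KL divergence over next-token distributions. This is an illustration in plain Python; in PRAL the logits would come from GPT-2 large (teacher) and the fine-tuned GPT-2 small (student), and the direction KL(teacher || student) is the usual distillation convention, which we assume here.

```python
import math

def softmax(logits):
    # numerically stable softmax over a list of logits
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl_divergence(teacher_logits, student_logits):
    """KL(p_teacher || p_student) for one next-token distribution."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

Averaging this quantity over all token positions in a batch gives the distillation loss that is later combined with the language modeling loss.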
3.4 History Discount
In each dialog, utterances in the latter part should carry more importance because they aggregate more complex contextual information, which helps the model learn consistency with the context. Therefore, we introduce a discount factor $\gamma$ to re-weight the importance of each utterance based on its turn number. For a dialog with a total of $T$ utterances and current utterance index $t$, the language model loss is weighted by $\gamma^{T-t}$. With this discount factor, the model is pushed harder to predict the later, context-heavy utterances and to generate more consistent responses.
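The weighting can be sketched as below. The paper weights each utterance's loss by a discount factor based on its turn number; the exponent form `gamma ** (T - t)`, which makes the final turn's weight exactly 1 and earlier turns progressively smaller, is our reading of that description (with gamma = 0.95 as in Appendix A.1).

```python
# Sketch of the history discount (assumed form gamma^(T - t)): later
# utterances in a T-turn dialog receive weights closer to 1.

def history_discount_weights(num_turns, gamma=0.95):
    """Per-utterance loss weights for turns t = 1..T."""
    return [gamma ** (num_turns - t) for t in range(1, num_turns + 1)]
```

Each weight would multiply the corresponding utterance's cross-entropy loss before summing over the dialog.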
3.5 Training Objective
We use the following language modeling loss to optimize the model:

$$\mathcal{L}_{\mathrm{LM}} = \sum_{t=1}^{T} \gamma^{T-t} \sum_{i=1}^{N_t - 1} \mathrm{CE}\big(p(w_{i+1}^{t}),\, w_{i+1}^{t}\big)$$

Here CE denotes the cross-entropy loss, $T$ is the total number of utterances in a dialog, and $N_t$ is the number of tokens in utterance $t$. The loss of each utterance is weighted by the discount factor $\gamma^{T-t}$ described in Section 3.4. We go over each word in the utterance, except for the last one, and compute the cross-entropy between the output probability distribution $p(w_{i+1}^{t})$ and the ground-truth token $w_{i+1}^{t}$.
Our final loss is a combination of the language modeling loss and the KL divergence:

$$\mathcal{L} = \mathcal{L}_{\mathrm{LM}} + \alpha\, \mathcal{L}_{\mathrm{KL}}$$

The factor $\alpha$ is used for better optimization and decreases exponentially as the number of iterations increases, i.e., $\alpha_{i+1} = \beta \cdot \alpha_{i}$.
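The combined objective can be sketched as follows, using alpha = 0.1 and beta = 0.9999 from Appendix A.1; the multiplicative per-iteration decay is our assumption about the exact schedule.

```python
# Sketch of the combined loss: total = LM loss + alpha * KL loss, where
# alpha decays geometrically by beta each training iteration (assumed
# schedule; initial values from the paper's appendix).

ALPHA_0 = 0.1
BETA = 0.9999

def total_loss(lm_loss, kl_loss, iteration):
    alpha = ALPHA_0 * (BETA ** iteration)
    return lm_loss + alpha * kl_loss
```

As training proceeds, the distillation term fades out and the objective approaches pure language modeling on the dialog corpus.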
4 Experiments
We pre-train PRAL on PretrainDial; for pre-training details, please refer to Appendix A.1. To show the generalizability of PRAL, we evaluate it on three task-oriented dialog tasks: CamRest676, MultiWOZ, and PersuasionForGood.
CamRest676 Wen et al. (2017) is a small dialog dataset for restaurant recommendation in Cambridge. It contains 676 dialogs in which users look for restaurants based on their preferences for food, price range, and area. Table 2(a) shows our results on CamRest676. We use BLEU-4 to measure the quality of the generated sentences, and Success F1 to evaluate responses on specific slots such as address, phone, and postcode. Sequicity is a state-of-the-art method for task-oriented dialog that relies on annotations in the traditional fashion. PRAL beats all the baselines on both BLEU-4 and Success F1, including the state-of-the-art ARDM model. Notably, PRAL does not need any annotation. This suggests that PRAL leverages external knowledge from pre-training and that the proposed techniques are effective for dialog language model pre-training.
We also perform an ablation study on CamRest676 and find that the Teacher GPT plays the most important role, suggesting that knowledge distillation from large-scale pre-training is critical to good performance. Our model also outperforms the DialoGPT baseline, which is pre-trained on a much larger Reddit dataset (30GB) than the smaller but higher-quality PretrainDial (300MB) used in PRAL. This suggests that the quality of the dataset matters more than its size.
MultiWOZ Budzianowski et al. (2018) is a large-scale multi-domain dataset containing around 10k dialogs across various domains. We evaluate the models with BLEU-4, Inform Rate, and Success Rate, which measure whether the system provides the requested information. Table 2(b) shows our results. We first compare our model to the attention-based seq2seq baseline in MultiWOZ (Budzianowski et al., 2018), and then to HDSA and LaRL Zhao and Kawahara (2019). Our model outperforms or achieves comparable results with HDSA and LaRL; in particular, PRAL's BLEU-4 score is 68.8% higher than LaRL's. PRAL also outperforms ARDM on all metrics. It is worth noting that our model does not use any annotation.
PersuasionForGood We also evaluate our method on a non-collaborative dialog dataset, PersuasionForGood Wang et al. (2019), in which a persuader tries to persuade another user to donate money. There are 1,017 dialogs in total. Unlike CamRest676 and MultiWOZ, the language in PersuasionForGood is so diverse that the BLEU-4 scores of all models are too low to serve as a meaningful metric; we therefore use BLEU-1 and BLEU-2 instead. Compared with ARDM, our model achieves significantly higher BLEU scores, especially on BLEU-2 (up 63%). We also conduct a human evaluation comparing ARDM and our model: we ask human evaluators how much they are willing to donate after the conversation, and collect their ratings of the dialog system in terms of fluency, logic, coherence, and diversity. The results suggest that PRAL outperforms ARDM on all metrics and is a better dialog language model in general. For examples of the persuasion process, please refer to Appendix A.3.
5 Conclusion
We propose PRAL, a large pre-trained language model for task-oriented dialog systems. We successfully incorporate methods designed for large pre-trained language models into PRAL and achieve good performance on three downstream tasks. Specifically, we design start position randomization, knowledge distillation, and history discounting to improve pre-training. According to the human evaluation results, the model generates more fluent, coherent, diverse, and logical dialogs, and the resulting dialog system also elicits more donations. We also release a cleaned, high-quality dialog dataset for pre-training. Our work is a first step towards a coherent and engaging dialog model that generalizes across dialog tasks.
References
- Frames: a corpus for adding memory to goal-oriented dialogue systems. arXiv preprint arXiv:1704.00057.
- Hello, it's GPT-2 - how can I help you? Towards the use of pretrained language models for task-oriented dialogue systems. CoRR abs/1907.05774.
- MultiWOZ - a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. CoRR abs/1810.00278.
- Taskmaster-1: toward a realistic and diverse dialog dataset. ArXiv abs/1909.05358.
- ChitChat dataset. https://github.com/BYU-PCCL/chitchat-dataset
- Friends corpus. https://www.kaggle.com/vinayvk/friends-series-data-set
- Chameleons in imagined conversations: a new approach to understanding coordination of linguistic style in dialogs. In CMCL@ACL.
- BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Talking to myself: self-dialogues as data for conversational agents. ArXiv abs/1809.06641.
- Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
- Overcoming catastrophic forgetting in neural networks. CoRR abs/1612.00796.
- DailyDialog: a manually labelled multi-turn dialogue dataset. In IJCNLP.
- Continual lifelong learning with neural networks: a review. CoRR abs/1802.07569.
- Language models are unsupervised multitask learners. OpenAI Technical Report.
- Coached conversational preference elicitation: a case study in understanding movie preferences. In SIGDIAL 2019.
- Towards scalable multi-domain conversational agents: the schema-guided dialogue dataset. ArXiv abs/1909.05855.
- Reddit corpus. https://zissou.infosci.cornell.edu/convokit/documentation/subreddit.html
- Persuasion for good: towards a personalized persuasive dialogue system for social good. In ACL.
- A network-based end-to-end trainable task-oriented dialogue system. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, Valencia, Spain, pp. 438-449.
- TransferTransfo: a transfer learning approach for neural network based conversational agents. CoRR abs/1901.08149.
- Alternating recurrent dialog model with large-scale pre-trained language models. arXiv preprint arXiv:1910.03756.
- Conversations gone awry: detecting early signs of conversational failure. arXiv preprint arXiv:1805.05345.
- Personalizing dialogue agents: I have a dog, do you have pets too? CoRR abs/1801.07243.
- DialoGPT: large-scale generative pre-training for conversational response generation.
- Effective incorporation of speaker information in utterance encoding in dialog. arXiv preprint arXiv:1907.05599.
Appendix A Appendices
a.1 Training Details
We adopt the architecture of ARDM, using two language models to simulate the user and the system. Each language model is initialized with pre-trained GPT-2 small Radford et al. (2019); for the teacher model, we use GPT-2 large Radford et al. (2019). We follow GPT-2's special format as the "trigger" so the model can generate dialog responses zero-shot: we use "A:" and "B:" as the role prefixes and "\n\n\n" as the suffix. We use the AdamW optimizer, with the number of warm-up steps set to 10 percent of the total training steps. The learning rate is set to be . For the loss, we set $\alpha$ to 0.1 and $\beta$ to 0.9999. The discount factor $\gamma$ is set to 0.95.
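The warm-up described above can be sketched as a learning-rate multiplier: a linear ramp over the first 10% of steps. The linear decay after warm-up is an assumption (the paper only specifies the warm-up fraction), matching the common AdamW + linear-schedule setup.

```python
# Sketch of the warm-up schedule: linear ramp over the first 10% of
# training steps, then (assumed) linear decay to zero.

def lr_scale(step, total_steps, warmup_frac=0.1):
    """Multiplier in [0, 1] applied to the base learning rate at `step`."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
```

At each optimizer step, the base learning rate would be multiplied by this scale before applying the AdamW update.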
a.2 Dataset sources
Our dataset contains high-quality dialogs selected from the 13 existing datasets listed in Table 3.
1. CCPE-M dataset Radlinski et al. (2019)
2. ChitChat dataset
3. Conversations Gone Awry dataset (Wiki) Zhang et al. (2018a)
4. Conversations Gone Awry dataset (CMV) Zhang et al. (2018a)
5. Cornell Movie-Dialogs Corpus Danescu-Niculescu-Mizil and Lee (2011)
6. DailyDialog Li et al. (2017)
7. Frames dataset Asri et al. (2017)
8. PersonaChat ConvAI2 dataset Zhang et al. (2018b)
9. The Schema-Guided Dialogue dataset Rastogi et al. (2019)
10. Self-dialogue Corpus Fainberg et al. (2018)
11. Friends Series dataset
12. Taskmaster-1 Byrne et al. (2019)
13. Reddit Corpus Reddit (2019)
a.3 Persuasion Example
One dialogue example is shown in Table 4.
SYS: Good morning, how are you doing today?
SYS: Have you ever heard of Save the Children?
USR: I have not heard of it. Would you like to give an introduction?
SYS: It is a charity that helps children in need around the world.
SYS: I would like to give a little bit more information.
USR: Sure. Go ahead.
SYS: Save the Children is an international non-governmental organization that promotes children's rights, provides relief and helps support children in developing countries.
USR: It is a good organization. I am glad to know it.
SYS: Yes, they are a very professional organization. Would you be interested in donating some of your task money to this charity?
USR: Sure. I would like to donate some money to it.
SYS: Thank you so much!
SYS: How much do you like to donate to the charity now? Your donation will be directly deducted from your task payment.
USR: I would like to give $2 to them.
SYS: Thank you so much!
USR: You are welcome
SYS: I hope you have a great day!