A Tailored Pre-Training Model for Task-Oriented Dialog Generation

04/24/2020 ∙ by Jing Gu, et al. ∙ University of California-Davis

The recent success of large pre-trained language models such as BERT and GPT-2 suggests that incorporating language priors is effective in downstream dialog generation tasks. However, the performance of pre-trained models on dialog tasks falls short of expectations. In this paper, we propose a Pre-trained Role Alternating Language model (PRAL), designed specifically for task-oriented conversational systems. We adopt ARDM (Wu et al., 2019), which models the two speakers separately. We also design several techniques, such as start position randomization, knowledge distillation, and history discount, to improve pre-training performance. We introduce a task-oriented dialog pre-training dataset built by cleaning 13 existing datasets. We test PRAL on three different downstream tasks. The results show that PRAL performs better than or on par with state-of-the-art methods.




1 Introduction and Related Work

Current approaches to building task-oriented dialog systems still require a substantial amount of annotation and are therefore labor-intensive. On the other hand, large-scale pre-trained language models such as BERT Devlin et al. (2018) and GPT Radford et al. (2019) have achieved great success on various NLP tasks, which demonstrates the effectiveness of pre-training. There have been several attempts to directly apply these language models to dialog systems. For example, TransferTransfo Wolf et al. (2019) fine-tuned GPT on the Persona-Chat dataset (Zhang et al., 2018b) and achieved state-of-the-art performance on chitchat dialog generation. Budzianowski and Vulic (2019) adopted the structure of TransferTransfo, further pre-trained GPT-2 on a collection of task-oriented dialogs, and obtained good results on downstream tasks. DialoGPT Zhang et al. (2019) uses a large Reddit corpus to further pre-train GPT-2. All of these studies point to a promising direction: building dialog systems with large-scale language models and less supervision.

Figure 1: Illustration of PRAL. There are two language models, one for the user and one for the system. Teacher GPT is used to provide better supervision to them. $L_{LM}$ and $L_{KL}$ denote the losses for language modeling and the KL divergence, respectively.

However, these language models still have some limitations when applied to dialog systems. First, further pre-training language models for dialog systems requires a huge amount of training data, but a diverse collection of high-quality dialog datasets is hard to obtain. Second, dialogs involve multiple parties, and each party has a different language style. However, most previous dialog systems use a single language model to generate the utterances of all parties. Third, dialogs vary in length, so the fixed-length position embedding in GPT leads to sub-optimal results. Fourth, dialogs involve a large amount of commonsense knowledge, which can be missing in small language models. Finally, natural dialogs require a good understanding of the context, yet contextual information is hard to preserve in language models.

To tackle these issues, we propose the Pre-trained Role Alternating Language model (PRAL), a language model specifically designed for dialog generation. First, we collect and process 13 dialog datasets, ranging from TV transcripts to pizza-ordering dialogs, to enrich the pre-training data with high-quality dialog corpora. Second, we adopt ARDM, proposed in Wu et al. (2019), and use two separate GPT-2 models to model the two speakers in the dialog. Third, we apply Start Position Randomization (SPR) to cope with the variable lengths of dialogs, which also prevents the language model from binding the position index to the text information. Fourth, we use the original large-scale GPT-2 to perform knowledge distillation and incorporate commonsense knowledge into the dialog generation. Finally, we re-weight each utterance with discount factors and emphasize the later part of a dialog to better incorporate contextual information. We evaluate PRAL on three task-oriented datasets (CamRest676, MultiWOZ, and PersuasionForGood) and reach state-of-the-art results without using any annotation.

In summary, we process and present a collection of high-quality dialog datasets suitable for pre-training large-scale language models for dialog systems. We also propose PRAL and design several effective techniques to improve dialog model pre-training. Our pre-trained model increases the success rate on the CamRest676 and MultiWOZ datasets, and improves the coherence and diversity scores by 50% on PersuasionForGood.

Dataset Statistics
Number of domains          13
Number of dialogues        142,298
Avg. turns per dialogue    12.66
Avg. tokens per turn       11.78
Avg. tokens per dialogue   149.25
Total unique tokens        108,106

Table 1: Statistics of our dataset

2 PretrainDial Dataset for Pretraining

Clean dialog datasets that are large enough to pre-train language models for dialog systems are difficult to find. Therefore, we propose PretrainDial, a large-scale multi-domain dialog corpus suitable for pre-training. We carefully selected the 13 existing dialog corpora listed in Appendix A.2, ranging from chitchat such as TV transcripts to task-oriented dialogs, and processed them into a unified form. Table 1 shows the statistics of PretrainDial.

3 Methods

We adopt the architecture of the “Alternating Roles Dialog Model” (ARDM) Wu et al. (2019), which uses two separate language models for the user and the system. Each language model is initialized with a small GPT-2 Radford et al. (2019). In this section, we briefly introduce ARDM and describe our approaches to improving existing language models. Figure 1 shows the main structure of PRAL.

3.1 Alternating Roles Dialog Model

We first briefly describe the Alternating Roles Dialog Model (ARDM) Wu et al. (2019). The basic idea behind ARDM is to model the user and the system simultaneously with two separate GPT-2 models, capturing their different language styles. A dialog can be considered as a sequence of utterances $d = \{u_1, s_1, u_2, s_2, \ldots, u_T, s_T\}$, where $T$ is the total number of turns. We use $p_u$ and $p_s$ to represent the probability of the user utterance and the system utterance, respectively. The entire dialog distribution is defined as:

$$p(d) = \prod_{t=1}^{T} p_u(u_t \mid u_{<t}, s_{<t}) \; p_s(s_t \mid u_{\le t}, s_{<t}) \qquad (1)$$

By maximizing the likelihood in Equation (1), ARDM successfully models the user and system at the same time. However, ARDM did not employ additional pre-training on the dialog corpus. In contrast, we further pre-train ARDM on our collected dialog corpus. In addition, we propose three effective techniques to help pre-training.
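As a rough illustration of the alternating-roles factorization, the sketch below sums utterance log-probabilities while switching between a user model and a system model. The function names, the toy constant-probability "language models", and the example dialog are all hypothetical stand-ins; ARDM itself uses two GPT-2 networks.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical interface: a "language model" is any function returning
# log p(utterance | history). Real ARDM uses two GPT-2 networks instead.
LogProbFn = Callable[[Sequence[str], str], float]


def dialog_log_prob(utterances: List[str],
                    user_lm: LogProbFn,
                    system_lm: LogProbFn) -> float:
    """Sum log-probabilities, using the user model for even-indexed
    utterances and the system model for odd-indexed ones."""
    total = 0.0
    for i, utterance in enumerate(utterances):
        history = utterances[:i]  # everything said so far, by both roles
        lm = user_lm if i % 2 == 0 else system_lm
        total += lm(history, utterance)
    return total


def make_toy_lm(per_token_prob: float) -> LogProbFn:
    """Toy stand-in: every whitespace token gets the same probability."""
    def lm(history: Sequence[str], utterance: str) -> float:
        return len(utterance.split()) * math.log(per_token_prob)
    return lm


# Made-up dialog: user, system, user.
dialog = ["i want cheap food", "try the golden wok", "thanks"]
user_lm = make_toy_lm(0.5)
system_lm = make_toy_lm(0.25)
log_p = dialog_log_prob(dialog, user_lm, system_lm)
print(log_p)
```

Each role keeps its own parameters, while both condition on the full shared history, which is the point of modeling the two speakers separately.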

3.2 Start Position Randomization

We use GPT-2 as the language model in PRAL. GPT-2 uses position embeddings to encode the location of each token. It supports a maximum of 1024 positions, and the position index always starts from 0. However, since most dialogs contain fewer than 1024 tokens, most vectors in the positional embedding would rarely be updated during pre-training. Besides, since the position embedding only provides the location information for each token, fixing the start position to 0 binds certain text to certain position indices. For example, “hi” is almost always bound to the first few indices, as it usually appears at the beginning of a dialog. The model is therefore likely to overfit on the positional embeddings near the start.

To address these issues, we propose Start Position Randomization (SPR). Denoting $n$ as the total number of tokens in a dialog, the maximum start position index is $1024 - n$. We randomize the start position to be any number between 0 and $1024 - n$. This disentangles the positional information from the textual meaning and forces the model to update all the positional embeddings.
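A minimal sketch of how SPR could produce position indices, assuming a dialog of `num_tokens` tokens and the 1024-position limit mentioned above. The function name and the seeded random generator are illustrative choices, not taken from the paper.

```python
import random
from typing import List

MAX_POSITIONS = 1024  # maximum number of positions supported by GPT-2


def randomized_position_ids(num_tokens: int,
                            rng: random.Random,
                            max_positions: int = MAX_POSITIONS) -> List[int]:
    """Return consecutive position ids starting at a random offset.

    The offset is drawn uniformly from [0, max_positions - num_tokens],
    so the last id never exceeds max_positions - 1.
    """
    if not 0 < num_tokens <= max_positions:
        raise ValueError("num_tokens must be between 1 and max_positions")
    start = rng.randint(0, max_positions - num_tokens)  # inclusive bounds
    return list(range(start, start + num_tokens))


rng = random.Random(0)  # seeded only to make the example repeatable
position_ids = randomized_position_ids(150, rng)
print(position_ids[0], position_ids[-1])
```

In a training loop, a fresh offset would be drawn for each dialog, so that over time every position embedding receives gradient updates.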

3.3 Teacher GPT

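Neural networks are prone to the catastrophic forgetting problem (Kirkpatrick et al., 2016). Since we further train GPT-2 on the new dialog corpus and obtain a new language model, the new model is at risk of forgetting the prior knowledge of the original GPT-2. Therefore, we apply a simple continual learning approach Parisi et al. (2018) to mitigate the problem. In detail, we use another fixed GPT-2 as the teacher network to preserve the knowledge. To do so, we use the distillation loss Hinton et al. (2015), which calculates the KL divergence between our model and the fixed GPT-2, $L_{KL}$:

$$L_{KL} = D_{KL}\left(p_{\text{teacher}} \,\|\, p_{\text{PRAL}}\right) = \sum_{w} p_{\text{teacher}}(w) \log \frac{p_{\text{teacher}}(w)}{p_{\text{PRAL}}(w)}$$

where $p_{\text{teacher}}$ and $p_{\text{PRAL}}$ are the next-token distributions of the fixed GPT-2 and of our model, and $w$ ranges over the vocabulary.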

In our best model, we use GPT-2 large as the teacher language model to distill more knowledge. Because applying a larger GPT-2 requires more computational resources, we also conduct the ablation of using GPT-2 small as the teacher language model in the experiments. The result suggests that regardless of the size of the GPT-2, our method helps in the dialog model pretraining process.
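The sketch below shows one way a distillation term of this kind can be computed for a single token position. It assumes the teacher distribution is the target, as is usual for distillation, which the text does not state explicitly. The three-way "vocabulary" and the logits are made-up numbers.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Convert logits to probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Made-up next-token logits over a tiny 3-word vocabulary.
teacher_probs = softmax([2.0, 1.0, 0.1])  # fixed teacher GPT-2 (assumed)
student_probs = softmax([1.5, 1.2, 0.3])  # model being pre-trained (assumed)

distill_loss = kl_divergence(teacher_probs, student_probs)
print(distill_loss)
```

The term is zero when the student matches the teacher exactly and grows as the student drifts away, which is what lets it act as a guard against forgetting.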

3.4 History Discount

In each dialog, later utterances should carry more weight because they aggregate more complex contextual information, which helps the model learn consistency with the context. Therefore, we introduce a discount factor $\gamma$ to re-weight the importance of each utterance based on its turn number. For a dialog with a total of $N$ utterances and the current utterance index $i$, the language model loss is weighted by $\gamma^{N-i}$. With this discount, the model becomes better at predicting complex context and generating more consistent responses.
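To make the re-weighting concrete, here is a small sketch that computes per-utterance weights. It assumes the weight of utterance `i` (1-indexed) out of `num_utterances` is the discount factor raised to the power `num_utterances - i`, and it uses the value 0.95 reported in Appendix A.1. The function name is hypothetical.

```python
from typing import List


def discount_weights(num_utterances: int, gamma: float = 0.95) -> List[float]:
    """Weight for each utterance i = 1..N, assumed to be gamma ** (N - i).

    The last utterance gets weight 1.0 and earlier ones are progressively
    down-weighted.
    """
    return [gamma ** (num_utterances - i)
            for i in range(1, num_utterances + 1)]


weights = discount_weights(4)
print(weights)
```

With a factor below 1, the weights increase monotonically toward the end of the dialog, matching the stated goal of emphasizing later utterances.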

3.5 Optimization

We use the language modeling loss to optimize the model:

$$L_{LM} = \sum_{i=1}^{N} \gamma^{N-i} \sum_{j=1}^{M_i - 1} \mathrm{CE}\left(p_{i,j},\, y_{i,j}\right)$$

Here CE denotes the cross-entropy loss, $N$ is the total number of utterances in a dialog, and $M_i$ is the total number of tokens in the $i$-th utterance. The loss of each utterance is weighted by the discount factor $\gamma^{N-i}$ described in Section 3.4. We go over each word in the utterance, except for the last one, and compute the cross-entropy loss between the output probability distribution $p_{i,j}$ and its ground truth $y_{i,j}$.

Our final loss is a combination of the language modeling loss and the KL divergence:

$$L = L_{LM} + \alpha \, L_{KL}$$

The factor $\alpha$ is used for better optimization and decreases exponentially as the number of iterations increases, i.e., $\alpha = \alpha_0 \, \beta^{k}$ at iteration $k$.
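The following sketch illustrates an exponentially decaying weight on the distillation term. It assumes that the decaying factor multiplies the KL term, and that the two values reported in Appendix A.1 (0.1 and 0.9999) are the initial weight and the per-iteration decay rate. The names `kl_weight` and `total_loss` are hypothetical.

```python
def kl_weight(step: int, initial: float = 0.1, decay: float = 0.9999) -> float:
    """Exponentially decaying weight: initial * decay ** step (assumed form)."""
    return initial * decay ** step


def total_loss(lm_loss: float, kl_loss: float, step: int) -> float:
    """Language modeling loss plus the down-weighted distillation term."""
    return lm_loss + kl_weight(step) * kl_loss


# Made-up loss values, evaluated at the start and later in training.
print(total_loss(2.0, 0.5, step=0))
print(total_loss(2.0, 0.5, step=10_000))
```

Under these assumptions the teacher has its strongest influence early in pre-training and gradually hands control to the language modeling objective.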

Model                          BLEU-4   Success F1
Sequicity                      21.4     0.852
Sequicity (w/o RL)             22.9     0.821
GPT-2-finetune                 21.8     0.851
DialoGPT                       25.2     0.861
ARDM                           26.2     0.864
PRAL                           27.3     0.870
    - w/ Teacher GPT (small)   26.9     0.869
    - w/o Teacher GPT          25.0     0.865
    - w/o loss discount        27.0     0.867
    - w/o SPR                  26.6     0.869
(a) Results on the CamRest676 dataset.

Model      Supervision:   Supervision:   BLEU-4   Inform   Success
           Dialog State   Dialog Act
Human      -              -              -        0.989    0.965
Baseline                                 18.9     0.825    0.729
HDSA                                     23.6     0.877    0.734
LaRL                                     12.8     0.828    0.792
ARDM                                     20.6     0.874    0.728
PRAL                                     21.6     0.875    0.742
(b) Results on the MultiWOZ dataset.

Model   Perplexity   BLEU-1   BLEU-2   Fluency   Logic   Coherence   Diversity   Overall   Avg. Donation
ARDM    10.1         16.5     6.44     0.39      0.41    0.37        0.27        0.18      0.62
PRAL    10.3         17.3     10.9     0.61      0.59    0.63        0.73        0.82      0.99
(c) Automatic and human evaluation results on PersuasionForGood.

Table 2: Evaluation on three datasets

4 Experiments

We pre-train PRAL on PretrainDial. For the pre-training details, please refer to Appendix A.1. To show the generalizability of PRAL, we evaluate it on three task-oriented dialog tasks: CamRest676, MultiWOZ, and PersuasionForGood.

CamRest676 Wen et al. (2017) is a small dialog dataset for restaurant recommendation in Cambridge. There are 680 dialogues in which users look for restaurants based on their preferences for food, price range, and area. Table 2(a) shows our results on CamRest676. We use BLEU-4 to measure the quality of the generated sentences, and Success F1 to evaluate the responses on specific slots, such as address, phone, and postcode. Sequicity is the state-of-the-art method among task-oriented dialog approaches that use annotations in a traditional fashion. PRAL beats all the baselines on both BLEU-4 and Success F1, including the state-of-the-art ARDM model. Notably, PRAL does not need any annotation. This suggests that PRAL leverages external knowledge from the pre-training process, and that the proposed techniques are effective for dialog language model pre-training.

We also perform an ablation study on CamRest676 and find that the Teacher GPT plays the most important role. This suggests that knowledge distillation from the large pre-trained model is critical to good performance. Our model also outperforms the DialoGPT baseline, which uses a much larger Reddit dataset (30 GB) in pre-training, compared to the much smaller but higher-quality PretrainDial data (300 MB) used in PRAL. This suggests that the quality of the dataset, rather than its size, is what matters.

MultiWOZ Budzianowski et al. (2018) is a large-scale multi-domain dataset that contains around 10k dialogues covering various domains. We evaluate the models on BLEU-4, Inform Rate, and Success Rate, the last of which measures whether the system provides the requested information. Table 2(b) shows our results. We first compare our model to the attention seq2seq model used as the baseline in MultiWOZ (Budzianowski et al., 2018). We then compare our model with HDSA (Budzianowski et al., 2018) and LaRL Zhao and Kawahara (2019). Our model outperforms or achieves comparable results to HDSA and LaRL. PRAL achieves a much higher BLEU-4 score than LaRL (a 68.8% relative improvement). PRAL outperforms ARDM on all metrics. It is worth noting that our model does not use any annotation.
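The relative BLEU-4 improvement over LaRL quoted above can be reproduced from the MultiWOZ scores in Table 2(b); the short check below does the arithmetic. The helper name is hypothetical.

```python
def relative_improvement(new: float, old: float) -> float:
    """Relative gain of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


# BLEU-4 on MultiWOZ: PRAL 21.6 vs. LaRL 12.8 (from Table 2(b)).
bleu_gain = relative_improvement(21.6, 12.8)
print(round(bleu_gain, 1))
```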

PersuasionForGood We also evaluate our method on a non-collaborative dialog dataset, PersuasionForGood Wang et al. (2019). In PersuasionForGood, a persuader tries to persuade another user to donate money. There are a total of 1,017 dialogues. Unlike in CamRest676 and MultiWOZ, the language in PersuasionForGood is so diverse that the BLEU-4 scores of all models are too low to serve as a reliable metric. Therefore, we use BLEU-1 and BLEU-2 instead. Compared with ARDM, our model achieves significantly higher BLEU scores, especially on BLEU-2 (63% up). We also conduct a human evaluation comparing ARDM and our model. We ask human evaluators how much they are willing to donate after the conversation and collect their ratings of the dialog system in terms of fluency, logic, coherence, and diversity. The human evaluation results suggest that PRAL outperforms ARDM on all the metrics and is a better language model for dialog systems in general. For an example of the persuasion process, please refer to Appendix A.3.

5 Conclusion

We propose PRAL, a large pre-trained language model for task-oriented dialog systems. We successfully incorporated methods designed for large pre-trained language models into PRAL and achieved good performance on three downstream tasks. Specifically, we designed start position randomization, knowledge distillation, and history discount to improve pre-training performance. According to the human evaluation results, the model generates more fluent, coherent, diverse, and logical dialogs. The resulting dialog system also obtained more donations. We also clean a high-quality dialog dataset for the pre-training process. Our work is a first step towards a coherent and engaging dialog model that generalizes to different dialog tasks.


References

  • L. E. Asri, H. Schulz, S. Sharma, J. Zumer, J. Harris, E. Fine, R. Mehrotra, and K. Suleman (2017) Frames: a corpus for adding memory to goal-oriented dialogue systems. arXiv preprint arXiv:1704.00057.
  • P. Budzianowski and I. Vulic (2019) Hello, it’s GPT-2 - how can I help you? towards the use of pretrained language models for task-oriented dialogue systems. CoRR abs/1907.05774.
  • P. Budzianowski, T. Wen, B. Tseng, I. Casanueva, S. Ultes, O. Ramadan, and M. Gasic (2018) MultiWOZ - A large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. CoRR abs/1810.00278.
  • B. Byrne, K. Krishnamoorthi, C. Sankar, A. Neelakantan, D. Duckworth, S. Yavuz, B. Goodrich, A. Dubey, A. Cedilnik, and K. Kim (2019) Taskmaster-1: toward a realistic and diverse dialog dataset. ArXiv abs/1909.05358.
  • C. Challenge. ChitChat dataset. https://github.com/BYU-PCCL/chitchat-dataset
  • K. Challenge. Friends corpus. https://www.kaggle.com/vinayvk/friends-series-data-set
  • C. Danescu-Niculescu-Mizil and L. Lee (2011) Chameleons in imagined conversations: a new approach to understanding coordination of linguistic style in dialogs. In CMCL@ACL.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • J. Fainberg, B. Krause, M. Dobre, M. Damonte, E. Kahembwe, D. Duma, B. L. Webber, and F. Fancellu (2018) Talking to myself: self-dialogues as data for conversational agents. ArXiv abs/1809.06641.
  • G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
  • J. Kirkpatrick, R. Pascanu, N. C. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, D. Hassabis, C. Clopath, D. Kumaran, and R. Hadsell (2016) Overcoming catastrophic forgetting in neural networks. CoRR abs/1612.00796.
  • Y. Li, H. Su, X. Shen, W. Li, Z. Cao, and S. Niu (2017) DailyDialog: a manually labelled multi-turn dialogue dataset. In IJCNLP.
  • G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter (2018) Continual lifelong learning with neural networks: A review. CoRR abs/1802.07569.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. NLP.
  • F. Radlinski, K. Balog, B. Byrne, and K. Krishnamoorthi (2019) Coached conversational preference elicitation: a case study in understanding movie preferences. In SIGDIAL 2019.
  • A. Rastogi, X. Zang, S. Sunkara, R. Gupta, and P. Khaitan (2019) Towards scalable multi-domain conversational agents: the schema-guided dialogue dataset. ArXiv abs/1909.05855.
  • Reddit (2019) Reddit corpus. https://zissou.infosci.cornell.edu/convokit/documentation/subreddit.html
  • X. Wang, W. Shi, R. Kim, Y. Oh, S. Yang, J. Zhang, and Z. Yu (2019) Persuasion for good: towards a personalized persuasive dialogue system for social good. In ACL.
  • T. Wen, D. Vandyke, N. Mrkšić, M. Gašić, L. M. Rojas-Barahona, P. Su, S. Ultes, and S. Young (2017) A network-based end-to-end trainable task-oriented dialogue system. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, Valencia, Spain, pp. 438–449.
  • T. Wolf, V. Sanh, J. Chaumond, and C. Delangue (2019) TransferTransfo: A transfer learning approach for neural network based conversational agents. CoRR abs/1901.08149.
  • Q. Wu, Y. Zhang, Y. Li, and Z. Yu (2019) Alternating recurrent dialog model with large-scale pre-trained language models. arXiv preprint arXiv:1910.03756.
  • J. Zhang, J. P. Chang, C. Danescu-Niculescu-Mizil, L. Dixon, Y. Hua, N. Thain, and D. Taraborelli (2018a) Conversations gone awry: detecting early signs of conversational failure. arXiv preprint arXiv:1805.05345.
  • S. Zhang, E. Dinan, J. Urbanek, A. Szlam, D. Kiela, and J. Weston (2018b) Personalizing dialogue agents: I have a dog, do you have pets too?. CoRR abs/1801.07243.
  • Y. Zhang, S. Sun, M. Galley, Y. Chen, C. Brockett, X. Gao, J. Gao, J. Liu, and B. Dolan (2019) DialoGPT: large-scale generative pre-training for conversational response generation. arXiv:1911.00536.
  • T. Zhao and T. Kawahara (2019) Effective incorporation of speaker information in utterance encoding in dialog. arXiv preprint arXiv:1907.05599.

Appendix A Appendices

A.1 Training Details

We adopt the architecture of ARDM, using two language models to simulate the user and the system. For the language models, we adopt the pre-trained GPT-2 small Radford et al. (2019). For the teacher model, we use GPT-2 large Radford et al. (2019). We follow the same special format as GPT-2 as the “trigger” so that the model can generate dialog responses in a zero-shot manner. In detail, we use “A:” and “B:” as the role prefixes and “\n\n\n” as the suffix. We use the AdamW optimizer. The number of warm-up steps is set to 10 percent of the total number of training steps. The learning rate is set to be . For the calculation of the loss, we set the initial weight of the KL term, $\alpha_0$, to 0.1 and its decay rate, $\beta$, to 0.9999. The discount factor is set to 0.95.
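A small sketch of the input format and warm-up setting described above. It assumes the role prefix is placed directly before each utterance with no extra space and that the speakers strictly alternate, starting with “A:”. Both helper names and the example step count are hypothetical.

```python
from typing import List, Sequence


def format_dialog(utterances: List[str],
                  prefixes: Sequence[str] = ("A:", "B:"),
                  suffix: str = "\n\n\n") -> str:
    """Join utterances with alternating role prefixes and a fixed suffix."""
    return "".join(
        f"{prefixes[i % 2]}{utterance}{suffix}"
        for i, utterance in enumerate(utterances)
    )


def warmup_steps(total_steps: int, fraction: float = 0.1) -> int:
    """Number of warm-up steps as a fraction of all training steps."""
    return round(total_steps * fraction)


text = format_dialog(["Good morning, how are you doing today?", "Good!"])
print(repr(text))
print(warmup_steps(1000))
```

The formatted string is what would be tokenized and fed to the two role-specific language models; the prefix tells the model which speaker is about to talk.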

A.2 Dataset Sources

Our dataset contains high-quality dialogues selected from the 13 other datasets listed in Table 3.

1 CCPE-M dataset Radlinski et al. (2019)
2 ChitChat-Dataset Challenge
3 Conversations Gone Awry Dataset(Wiki) Zhang et al. (2018a)
4 Conversations Gone Awry Dataset(CMV) Zhang et al. (2018a)
5 Cornell Movie-Dialogs Corpus Danescu-Niculescu-Mizil and Lee (2011)
6 DailyDialog Li et al. (2017)
7 Frames Dataset Asri et al. (2017)
8 PersonaChat ConvAI2 Dataset  Zhang et al. (2018b)
9 The Schema-Guided Dialogue Dataset Rastogi et al. (2019)
10 Self-dialogue Corpus Fainberg et al. (2018)
11 Friends Series Dataset Challenge
12 Taskmaster-1 Byrne et al. (2019)
13 Reddit Corpus Reddit (2019)
Table 3: 13 Dataset Sources

A.3 Persuasion Example

One dialogue example is shown in Table 4.

Role                                            Utterance
SYS Good morning, how are you doing today?
USR Good!
SYS Have you ever heard of Save the Children?
USR I have not heard of it. Would you like to give an introduction?
SYS It is a charity that helps children in need around the world.
USR Cool
SYS I would like to give a little bit more information.
USR Sure. Go ahead.
SYS Save the Children is an international non-governmental organization that promotes children’s rights, provides relief and helps support children in developing countries.
USR It is a good organization. I am glad to know it.
SYS Yes, they are a very professional organization. Would you be interested in donating some of your task money to this charity?
USR Sure. I would like to donate some money to it.
SYS Thank you so much!
USR no problem.
SYS How much do you like to donate to the charity now? Your donation will be directly deducted from your task payment.
USR I would like to give $2 to them.
SYS Thank you so much!
USR You are welcome
SYS I hope you have a great day!
USR You too.
Table 4: An example conversation between a machine persuader trained by PRAL and a human persuadee.