PLATO: Pre-trained Dialogue Generation Model with Discrete Latent Variable

by   Siqi Bao, et al.

Pre-training models have proved effective for a wide range of natural language processing tasks. Inspired by this, we propose a novel dialogue generation pre-training framework to support various kinds of conversations, including chit-chat, knowledge grounded dialogues, and conversational question answering. In this framework, we adopt flexible attention mechanisms to fully leverage the bi-directional context and the uni-directional characteristic of language generation. We also introduce discrete latent variables to tackle the inherent one-to-many mapping problem in response generation. Two reciprocal tasks of response generation and latent act recognition are designed and carried out simultaneously within a shared network. Comprehensive experiments on three publicly available datasets verify the effectiveness and superiority of the proposed framework.



1 Introduction

Dialogue generation is a challenging task due to the limited corpus of human conversations, complex background knowledge, and diverse relationships between utterances. Recently, pre-trained large-scale language models, such as BERT Devlin et al. (2019) and XL-Net Yang et al. (2019), have achieved prominent success in natural language processing. Such models are usually constructed based on a massive scale of general text corpora, like English Wikipedia or BooksCorpus Zhu et al. (2015), where distributed representations can be learned automatically from the raw text. By further fine-tuning these representations, breakthroughs have been continuously reported for various downstream tasks, especially those of natural language understanding, such as question answering and natural language inference.

This pre-training and fine-tuning paradigm also sheds light on natural language generation tasks such as dialogue generation. However, previous studies show that directly fine-tuning BERT on small conversation datasets yields deficient performance Rashkin et al. (2019); Wolf et al. (2019), for which there are three possible reasons: 1) the underlying linguistic patterns in human conversations can be highly different from those in general text, resulting in a large gap in knowledge or data distributions; 2) the training mode of uni-directional dialogue generation is distinct from that of bi-directional natural language understanding as applied in BERT; 3) unlike most general NLP tasks, dialogue generation involves a one-to-many relationship, where a piece of context often has multiple appropriate replies.

In this paper, we propose a new method to tackle the above challenges, aiming to obtain a high-quality pre-training model for dialogue generation. First of all, to reduce the gap between data distributions, large-scale Reddit and Twitter conversations are further utilized to pre-train the generation model (upon the basis of language models pre-trained with general text). Secondly, to mitigate the difference of training modes, a flexible paradigm integrating uni- and bi-directional processing is employed in this work, which is inspired by the latest unified language modeling Dong et al. (2019). Thirdly, a discrete latent variable is introduced to model the one-to-many relationship among utterances in conversations.

Each value of the latent variable corresponds to the particular conversational intent of one response, denoted as latent speech act. Unlike controllable dialogue generation based on explicit labels (including emotion, keywords, domain codes and so on) Huang et al. (2018); Keskar et al. (2019), our latent variable is exempt from the restriction of human annotations and can be learned automatically from the corpus in an unsupervised way. To pre-train the model for dialogue generation, two tasks are introduced in this work: response generation and latent act recognition. Both tasks are carried out simultaneously under the unified network architecture with shared parameters. Conditioned on the context and latent variable, the generation task tries to maximize the likelihood of the target response. At the same time, the recognition task aims to estimate the latent variable w.r.t. the given context and target response. Evidently, the accurate estimation of the latent variable is a key factor in boosting the quality of response generation.

We conducted experiments on three different kinds of conversation tasks: chit-chat, knowledge grounded conversation, and conversational question answering. Experimental results verify the effectiveness and superiority of our pre-trained model as compared with the other state-of-the-art methods. Our pre-trained models and source code have been released on GitHub, hoping to facilitate further research progress in dialogue generation.

2 Dialogue Generation Pre-training

Given a piece of context, there exist multiple appropriate responses, leading to diverse conversation flows. It is widely recognized that the capability of modeling the one-to-many relationship is crucial for dialogue generation systems Zhao et al. (2017); Chen et al. (2019). To this end, we propose to encode discrete latent variables into transformer blocks for one-to-many relationship modeling, where two reciprocal tasks of response generation and latent act recognition are collaboratively carried out.

2.1 Model Architecture

In our model, there are the following three elements: dialogue context $c$, response $r$ and latent variable $z$.

  • The dialogue context $c$ consists of several history utterances. (For knowledge grounded conversation, the convention is to concatenate background knowledge into the context as well Wolf et al. (2019).)

  • The response $r$ is one piece of appropriate reply towards the given context.

  • The latent variable $z$ is one $K$-way categorical variable $z \in [1, K]$, with each value corresponding to a particular latent speech act in the response.

The probabilistic relationships among these elements are elaborated as follows (graphical illustration shown in Figure 1). Given a context $c$, there are multiple appropriate speech acts for replies (represented by the latent variable $z$). Conditioned on the context and one chosen latent speech act, the response is produced as $p(r|c,z)$ (gray lines). Given a pair of context and response, the latent speech act behind them can be estimated as $p(z|c,r)$ (dashed blue lines). As such, our pre-training of dialogue generation contains the following two tasks: response generation and latent act recognition.

Figure 1: Graphical illustration of response generation (gray lines) and latent act recognition (dashed blue lines).

We propose a unified infrastructure for the joint learning of both tasks, as shown in Figure 2. The backbone of our infrastructure is inspired by the transformer blocks in Dong et al. (2019), which support both bi-directional encoding and uni-directional decoding flexibly via specific self-attention masks. Both tasks of response generation and latent act recognition are carried out under the unified network with shared parameters. Their detailed implementations are discussed as follows.

Given the context $c$ and a specific speech act $z$, the response generation can be estimated as

$$p(r|c,z) = \prod_{t=1}^{T} p(r_t|c,z,r_{<t}) \tag{1}$$

where $T$ is the length of the target response $r$ and $r_{<t}$ denotes the previously generated words. Since response generation is a uni-directional decoding process, each token in the response can only attend to those ahead of it, shown as dashed orange lines in Figure 2.

Figure 2: Architecture of dialogue generation with discrete latent variable.
Figure 3: Input representation. The input embedding is the sum of corresponding token, role, turn and position embeddings.

The task of latent act recognition is included to identify the corresponding value of $z$ for the given context and target response in the training data. Latent act recognition shares network parameters with response generation, but has separate self-attention masks for bi-directional encoding. As shown in Figure 2, with a special mask symbol [M] as input, it keeps collecting information from the context and target response (red lines). In this way, the corresponding speech act for the target response can be recognized as $z \sim p(z|c,r)$, where $p(z|c,r)$ is the estimated posterior distribution over discrete latent values.
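The two attention patterns described above can be sketched as binary masks. This is a minimal illustration under stated assumptions, not the released implementation: the function names are hypothetical, the latent token and context are treated as one bi-directional prefix, and each response token is assumed to also attend to itself (as in UniLM-style masks).

```python
import numpy as np


def generation_mask(n_prefix: int, n_resp: int) -> np.ndarray:
    """Mask for response generation; mask[i, j] == 1 lets position i attend to j.

    n_prefix counts the latent token plus all context tokens (assumption).
    """
    n = n_prefix + n_resp
    mask = np.zeros((n, n), dtype=np.int64)
    # Prefix tokens (latent + context) attend to each other bi-directionally.
    mask[:n_prefix, :n_prefix] = 1
    # Response tokens attend to the whole prefix ...
    mask[n_prefix:, :n_prefix] = 1
    # ... and to response tokens at or before their own position only.
    mask[n_prefix:, n_prefix:] = np.tril(np.ones((n_resp, n_resp), dtype=np.int64))
    return mask


def recognition_mask(n_total: int) -> np.ndarray:
    """Mask for latent act recognition: fully bi-directional over the input."""
    return np.ones((n_total, n_total), dtype=np.int64)


if __name__ == "__main__":
    print(generation_mask(n_prefix=3, n_resp=2))
```

Sharing one set of transformer parameters while swapping only the mask is what allows both tasks to run in the same network.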

2.2 Input Representation

For multi-turn conversation modeling, elaborate designs have been made on the input representation in this work. The network input includes the latent variable, dialogue context and response. Following the pre-processing of BERT Devlin et al. (2019), the input text is tokenized with WordPiece Wu et al. (2016). For each token, its input embedding is the sum of corresponding token, role, turn and position embeddings. One visual example is shown in Figure 3 and details of the embeddings are described as follows:

  • The input is the concatenation of latent variable, dialogue context and response. A special end-of-sentence [EOS] token is appended to the end of each utterance for separation. Another begin-of-sentence [BOS] token is added at the beginning of the response, whose final hidden state (i.e., output of the last transformer block) is used to predict the next token during generation.

  • Given that $z$ is one $K$-way categorical variable, its token embedding $E_{[z]}$ is mapped from the latent embedding space $E_z \in \mathbb{R}^{K \times D}$. The remaining tokens in the vocabulary are warmed up using BERT’s WordPiece embeddings.

  • Role embeddings are employed to differentiate the characters involved in the conversation. The role embedding $E_A$ is added for the response, as well as dialogue utterances generated by the same character in the context. The role embedding $E_B$ is used for the other character. (For knowledge grounded conversation, $E_C$ is used as the role embedding of background knowledge.)

  • In the interactive conversation, there are multi-turn utterances and we employ relative order in the assignment of turn embeddings. The turn embedding for the response is set to $E_{[0]}$, the turn embedding of its last utterance is $E_{[-1]}$, and so on. Using relative turn embeddings instead of absolute ones enables the model to assign the turn embedding $E_{[0]}$ to the response consistently and shields response generation from the disturbance of its round number within the dialogue.

  • Position embeddings are added according to the token position in each utterance. Note that for the special token of latent variable, its corresponding role, turn and position embeddings are all set to empty.
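As an illustration of the scheme above, the following sketch assigns role, turn and position indices to a tokenized dialogue. The function name, the `"[z]"` placeholder, the role labels `"A"`/`"B"`, the use of `None` for "empty", and zero-based positions are all assumptions made for this example; the embeddings themselves would be looked up from these indices and summed.

```python
from typing import List, Optional, Tuple

EOS, BOS, LATENT = "[EOS]", "[BOS]", "[z]"


def build_inputs(context: List[Tuple[str, List[str]]], response: List[str]):
    """Return parallel lists of tokens, role ids, turn ids and position ids.

    context: (role, tokens) per utterance, oldest first; role "A" is the
    speaker who produces the response.
    """
    tokens: List[str] = [LATENT]
    # The latent token carries no role, turn or position information.
    roles: List[Optional[str]] = [None]
    turns: List[Optional[int]] = [None]
    positions: List[Optional[int]] = [None]

    n_ctx = len(context)
    for i, (role, utterance) in enumerate(context):
        utt = utterance + [EOS]
        turn = i - n_ctx  # relative order: last context utterance gets -1
        tokens += utt
        roles += [role] * len(utt)
        turns += [turn] * len(utt)
        positions += list(range(len(utt)))  # positions restart per utterance

    resp = [BOS] + response + [EOS]
    tokens += resp
    roles += ["A"] * len(resp)
    turns += [0] * len(resp)  # the response always sits at turn 0
    positions += list(range(len(resp)))
    return tokens, roles, turns, positions


if __name__ == "__main__":
    ctx = [("B", ["hi"]), ("A", ["hello", "there"]), ("B", ["how", "are", "you"])]
    for row in zip(*build_inputs(ctx, ["fine"])):
        print(row)
```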

2.3 Pre-training Objectives

We design three kinds of loss functions for dialogue generation pre-training: negative log-likelihood (NLL) loss, bag-of-words (BOW) loss and response selection (RS) loss. A brief illustration is shown in the last column of Figure 2, and detailed descriptions are provided in this section.

2.3.1 Response Generation

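In our model, the response is generated conditioned on the latent variable and the context. The widely adopted NLL loss is employed in the pre-training:

$$\mathcal{L}_{NLL} = -\mathbb{E}_{z \sim p(z|c,r)} \log p(r|c,z) = -\mathbb{E}_{z \sim p(z|c,r)} \sum_{t=1}^{T} \log p(r_t|c,z,r_{<t}) \tag{2}$$

where $z$ is the latent speech act of this training pair $(c, r)$, sampled from the probability distribution $p(z|c,r)$. The posterior distribution over latent values is estimated through the task of latent act recognition:

$$p(z|c,r) = \text{softmax}(W_1 h_{[M]} + b_1) \in \mathbb{R}^{K} \tag{3}$$

where $h_{[M]} \in \mathbb{R}^{D}$ is the final hidden state of the special mask, and $W_1 \in \mathbb{R}^{K \times D}$ and $b_1 \in \mathbb{R}^{K}$ denote the weight matrices of one fully-connected layer.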

Besides the classical NLL loss, the bag-of-words loss Zhao et al. (2017) is also employed to facilitate the training process of latent discrete variables:

$$\mathcal{L}_{BOW} = -\mathbb{E}_{z \sim p(z|c,r)} \sum_{t=1}^{T} \log p(r_t|c,z) = -\mathbb{E}_{z \sim p(z|c,r)} \sum_{t=1}^{T} \log \frac{e^{f_{r_t}}}{\sum_{v \in V} e^{f_v}} \tag{4}$$

where $V$ refers to the whole vocabulary and $f$ is a function that tries to predict the words within the target response in a non-autoregressive way:

$$f = \text{softmax}(W_2 h_z + b_2) \in \mathbb{R}^{|V|} \tag{5}$$

where $h_z$ is the final hidden state of the latent variable and $|V|$ is the vocabulary size. $f_{r_t}$ denotes the estimated probability of word $r_t$. As compared with NLL loss, the BOW loss discards the order of words and forces the latent variable to capture the global information of the target response.
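The contrast between the two generation losses can be sketched numerically. The following is a simplified illustration for a single sampled latent value, not the authors' code: the function names are hypothetical, the inputs are assumed to be unnormalized vocabulary scores, and a single log-softmax is applied in both losses.

```python
import numpy as np


def log_softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last axis."""
    shifted = x - x.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def nll_loss(step_scores: np.ndarray, target_ids: np.ndarray) -> float:
    """Autoregressive NLL: one score vector per decoding step.

    step_scores has shape (T, |V|); row t is conditioned on the words
    generated before step t.
    """
    logp = log_softmax(step_scores)
    return float(-logp[np.arange(len(target_ids)), target_ids].sum())


def bow_loss(latent_scores: np.ndarray, target_ids: np.ndarray) -> float:
    """Bag-of-words loss: one score vector, shape (|V|,), for all target words.

    The scores come from the latent variable's hidden state only, so the
    loss does not depend on word order.
    """
    logp = log_softmax(latent_scores)
    return float(-logp[target_ids].sum())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = np.array([2, 0, 3])
    print(nll_loss(rng.normal(size=(3, 5)), target))
    print(bow_loss(rng.normal(size=5), target))
```

Because `bow_loss` scores every target word against the same distribution, permuting the target leaves its value unchanged, which is the order-discarding property noted above.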

2.3.2 Response Selection

Response selection helps distinguish whether the response is relevant to the dialogue context and consistent with the background knowledge. Meanwhile, its score can be regarded as an indicator of coherence during dialogue generation, helping to select the most coherent one from multiple candidate responses.

Particularly, the training of response selection is carried out together with the bi-directional encoding network of latent act recognition. The positive training samples come from the dialogue context and corresponding target response $(c, r)$, with label $l_r = 1$. The negative samples are created by randomly selecting responses from the corpus $(c, r^-)$, with label $l_{r^-} = 0$. The binary cross-entropy loss of response selection is defined as follows:

$$\mathcal{L}_{RS} = -\log p(l_r = 1|c,r) - \log p(l_{r^-} = 0|c,r^-) \tag{6}$$

The probability is estimated through one fully-connected layer, with the final hidden state of the special mask fed as input:

$$p(l_r = 1|c,r) = \text{sigmoid}(W_3 h_{[M]} + b_3) \tag{7}$$

To sum up, the total objective of our pre-training model is to minimize the integrated loss:

$$\mathcal{L} = \mathcal{L}_{NLL} + \mathcal{L}_{BOW} + \mathcal{L}_{RS} \tag{8}$$
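The response selection loss and the integrated objective can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and the inputs are assumed to be the scalar pre-sigmoid outputs of the fully-connected layer for one positive and one negative pair.

```python
import math


def softplus(x: float) -> float:
    """Stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def rs_loss(pos_score: float, neg_score: float) -> float:
    """Binary cross-entropy over one positive and one negative pair.

    -log sigmoid(pos) equals softplus(-pos), and
    -log(1 - sigmoid(neg)) equals softplus(neg).
    """
    return softplus(-pos_score) + softplus(neg_score)


def total_loss(l_nll: float, l_bow: float, l_rs: float) -> float:
    """Integrated objective: unweighted sum of the three losses."""
    return l_nll + l_bow + l_rs


if __name__ == "__main__":
    print(rs_loss(pos_score=3.0, neg_score=-2.0))
```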
2.4 Pre-training Procedure

Our pre-training model contains 12 transformer blocks, with its network parameters initialized using BERT$_{\text{BASE}}$. Large-scale conversation datasets, Twitter Cho et al. (2014) and Reddit Zhou et al. (2018); Galley et al. (2019), are employed for pre-training, resulting in 8.3 million training samples in total. Each training sample of context and target response $(c, r)$ needs to pass through the network twice to accomplish the tasks of latent act recognition and response generation. The pre-training steps are summarized as follows:

  1. Latent Act Recognition

    • Given a pair of context and target response, estimate the posterior distribution $p(z|c,r)$

    • Randomly select $r^-$ and calculate $\mathcal{L}_{RS}$

  2. Response Generation

    • With the sampled latent value $z \sim p(z|c,r)$, calculate $\mathcal{L}_{NLL}$ and $\mathcal{L}_{BOW}$

  3. Optimization

    • Sum up to obtain $\mathcal{L}$, and update network parameters with back-propagation
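The control flow of one pre-training step can be sketched as below. The three callables (`recognize`, `select_loss`, `generate_losses`) are hypothetical stand-ins for the two passes through the shared network; the sketch only shows how the passes are ordered and how the losses are combined, and it leaves the parameter update to whichever training framework is used. It also assumes the corpus contains at least one response different from the target.

```python
import random
from typing import Callable, List, Sequence, Tuple


def pretrain_step(
    recognize: Callable[[str, str], Sequence[float]],
    select_loss: Callable[[str, str, str], float],
    generate_losses: Callable[[str, str, int], Tuple[float, float]],
    context: str,
    response: str,
    corpus: List[str],
    rng: random.Random,
) -> Tuple[float, int]:
    """Return the integrated loss and the sampled latent value for one sample."""
    # Pass 1 (bi-directional): posterior over latent values.
    posterior = recognize(context, response)
    # Draw a random negative response and compute the selection loss.
    negative = rng.choice([r for r in corpus if r != response])
    l_rs = select_loss(context, response, negative)
    # Sample a latent value from the posterior (zero-based index here).
    z = rng.choices(range(len(posterior)), weights=posterior, k=1)[0]
    # Pass 2 (uni-directional): generation losses given the sampled value.
    l_nll, l_bow = generate_losses(context, response, z)
    # The summed loss would then be back-propagated by the framework.
    return l_nll + l_bow + l_rs, z
```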

The hyper-parameters used in pre-training are listed as follows. The maximum sequence length of context and response is set to 256 and 50, respectively. The number of transformer blocks in our model is 12 and the hidden embedding dimension is 768. The batch size is set to 64 and $K$ is set to 20 for the discrete latent variable. Adam optimizer Kingma and Ba (2015) is employed for optimization with a learning rate of 5e-5. The pre-training of dialogue generation was carried out on 8 Nvidia Tesla V100 32G GPU cards for 3.5M steps, taking approximately two weeks to reach convergence.

2.5 Fine-tuning and Inference

Our pre-trained model is flexible enough to support various kinds of dialogues, including chit-chat, knowledge grounded conversations, conversational question answering and so on. The fine-tuning on small conversation datasets can be carried out following the training objectives defined in Equation (8). As the fine-tuning process reaches convergence, the response towards the given context can be obtained through the following inference procedure:

  1. Candidate Response Generation

    • Conditioned on each latent value $z \in [1, K]$, generate the corresponding candidate response $r$

  2. Response Selection

    • Calculate the probability $p(l_r = 1|c,r)$ for each response and select the one with the highest value as the final response
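The inference procedure amounts to a generate-then-rank loop, sketched below. The callables `generate` and `score` are hypothetical placeholders for the fine-tuned model's decoding and response selection heads, and latent values are zero-indexed here for convenience.

```python
from typing import Callable, List, Tuple


def infer(
    context: str,
    generate: Callable[[str, int], str],
    score: Callable[[str, str], float],
    num_latent: int,
) -> Tuple[str, List[str]]:
    """Generate one candidate per latent value and return the best-scored one."""
    # Step 1: one candidate response for each latent value.
    candidates = [generate(context, z) for z in range(num_latent)]
    # Step 2: score each candidate for coherence with the context.
    scores = [score(context, r) for r in candidates]
    best = max(range(num_latent), key=scores.__getitem__)
    return candidates[best], candidates
```

Returning all candidates alongside the selected one makes it easy to inspect how the latent values diversify the replies.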

It is worth noting that the above fine-tuning and inference procedures are set up for dialogue generation without any specific objectives. If there exists a specific objective within the conversation, such as letting both participants know more about each other Bao et al. (2019), the fine-tuning can proceed to maximize the pre-defined rewards with reinforcement learning (RL). Under such circumstances, our latent discrete variable can be naturally treated as an action within RL, and thus the response selection can be straightforwardly solved by selecting the action that results in the maximum reward.

3 Experiments

3.1 Settings

3.1.1 Datasets

To evaluate the performance of our proposed method, comprehensive experiments have been carried out on three publicly available datasets.

  • Persona-Chat Zhang et al. (2018) provides both manually annotated conversations and corresponding persona profiles (background knowledge), where two participants chat naturally and try to get to know each other.

  • Daily Dialog Li et al. (2017) is a chit-chat dataset, which contains high-quality human conversations about daily life.

  • DSTC7-AVSD Alamri et al. (2019), short for Audio Visual Scene-aware Dialog of the DSTC7 challenge, is a conversational question answering dataset. In DSTC7-AVSD, the system needs to generate an answer given the dialogue context and background knowledge. There are two available options of knowledge utilization: 1) using single-modal information of text only, including the video’s caption and summary; 2) relying on multi-modal information, including text, audio and visual features. The single-modal option is adopted by our method in the experiments.

The descriptions and statistics of these datasets are summarized in Table 1.

3.1.2 Compared Methods

The following models have been compared in the experiments.

Baseline. Sequence to sequence with attention (Seq2Seq) Vinyals and Le (2015) is employed as the baseline for the experiments on Persona-Chat and Daily Dialog. DSTC7-AVSD has provided a baseline system, which is built upon hierarchical recurrent encoders with multi-modal features.

State of the art. The Persona-Chat dataset is also utilized in the ConvAI2 challenge Dinan et al. (2019a), where the team of Lost in Conversation (LIC) Golovanov et al. (2019) obtains the best performance. LIC is also a transformer-based generation method, fine-tuned upon the pre-trained model of GPT Radford et al. (2018). For the dataset of Daily Dialog, the best results are reported by the recently developed method iVAEMI Fang et al. (2019), which generates diverse responses with sample-based latent representation. In DSTC7-AVSD, the team of CMU Sanabria et al. (2019) obtains the best performance across all the evaluation metrics.

Our method. To better analyze the effects of the latent discrete variable in our method, we also compare to the version without latent variable (Our w/o Latent), under the same training settings. (The network parameters of Our w/o Latent are also first initialized with BERT$_{\text{BASE}}$. The pre-training is then carried out on Reddit and Twitter, with the objective of minimizing NLL loss. The fine-tuning follows the same objective as pre-training on down-stream datasets.)

Table 1: Summary of datasets used in the experiments.
Table 2: Experimental results on Persona-Chat and Daily Dialog with automatic and human evaluations, with highest value written in bold.
Table 3: Experimental results on DSTC7-AVSD with automatic evaluation, with highest value written in bold.
Table 4: Examples of response generation with our pre-trained model.
Table 5: Case analysis of response generation on Persona-Chat.

3.1.3 Evaluation Metrics

Both automatic and human evaluations are employed to assess the performance of compared methods. In automatic evaluation, the following metrics are included:

  • BLEU Chen and Cherry (2014) measures the n-gram overlap between the generated response and the target response.

  • Distinct-1/2 Li et al. (2016) measures the generation diversity, which is defined as the number of distinct uni- or bi-grams divided by the total amount of generated words.

  • Knowledge R/P/F1 Dinan et al. (2019b) measures the degree of informativeness w.r.t. background knowledge, defined as:

$$\text{Recall} = \frac{|W_G \cap W_K|}{|W_K|}, \quad \text{Precision} = \frac{|W_G \cap W_K|}{|W_G|}, \quad \text{F1} = \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}} \tag{9}$$

    where $W_G$ and $W_K$ refer to the sets of non-stop words in the generated responses and background knowledge respectively.

  • In DSTC7-AVSD, the MSCOCO platform Chen et al. (2015) is employed for evaluation. It compares the generated response with six ground truth responses, using metrics of BLEU, METEOR, ROUGE-L and CIDEr.
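Two of the metrics above are simple enough to sketch directly. The function names are hypothetical, tokenization and stop-word filtering are assumed to have been done beforehand, and the knowledge metric assumes recall and precision are the word-set overlap normalized by the knowledge set and the generated set respectively.

```python
from typing import List, Set, Tuple


def distinct_n(responses: List[List[str]], n: int) -> float:
    """Number of distinct n-grams divided by the total number of generated words."""
    ngrams = set()
    total_words = 0
    for tokens in responses:
        total_words += len(tokens)
        ngrams.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(ngrams) / total_words if total_words else 0.0


def knowledge_prf(generated: Set[str], knowledge: Set[str]) -> Tuple[float, float, float]:
    """Recall, precision and F1 of non-stop words against background knowledge."""
    overlap = len(generated & knowledge)
    recall = overlap / len(knowledge) if knowledge else 0.0
    precision = overlap / len(generated) if generated else 0.0
    denom = recall + precision
    f1 = 2 * recall * precision / denom if denom else 0.0
    return recall, precision, f1


if __name__ == "__main__":
    replies = [["i", "like", "cats"], ["i", "like", "dogs"]]
    print(distinct_n(replies, 1), distinct_n(replies, 2))
    print(knowledge_prf({"cats", "dogs", "run"}, {"cats", "fish"}))
```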

In human evaluation, we randomly select 100 dialogue contexts and generate responses with compared methods. Three crowd-sourcing workers are asked to score the response quality on a scale of [0, 1, 2] from four aspects: fluency, coherence, informativeness and overall. The higher the score, the better. Details about the criteria are given as follows.

  • Fluency measures whether the generated sentence is smooth and grammatically correct.

  • Coherence evaluates whether the generated response is relevant to the context and consistent with the expressed information or background knowledge.

  • Informativeness assesses the information contained in the generated response.

  • Overall represents the general evaluation, where 0 indicates a bad response, 1 corresponds to a normal response and 2 stands for a good response.

After collecting the assessments from three crowd-sourcing workers, the response’s final score is determined via majority voting. The average Fleiss’s kappa Fleiss and Cohen (1973) on Persona-Chat and Daily Dialog is 0.515 and 0.480 respectively, indicating annotators have reached moderate agreement.
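For reference, Fleiss's kappa can be computed from a table of per-item category counts as sketched below. The function name and the input layout are assumptions for this example; every item is assumed to be rated by the same number of annotators (three in the setup above), and the ratings must not all fall into a single category, since the chance agreement would then be 1.

```python
from typing import List


def fleiss_kappa(counts: List[List[int]]) -> float:
    """Fleiss's kappa for counts[i][j] = raters assigning item i to category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    # Share of all assignments that went to each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters) for j in range(n_cats)
    ]
    # Observed agreement per item, then averaged over items.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_item) / n_items
    # Agreement expected by chance.
    p_exp = sum(p * p for p in p_cat)
    return (p_bar - p_exp) / (1 - p_exp)


if __name__ == "__main__":
    # Two responses scored on the scale [0, 1, 2] by three workers each.
    print(fleiss_kappa([[2, 1, 0], [0, 3, 0]]))
```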

3.2 Experimental Results

The experimental results on Persona-Chat and Daily Dialog with automatic and human evaluations are summarized in Table 2. During automatic evaluation, BLEU-1/2 measures the overlap between generated response and ground truth, Distinct-1/2 assesses the diversity of words in generation and Knowledge R/P/F1 evaluates the information expression w.r.t. background knowledge. However, the results demonstrate that no method can consistently outperform the others under automatic evaluation. As shown in the empirical study Liu et al. (2016), there is a weak correlation between automatic metrics and human judgments in open-domain dialogue generation. As such, it is suggested to treat these automatic evaluations as a reference and put emphasis on human evaluations.

During human evaluations, it is shown that our method obtains consistently better performance across all the metrics on Persona-Chat and Daily Dialog. The scores of fluency almost approach the upper bound, revealing that our generated responses are very fluent. The informativeness assessments indicate that the information in our generated responses is significantly richer, as compared with the baseline methods. Our responses are coherent with the context and favored most by crowd-sourcing workers. The ablation study with our method and our w/o latent also suggests that through the incorporation of discrete latent variables, remarkable improvements can be achieved for dialogue generation. Besides, it can be observed that the generation quality of transformer-based approaches (LIC and our method) is significantly better than that of RNN-based methods (Seq2Seq and iVAEMI). (It is to be expected that the performance of our w/o latent is close to that of LIC: both initialize network parameters with pre-trained language models, continue training with large-scale conversation data such as Reddit, and adopt NLL-related objectives.)

The experimental results on DSTC7-AVSD with automatic evaluation are provided in Table 3. Unlike the above chit-chat datasets, there are six ground truth responses in DSTC7-AVSD, which makes the automatic evaluation more effective and better aligned with human judgments. In the experiments, our response selection is strengthened with an extra ranking step, which ranks the candidates according to the automatic scores and selects the top one as the final answer. The results in Table 3 demonstrate that our method has brought a new breakthrough for DSTC7-AVSD. Additionally, the upper bound of our method is also reported, under the ideal scenario that the optimal candidate answer can be selected. (Given a dialogue context and background knowledge, our model is able to generate diverse responses. Each of them will be evaluated and the one obtaining the best score will be treated as the optimal candidate answer.) These upper-bound results validate the great potential of our approach.

3.3 Discussions

3.3.1 Case Analysis

To further dissect the quality of our pre-trained model, several examples of generated responses are provided in Table 4. For each piece of context, our model can produce multiple responses by assigning distinct values to the latent variable and five candidate responses are selected for display in the table. It shows that our pre-trained model is able to generate diverse and appropriate responses. Interestingly, as the training corpus includes conversations from Reddit threads, some URLs may interweave with dialogue utterances. It seems that this pattern has been captured by the latent variable and sometimes our model generates related Wikipedia links as the reply.

Table 5 provides cases of our method and compared approaches on Persona-Chat, where two participants chat with each other according to their personas. As shown in the example, participant P2 needs to produce a response towards the given dialogue context, conditioned on his/her persona profile. The baseline Seq2Seq tends to generate common replies with low informativeness and poor coherence. LIC and Our w/o Latent are able to produce some coherent responses, but these are deficient in informativeness. In comparison, the response by our method is not only coherent with the context, but also expressive of the background personas.

3.3.2 Comparison of Pre-trained Models

To further analyze the effectiveness of our pre-trained model, ablation studies have been conducted on Daily Dialog. The compared methods include the baseline Seq2Seq, direct fine-tuning of BERT, GPT-2 Radford et al. (2019) and our pre-trained model. And there are three different sizes of training dialogues: 1k, 5k and 11k (total training data). The experimental results measured with perplexity are summarized in Table 6. These results demonstrate that our method outperforms the baseline and other pre-training models consistently with lower perplexity across different training sets. Even with the low-resource data of 1k conversations, our model can still obtain prominent performance.

Several interesting conclusions can also be drawn from these results. Firstly, the comparison between BERT and GPT-2 fine-tuning indicates that uni-directional pre-trained models are more suitable for dialogue generation. Secondly, our method obtains better performance than GPT-2, which may result from three aspects: 1) our pre-training is carried out with the datasets of Reddit and Twitter, which are closer to human conversations as compared with general text; 2) in the pre-training, we adopt more flexible attention mechanisms to fully leverage the bi-directional and uni-directional information within the context and response; 3) our model has effectively modeled the one-to-many relationship with the discrete latent variable, whose effect has been verified in Table 2.

Table 6: Perplexity of different pre-trained models on Daily Dialog, with best value written in bold.

4 Related Work

Related work contains pre-trained language models and one-to-many modeling in dialogue generation.

Pre-trained Language Models. Pre-trained language models, which are trained with large-scale general text, have brought many breakthroughs on various NLP tasks. These models can be roughly divided into two categories according to their attention mechanisms. GPT Radford et al. (2018) and GPT-2 Radford et al. (2019) are representative uni-directional language models, where one token is only allowed to attend to its previous tokens and the objective is to maximize left-to-right generation likelihood. BERT Devlin et al. (2019) and XL-Net Yang et al. (2019) are bi-directional language models, where bi-directional context attention is enabled for token prediction. The latest unified language model UniLM Dong et al. (2019) is able to support both uni- and bi-directional attention with flexible self-attention mask designs. Recently, some attempts Golovanov et al. (2019); Wolf et al. (2019) have been made to adapt the generative language models GPT or GPT-2 for dialogue generation. However, the special issues of conversations, such as impacts from background knowledge and problems of one-to-many relationship, are not fully considered and tackled in these adaptations.

One-to-many Modeling. Given one piece of context, there exist multiple appropriate responses, which is known as the one-to-many mapping problem. To model this one-to-many relationship, CVAE Zhao et al. (2017) employs a Gaussian distribution to capture the discourse-level variations of responses. To alleviate the issue of posterior collapse in VAE, some extension approaches are further developed, including the conditional Wasserstein auto-encoder of DialogWAE Gu et al. (2019) and the implicit feature learning of iVAEMI Fang et al. (2019). Besides the continuous representation in VAE, discrete categorical variables are also utilized for interpretable generation Zhao et al. (2018). Additionally, multiple mapping modules as latent mechanisms are introduced for diverse generation Chen et al. (2019), where accurate optimization is carried out via posterior mapping selection. However, due to the small scale of annotated conversation data and limited capacity of the generation network, it remains challenging for these methods to balance diversity and fluency during response generation.

5 Conclusion

A novel pre-training model for dialogue generation is introduced in this paper, incorporated with latent discrete variables for one-to-many relationship modeling. To pre-train our model, two reciprocal tasks of response generation and latent recognition are carried out simultaneously on large-scale conversation datasets. Our pre-trained model is flexible enough to handle various down-stream tasks of dialogue generation. Extensive and intensive experiments have been carried out on three publicly available datasets. And the results demonstrate that our model obtains significant improvements over the other state-of-the-art methods.

Our work can potentially be improved with more fine-grained latent variables. We will also explore boosting the latent selection policy with reinforcement learning and extending our pre-training to support dialogue generation in other languages.


We would like to thank Chaotao Chen, Junkun Chen, Tong Wu and Wenxia Zheng for their generous help. This work was supported by the National Key Research and Development Project of China (No. 2018AAA0101900), and the Natural Science Foundation of China (No.61533018).


References

  • H. Alamri, V. Cartillier, A. Das, J. Wang, A. Cherian, I. Essa, D. Batra, T. K. Marks, C. Hori, P. Anderson, et al. (2019) Audio visual scene-aware dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7558–7567.
  • S. Bao, H. He, F. Wang, R. Lian, and H. Wu (2019) Know more about each other: evolving dialogue strategy via compound assessment. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5382–5391.
  • B. Chen and C. Cherry (2014) A systematic comparison of smoothing techniques for sentence-level BLEU. In Proceedings of the 9th Workshop on Statistical Machine Translation, pp. 362–367.
  • C. Chen, J. Peng, F. Wang, J. Xu, and H. Wu (2019) Generating multiple diverse responses with multi-mapping and posterior mapping selection. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp. 4918–4924.
  • X. Chen, H. Fang, T. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick (2015) Microsoft COCO captions: data collection and evaluation server. arXiv preprint arXiv:1504.00325.
  • K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pp. 1724–1734.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171–4186.
  • E. Dinan, V. Logacheva, V. Malykh, A. Miller, K. Shuster, J. Urbanek, D. Kiela, A. Szlam, I. Serban, R. Lowe, et al. (2019a) The second conversational intelligence challenge (ConvAI2). arXiv preprint arXiv:1902.00098.
  • E. Dinan, S. Roller, K. Shuster, A. Fan, M. Auli, and J. Weston (2019b) Wizard of Wikipedia: knowledge-powered conversational agents. In International Conference on Learning Representations.
  • L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y. Wang, J. Gao, M. Zhou, and H. Hon (2019) Unified language model pre-training for natural language understanding and generation. arXiv preprint arXiv:1905.03197.
  • L. Fang, C. Li, J. Gao, W. Dong, and C. Chen (2019) Implicit deep latent variable models for text generation. arXiv preprint arXiv:1908.11527.
  • J. L. Fleiss and J. Cohen (1973) The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement, pp. 613–619.
  • M. Galley, C. Brockett, X. Gao, J. Gao, and B. Dolan (2019) Grounded response generation task at DSTC7. In AAAI Dialog System Technology Challenge Workshop.
  • S. Golovanov, R. Kurbanov, S. Nikolenko, K. Truskovskyi, A. Tselousov, and T. Wolf (2019) Large-scale transfer learning for natural language generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 6053–6058.
  • X. Gu, K. Cho, J. Ha, and S. Kim (2019) DialogWAE: multimodal response generation with conditional Wasserstein auto-encoder. In International Conference on Learning Representations.
  • C. Huang, O. Zaiane, A. Trabelsi, and N. Dziri (2018) Automatic dialogue generation with expressed emotions. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 49–54.
  • N. S. Keskar, B. McCann, L. Varshney, C. Xiong, and R. Socher (2019) CTRL: a conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858.
  • D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations.
  • J. Li, M. Galley, C. Brockett, J. Gao, and B. Dolan (2016) A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 110–119.
  • Y. Li, H. Su, X. Shen, W. Li, Z. Cao, and S. Niu (2017) DailyDialog: a manually labelled multi-turn dialogue dataset. In Proceedings of the 8th International Joint Conference on Natural Language Processing, pp. 986–995.
  • C. Liu, R. Lowe, I. Serban, M. Noseworthy, L. Charlin, and J. Pineau (2016) How not to evaluate your dialogue system: an empirical study of unsupervised evaluation metrics for dialogue response generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2122–2132.
  • A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training. Technical report, OpenAI.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. Technical report, OpenAI.
  • H. Rashkin, E. M. Smith, M. Li, and Y. Boureau (2019) Towards empathetic open-domain conversation models: a new benchmark and dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5370–5381.
  • R. Sanabria, S. Palaskar, and F. Metze (2019) CMU Sinbad’s submission for the DSTC7 AVSD challenge. In AAAI Dialog System Technology Challenge Workshop.
  • O. Vinyals and Q. Le (2015) A neural conversational model. arXiv preprint arXiv:1506.05869.
  • T. Wolf, V. Sanh, J. Chaumond, and C. Delangue (2019) TransferTransfo: a transfer learning approach for neural network based conversational agents. arXiv preprint arXiv:1901.08149.
  • Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. (2016) Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
  • Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.
  • S. Zhang, E. Dinan, J. Urbanek, A. Szlam, D. Kiela, and J. Weston (2018) Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pp. 2204–2213.
  • T. Zhao, K. Lee, and M. Eskenazi (2018) Unsupervised discrete sentence representation learning for interpretable neural dialog generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pp. 1098–1107.
  • T. Zhao, R. Zhao, and M. Eskenazi (2017) Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pp. 654–664.
  • H. Zhou, T. Young, M. Huang, H. Zhao, J. Xu, and X. Zhu (2018) Commonsense knowledge aware conversation generation with graph attention. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pp. 4623–4629.
  • Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler (2015) Aligning books and movies: towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pp. 19–27.