PLATO-2: Towards Building an Open-Domain Chatbot via Curriculum Learning

06/30/2020 · Siqi Bao et al. · Baidu, Inc.

To build a high-quality open-domain chatbot, we introduce the effective training process of PLATO-2 via curriculum learning. There are two stages involved in the learning process. In the first stage, a coarse-grained generation model is trained to learn response generation under the simplified framework of one-to-one mapping. In the second stage, a fine-grained generation model and an evaluation model are further trained to learn diverse response generation and response coherence estimation, respectively. PLATO-2 is trained on both Chinese and English data, and its effectiveness and superiority are verified through comprehensive evaluations, where it achieves new state-of-the-art results.

1 Introduction

Recently, task-agnostic pre-training with large-scale transformer models Vaswani et al. (2017) and general text corpora has achieved great success in natural language understanding Devlin et al. (2019) as well as natural language generation, especially open-domain dialogue generation. For instance, based on the general language model GPT-2 Radford et al. (2019), DialoGPT Zhang et al. (2020) is further trained for response generation using Reddit comments. To obtain a human-like open-domain chatbot, Meena Adiwardana et al. (2020) scales up the network parameters to 2.6B and employs more social media conversations in the training process, leading to significant improvement in response quality. To mitigate undesirable toxic or biased traits of large corpora, Blender Roller et al. (2020) further fine-tunes the pre-trained model with human annotated datasets and emphasizes desirable conversational skills of engagingness, knowledge, empathy and personality.

Besides the above attempts at model scale and data selection, PLATO Bao et al. (2020) aims to tackle the inherent one-to-many mapping problem to improve response quality. One-to-many mapping refers to the fact that a single dialogue context may correspond to multiple appropriate responses. It is widely recognized that the ability to model this one-to-many relationship is crucial for response generation Zhao et al. (2017); Chen et al. (2019). PLATO explicitly models the one-to-many relationship via discrete latent variables, aiming to boost the quality of dialogue generation.

In this work, we scale up PLATO to PLATO-2 and discuss its effective training scheme via curriculum learning Bengio et al. (2009). There are two stages in the learning process, as sketched in Figure 1. In the first stage, under simplified one-to-one mapping modeling, a coarse-grained generation model is trained to produce appropriate responses for different conversation contexts. The second stage refines the generation with a fine-grained generation model and an evaluation model. The fine-grained generation model explicitly models the one-to-many mapping relationship for diverse response generation. To select the most appropriate response among those generated by the fine-grained generation model, the evaluation model is trained to estimate the coherence of the responses.

Figure 1: Curriculum learning process in PLATO-2.

As for response selection, previous studies have employed various scoring functions, including the forward response generation probability Adiwardana et al. (2020), the backward context recovery probability Zhang et al. (2020), and the bi-directional coherence probability Bao et al. (2020). However, the forward score favors safe and generic responses due to the property of maximum likelihood, while the backward score tends to select responses with high overlap with the context, resulting in repetitive conversations. To ameliorate these problems, we adopt bi-directional coherence estimation in the evaluation model of PLATO-2, whose effectiveness is also verified in the experiments.

We trained PLATO-2 models of two sizes: 1.6 billion parameters and 310 million parameters. In addition to the English models, we also trained Chinese models with massive social media conversations. Comprehensive experiments on both English and Chinese datasets demonstrate that PLATO-2 outperforms the state-of-the-art models. We have released our English models and source code on GitHub (https://github.com/PaddlePaddle/Knover/tree/master/plato-2), hoping to facilitate research in open-domain dialogue generation.

2 Methodology

2.1 Model Architecture

The infrastructure of PLATO-2 is shown in Figure 2(a), consisting of stacked transformer blocks. The key components of a transformer block include layer normalization, multi-head attention and feed-forward layers. In the arrangement of these components, there are two options regarding the location of layer normalization. One is the original post-normalization used in BERT Devlin et al. (2019), where layer normalization is placed between residual connections. The other is the more recent pre-normalization used in GPT-2 Radford et al. (2019), where layer normalization is placed within residual connections. As reported in Megatron-LM Shoeybi et al. (2019), post-normalization leads to performance degradation as the model size increases, whereas pre-normalization enables stable training of large-scale models. As such, pre-normalization is adopted in our model.
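To make the difference concrete, the following is a minimal sketch of a pre-normalization transformer block written in PyTorch. It is not the PLATO-2 implementation (which is built on PaddlePaddle); the hidden size and head count are placeholders borrowed from the 1.6B configuration described in Section 3.2.

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Transformer block with pre-normalization (GPT-2 style):
    LayerNorm is applied inside each residual branch, before the sub-layer."""
    def __init__(self, hidden_size=2048, num_heads=32):
        super().__init__()
        self.ln1 = nn.LayerNorm(hidden_size)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(hidden_size)
        self.ffn = nn.Sequential(
            nn.Linear(hidden_size, 4 * hidden_size),
            nn.GELU(),
            nn.Linear(4 * hidden_size, hidden_size),
        )

    def forward(self, x, attn_mask=None):
        # Pre-norm: normalize, apply the sub-layer, then add the residual.
        h = self.ln1(x)
        h, _ = self.attn(h, h, h, attn_mask=attn_mask)
        x = x + h
        x = x + self.ffn(self.ln2(x))
        return x
```

A post-normalization block would instead compute x = LayerNorm(x + sublayer(x)), i.e. place the normalization between the residual connections.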

Figure 2: PLATO-2 illustration. (a) Network overview with the details of transformer blocks. (b) Curriculum learning process with self-attention visualization and training objectives.

Besides, unlike conventional Seq2Seq models, there are no separate encoder and decoder networks in our infrastructure. For the sake of training efficiency, PLATO-2 keeps a unified network for bi-directional context encoding and uni-directional response generation through a flexible attention mechanism Dong et al. (2019); Bao et al. (2020).

2.2 Curriculum Learning

In this work, we carry out effective training of PLATO-2 via curriculum learning. As shown in Figure 2(b), there are two stages involved in the learning process: during stage 1, a coarse-grained baseline model is trained for general response generation under the simplified one-to-one mapping relationship; during stage 2, two models of fine-grained generation and evaluation are further trained for diverse response generation and response coherence estimation respectively. Although the backbones of these models are all transformer blocks, the self-attention masks are designed accordingly in order to fit the training objectives.

2.2.1 General Response Generation

It is well known that there exists a one-to-many relationship in conversations, where one piece of context may have multiple appropriate responses. Since the classical generation network is designed to fit one-to-one mapping, it tends to generate generic and dull responses. Nevertheless, one-to-one modeling is still an efficient way to capture the general characteristics of response generation. As such, we first train a coarse-grained baseline model to learn general response generation under the simplified relationship of one-to-one mapping. Given one training sample of context $c$ and response $r$, we minimize the following negative log-likelihood (NLL) loss:

$$\mathcal{L}_{\mathrm{NLL}}^{\mathrm{Baseline}} = -\,\mathbb{E}\,\log p(r \mid c) = -\,\mathbb{E}\sum_{t=1}^{T}\log p(r_t \mid c, r_{<t}) \qquad (1)$$

where $T$ is the length of the target response $r$ and $r_{<t}$ denotes the previously generated words. Since response generation is a uni-directional decoding process, each token in the response only attends to the tokens before it, shown as dashed orange lines in Figure 2. As for the context, bi-directional attention is enabled for better natural language understanding, shown as blue lines in Figure 2.
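The following sketch builds such an attention mask for a single (context, response) pair; the function name and the boolean convention (True means "may attend") are ours for illustration and do not come from the released code.

```python
import torch

def build_unified_mask(context_len: int, response_len: int) -> torch.Tensor:
    """Attention mask for the unified network: context tokens attend
    bi-directionally within the context, while response tokens attend to
    the whole context plus the response tokens generated so far (causal)."""
    total = context_len + response_len
    mask = torch.zeros(total, total, dtype=torch.bool)
    # Context rows: full bi-directional attention over the context.
    mask[:context_len, :context_len] = True
    # Response rows: attend to every context token ...
    mask[context_len:, :context_len] = True
    # ... and causally to response tokens up to the current position.
    mask[context_len:, context_len:] = torch.ones(response_len, response_len).tril().bool()
    return mask

# Example: 4 context tokens followed by 3 response tokens.
print(build_unified_mask(4, 3).int())
```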

2.2.2 Diverse Response Generation

Based upon the coarse-grained baseline model, diverse response generation is further trained under the relationship of one-to-many mapping. Following our previous work PLATO Bao et al. (2020), a discrete latent variable is introduced to model the one-to-many relationship. The model first estimates the latent act distribution $p(\mathbf{z} \mid c, r)$ of the training sample and then generates the response conditioned on a sampled latent variable $z$. The NLL loss of diverse response generation is defined as follows:

$$\mathcal{L}_{\mathrm{NLL}}^{\mathrm{Generation}} = -\,\mathbb{E}_{z \sim p(\mathbf{z} \mid c, r)}\,\log p(r \mid c, z) = -\,\mathbb{E}_{z \sim p(\mathbf{z} \mid c, r)}\sum_{t=1}^{T}\log p(r_t \mid c, z, r_{<t}) \qquad (2)$$

where $z$ is the latent act sampled from $p(\mathbf{z} \mid c, r)$. This posterior distribution over latent values is estimated through the task of latent act recognition:

$$p(\mathbf{z} \mid c, r) = \mathrm{softmax}\left(W_1 h_{[\mathrm{M}]} + b_1\right) \qquad (3)$$

where $h_{[\mathrm{M}]}$ is the final hidden state of the special mask token [M], and $W_1$ and $b_1$ denote the weight matrix and bias of one fully-connected layer.

Besides the classical NLL loss, the bag-of-words (BOW) loss Zhao et al. (2017) is also employed to facilitate the training process of discrete latent variables:

$$\mathcal{L}_{\mathrm{BOW}}^{\mathrm{Generation}} = -\,\mathbb{E}_{z \sim p(\mathbf{z} \mid c, r)}\sum_{t=1}^{T}\log f_{r_t} \qquad (4)$$

where $V$ refers to the whole vocabulary. The function $f$ tries to predict the words within the target response in a non-autoregressive way:

$$f = \mathrm{softmax}\left(W_2 h_z + b_2\right) \in \mathbb{R}^{|V|} \qquad (5)$$

where $h_z$ is the final hidden state of the latent variable and $|V|$ is the vocabulary size. $f_{r_t}$ denotes the estimated probability of word $r_t$. As compared with the NLL loss, the BOW loss discards the order of words and forces the latent variable to capture the global information of the target response.

To sum up, the objective of the fine-grained generation model is to minimize the following integrated loss:

$$\mathcal{L}^{\mathrm{Generation}} = \mathcal{L}_{\mathrm{NLL}}^{\mathrm{Generation}} + \mathcal{L}_{\mathrm{BOW}}^{\mathrm{Generation}} \qquad (6)$$
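The toy sketch below shows how the NLL and BOW terms could be combined for a single sampled latent value $z$. The network forward pass producing the logits and the sampling from $p(\mathbf{z} \mid c, r)$ are omitted; the tensor shapes and function names are illustrative rather than taken from the released code.

```python
import torch
import torch.nn.functional as F

def fine_grained_loss(step_logits, bow_logits, response_ids):
    """step_logits:  [T, |V|] per-step decoder logits conditioned on (c, z).
    bow_logits:   [|V|] logits predicted non-autoregressively from h_z.
    response_ids: [T] gold response token ids."""
    # Token-level NLL for one sampled z (Eq. 2).
    nll = F.cross_entropy(step_logits, response_ids, reduction="sum")
    # Bag-of-words loss (Eq. 4): predict every response word from h_z,
    # ignoring word order.
    log_f = F.log_softmax(bow_logits, dim=-1)
    bow = -log_f[response_ids].sum()
    # Integrated fine-grained generation loss (Eq. 6).
    return nll + bow

# Toy example: vocabulary of 10 words, 4-token response.
T, V = 4, 10
loss = fine_grained_loss(torch.randn(T, V), torch.randn(V), torch.randint(0, V, (T,)))
print(loss.item())
```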

2.2.3 Response Coherence Estimation

Recently, a strategy of first generating multiple candidate responses and then ranking them with a score function has been shown effective in boosting response quality. The definitions of this score function can be divided into three categories. Firstly, the length-average log-likelihood is employed in Meena Adiwardana et al. (2020), which considers the forward generation probability $p(r \mid c)$ of the response $r$ given the context $c$. Secondly, the maximum mutual information is utilized in DialoGPT Zhang et al. (2020), which considers the backward probability $p(c \mid r)$ to recover the context given the candidate response. Thirdly, a discriminative function $p(l_r \mid c, r)$ is used in PLATO Bao et al. (2020) for coherence estimation between the context $c$ and the response $r$, where $l_r$ is the coherence label. Given that the forward score favors safe responses and the backward score produces repetitive conversations, PLATO-2 adopts bi-directional coherence estimation in the evaluation model.

The loss of response coherence estimation (RCE) is defined as follows:

$$\mathcal{L}_{\mathrm{RCE}}^{\mathrm{Evaluation}} = -\log p(l_r = 1 \mid c, r) - \log p(l_{r^-} = 0 \mid c, r^-) \qquad (7)$$

The positive training samples come from the dialogue context $c$ and the corresponding target response $r$, with coherence label $l_r = 1$. The negative samples are created by randomly selecting responses $r^-$ from the corpus, with coherence label $l_{r^-} = 0$.
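A minimal sketch of how such positive and negative training pairs could be assembled is given below; the function name and data layout are illustrative, and the real pipeline operates on the filtered social media corpora.

```python
import random

def build_coherence_samples(corpus):
    """corpus: list of (context, response) pairs.
    Returns (context, response, label) triples: the gold response is labeled 1,
    and a response drawn at random from the corpus is labeled 0."""
    samples = []
    for context, response in corpus:
        samples.append((context, response, 1))   # coherent (positive)
        _, negative = random.choice(corpus)      # randomly selected response
        samples.append((context, negative, 0))   # incoherent (negative)
        # A real pipeline would re-draw if the random response equals the gold one.
    return samples

corpus = [("How was your weekend?", "Great, I went hiking."),
          ("Any movie suggestions?", "Try the new sci-fi one.")]
print(build_coherence_samples(corpus))
```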

To maintain the capacity of distributed representation, the task of masked language modeling (MLM) Devlin et al. (2019) is also included in the evaluation network. Within this task, 15% of the input tokens are masked at random and the network needs to recover them. The MLM loss is defined as:

$$\mathcal{L}_{\mathrm{MLM}}^{\mathrm{Evaluation}} = -\,\mathbb{E}\sum_{m \in M}\log p(x_m \mid x_{\setminus M}) \qquad (8)$$

where $x$ refers to the input tokens of the context and response, $\{x_m\}_{m \in M}$ stands for the masked tokens, and $x_{\setminus M}$ denotes the rest of the unmasked tokens.
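The random masking can be sketched as follows; the mask id and the way targets are returned are illustrative choices, not details from the released code.

```python
import random

def mask_tokens(token_ids, mask_id, mask_prob=0.15):
    """Replace roughly 15% of the input tokens with the special mask token;
    return the corrupted sequence and the {position: original id} targets
    that the network must recover."""
    corrupted, targets = list(token_ids), {}
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            corrupted[i] = mask_id
            targets[i] = tok
    return corrupted, targets

tokens = [12, 7, 45, 3, 99, 28, 5]
print(mask_tokens(tokens, mask_id=0))
```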

To sum up, the objective of the evaluation model is to minimize the following integrated loss:

$$\mathcal{L}^{\mathrm{Evaluation}} = \mathcal{L}_{\mathrm{RCE}}^{\mathrm{Evaluation}} + \mathcal{L}_{\mathrm{MLM}}^{\mathrm{Evaluation}} \qquad (9)$$

3 Experiments

3.1 Training Data

PLATO-2 has English and Chinese models, with training data extracted from open-domain social media conversations. The details are elaborated as follows.

3.1.1 English Data

The English training data is extracted from Reddit comments, which are collected by a third party and made publicly available at pushshift.io Baumgartner et al. (2020). As the comments are organized in message trees, any conversation path from the root to a tree node can be treated as one training sample, with the node as the response and its preceding turns as the context. To improve generation quality, we carry out elaborate data cleaning. A message node and its sub-trees are removed if any of the following conditions is met (a simplified sketch of these rules as a filter function follows the list).

  1. The number of BPE tokens is more than 128 or less than 2.
  2. Any word has more than 30 characters, or the message has more than 1024 characters.
  3. The percentage of alphabetic characters is less than 70%.
  4. The message contains a URL.
  5. The message contains special strings, such as r/, u/, &amp.
  6. The message has a high overlap with its parent's text.
  7. The message is repeated more than 100 times.
  8. The message contains offensive words.
  9. The subreddit is quarantined.
  10. The author is a known bot.
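Below is a simplified sketch of these rules as a single filter function. The thresholds are copied from the list above; the parent-overlap, quarantined-subreddit and known-bot checks require external metadata and are omitted or reduced to arguments, and the offensive-word list is a placeholder.

```python
import re

OFFENSIVE = set()  # placeholder for a real offensive-word list

def should_remove(message: str, bpe_len: int, repeat_count: int) -> bool:
    """Return True if the message (and hence its sub-trees) should be dropped."""
    words = message.split()
    if bpe_len > 128 or bpe_len < 2:                              # rule 1
        return True
    if any(len(w) > 30 for w in words) or len(message) > 1024:    # rule 2
        return True
    alpha_ratio = sum(ch.isalpha() for ch in message) / max(len(message), 1)
    if alpha_ratio < 0.70:                                        # rule 3
        return True
    if re.search(r"https?://|www\.", message):                    # rule 4
        return True
    if any(s in message for s in ("r/", "u/", "&amp")):           # rule 5
        return True
    if repeat_count > 100:                                        # rule 7
        return True
    if any(w.lower() in OFFENSIVE for w in words):                # rule 8
        return True
    return False
```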

After filtering, the data is split into training and validation sets in chronological order. The training set contains 684M (context, response) samples, ranging from December 2005 to July 2019. For the validation set, 0.2M samples are selected from the remaining data after July 2019. The English vocabulary contains 8K BPE tokens Sennrich et al. (2016), constructed with the SentencePiece library.

3.1.2 Chinese Data

The Chinese training data is collected from public social media, followed by a similar cleaning process. After filtering, there are 1.2B (context, response) samples in the training set and 0.1M samples in the validation set. The Chinese vocabulary contains 30K BPE tokens.

3.2 Training Details

PLATO-2 has two model sizes: a small version with 310M parameters and a standard version with 1.6B parameters. The 310M parameter model has 24 transformer blocks and 16 attention heads, with an embedding dimension of 1024. The 1.6B parameter model has 32 transformer blocks and 32 attention heads, with an embedding dimension of 2048.

The hyper-parameters used in the training process are as follows. The maximum sequence lengths of context and response are both set to 128. We use Adam Kingma and Ba (2015) as the optimizer, with a learning rate scheduler consisting of a linear warmup followed by an invsqrt decay Vaswani et al. (2017). To train the large-scale model with a relatively large batch size, we employ gradient checkpointing Chen et al. (2016) to trade computation for memory. Detailed configurations are summarized in Table 1. The training was carried out on 64 Nvidia Tesla V100 32GB GPU cards. It takes about 3 weeks for the 1.6B parameter model to complete the curriculum learning process.
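For illustration, the learning rate schedule (linear warmup followed by inverse-square-root decay) can be sketched as below; the peak learning rate and warmup steps here are placeholders, and the actual values are those listed in Table 1.

```python
def lr_schedule(step: int, peak_lr: float = 1e-3, warmup_steps: int = 4000) -> float:
    """Linear warmup to peak_lr, then inverse-square-root (invsqrt) decay,
    following Vaswani et al. (2017)."""
    step = max(step, 1)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (warmup_steps / step) ** 0.5

for s in (1, 2000, 4000, 16000, 64000):
    print(s, round(lr_schedule(s), 6))
```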

3.3 Evaluation Settings

3.3.1 Compared Methods

The following methods have been compared in the experiments.

  • DialoGPT Zhang et al. (2020) is trained on the basis of GPT-2 Radford et al. (2019) using Reddit comments. There are three model sizes: 117M, 345M and 762M. Since the 345M parameter model obtains the best performance in their evaluations, we compare with it in the experiments.

  • Blender Roller et al. (2020) is first trained using Reddit comments and then fine-tuned with human annotated conversations, to emphasize desirable conversational skills of engagingness, knowledge, empathy and personality. Four datasets are used during fine-tuning: ConvAI2 Zhang et al. (2018); Dinan et al. (2020), Empathetic Dialogues Rashkin et al. (2019), Wizard of Wikipedia Dinan et al. (2019) and Blended Skill Talk Smith et al. (2020). These annotated conversations are referred to as BST for short. Blender has three model sizes: 90M, 2.7B and 9.4B. Since the 2.7B parameter model obtains the best performance in their evaluations, we compare with it in the experiments.

  • Meena Adiwardana et al. (2020) is an open-domain chatbot trained with social media conversations. Meena has 2.6B model parameters, similar to Blender. Given that Meena has neither released its model nor provided a service interface, it is difficult to perform a comprehensive comparison. In the experiments, we include the samples provided in their paper for static evaluation.

  • Microsoft XiaoIce Zhou et al. (2020) is a popular social chatbot in Chinese. In the experiments, we use the official Weibo platform to chat with XiaoIce.

For the sake of comprehensive and fair comparisons, three versions of PLATO-2 are included in the experiments.

  • PLATO 1.6B parameter model is the standard version in English, which is first trained using Reddit comments and then fine-tuned with BST conversations. This model will be compared to the state-of-the-art open-domain chatbot Blender, to measure the effectiveness of PLATO-2.

  • PLATO 310M parameter model is a small version in English, which is trained with Reddit comments. This model will be compared to DialoGPT, as they have similar model scales and training data.

  • PLATO 333M parameter Chinese model will be compared to XiaoIce in the experiments. (This model has 24 transformer blocks and 16 attention heads, with an embedding dimension of 1024. As the Chinese vocabulary contains 30K BPE tokens, it has 23M more parameters than the English small model.)

Table 1: Training configurations of PLATO-2.
Table 2: Self-chat evaluation results, with best value written in bold.
Table 3: Chinese interactive evaluation results, with best value written in bold.
Figure 3: Self-chat examples by Blender and PLATO-2.

3.3.2 Evaluation Metrics

We carry out both automatic and human evaluations in the experiments. In automatic evaluation, to assess the model’s capacity on lexical diversity, we use the corpus-level metric of distinct-1/2 Li et al. (2016a), which is defined as the number of distinct uni- or bi-grams divided by the total number of generated words.
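For reference, the metric can be computed as in the sketch below; whitespace tokenization is used here only for simplicity.

```python
def distinct_n(responses, n):
    """Corpus-level distinct-n: number of distinct n-grams divided by the
    total number of generated words (Li et al., 2016a)."""
    ngrams, total_words = set(), 0
    for response in responses:
        tokens = response.split()
        total_words += len(tokens)
        for i in range(len(tokens) - n + 1):
            ngrams.add(tuple(tokens[i:i + n]))
    return len(ngrams) / max(total_words, 1)

responses = ["i like hiking a lot", "i like reading books"]
print(distinct_n(responses, 1), distinct_n(responses, 2))
```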

In human evaluation, we employ four utterance-level and dialogue-level metrics: coherence, informativeness, engagingness and humanness. Three crowd-sourcing workers are asked to score the response/dialogue quality on a scale of [0, 1, 2], with the final score determined through majority voting. The higher the score, the better. These criteria are discussed as follows, with scoring details provided in the Appendix.

  • Coherence is an utterance-level metric, measuring whether the response is relevant and consistent with the context.

  • Informativeness is also an utterance-level metric, evaluating whether the response is informative or not given the context.

  • Engagingness is a dialogue-level metric, assessing whether the annotator would like to talk with the speaker for a long conversation.

  • Humanness is also a dialogue-level metric, judging whether the speaker is a human being or not.

3.4 Experimental Results

In the experiments, we include both static and interactive evaluations.

3.4.1 Self-Chat Evaluation

Self-chats have been widely used in the evaluation of dialogue systems Li et al. (2016b); Bao et al. (2019); Roller et al. (2020), where a model plays the roles of both partners in the conversation. As compared with human-bot conversations, self-chat logs can be collected efficiently and at lower cost. As reported in Li et al. (2019), self-chat evaluations exhibit high agreement with human-bot chat evaluations. In the experiments, we ask the bot to perform self-chats and then invite crowd-sourcing workers to evaluate the dialogue quality.

The way to start the interactive conversation needs special attention. As pointed out by Roller et al. (2020), if a conversation starts with 'Hi!', the partners tend to greet each other and cover only shallow topics in a short conversation. Therefore, to expose the model's weaknesses and explore its limits, we choose to start the interactive conversation with pre-selected topics. We use the classical 200 questions Vinyals and Le (2015) as start topics and ask the bot to perform self-chats given the context. There are 10 utterances in each dialogue, including the input start utterance. We carry out automatic evaluation on the 200 self-chat logs and randomly select 50 of them for human evaluation.
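Procedurally, the self-chat collection can be sketched as follows, where `bot` stands for any callable that maps the dialogue history to the next utterance; it is a placeholder, not an interface from the released code.

```python
def self_chat(bot, start_topic, num_utterances=10):
    """Let one model play both partners: starting from a pre-selected topic,
    repeatedly append the model's reply until the dialogue contains the
    required number of utterances (including the start utterance)."""
    dialogue = [start_topic]
    while len(dialogue) < num_utterances:
        dialogue.append(bot(dialogue))
    return dialogue
```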

The compared models are divided into two groups. The first group includes the DialoGPT 345M model and the PLATO 310M model; both are trained using Reddit comments and have similar model scales. The second group includes the Blender 2.7B model and the PLATO 1.6B model; both are trained using Reddit comments and further fine-tuned with BST conversations. In human evaluation, two self-chat logs from the same group with the same start topic are displayed to three crowd-sourcing workers. One example is given in Figure 3. As suggested in ACUTE-Eval Li et al. (2019), we ask the crowd-sourcing workers to pay attention to only one speaker within a dialogue. In the evaluation, they give scores on coherence and informativeness for each of P1's utterances, and assess P1's overall quality on engagingness and humanness. The self-chat evaluation results are summarized in Table 2. These results demonstrate that the PLATO 1.6B model obtains the best performance across human and automatic evaluations. The gap between Blender and PLATO-2 on the corpus-level metric distinct-1/2 suggests that PLATO-2 has a better capacity for lexical diversity. In addition, the difference between the two groups indicates that enlarging the model scale and exploiting human annotated conversations help improve dialogue quality.

3.4.2 Human-Bot Chat Evaluation

In the Chinese evaluation, it is difficult to carry out self-chats with Microsoft XiaoIce, as there is no publicly available API. Therefore, we collect human-bot conversations through its official Weibo platform. The interactive conversation also starts with a pre-selected topic and continues for 7-14 rounds, where 50 diverse topics are extracted from the high-frequency topics of a commercial chatbot, including travel, movies, hobbies and so on. The collected human-bot conversations are distributed to crowd-sourcing workers for evaluation. The human and automatic evaluation results are summarized in Table 3. XiaoIce obtains higher distinct values, possibly because it uses a retrieval-based strategy in response generation. The human evaluations demonstrate that our PLATO-2 model achieves significant improvements over XiaoIce across all the human evaluation metrics.

Table 4: Static evaluation results, with best value written in bold.
Figure 4: Human-bot chat examples by Microsoft XiaoIce and PLATO-2.

3.4.3 Static Evaluation

Besides the interactive evaluation, we also include static evaluation to analyze the models' performance. In static evaluation, each model produces a response towards a given multi-turn context. To compare with Meena, we include the 60 static samples provided in their paper's Appendix and generate corresponding responses with the other models. We also include 60 test samples about daily life from Daily Dialog Li et al. (2017) and 60 test samples about in-depth discussion from Reddit. Given that the measurement of humanness usually needs multi-turn interaction, this metric is excluded from static evaluation. The evaluation results are summarized in Table 4. It can be observed that PLATO-2 is able to produce coherent, informative and engaging responses under different chat scenarios.

3.5 Discussions

3.5.1 Case Analysis

To further analyze the models' characteristics, two self-chat examples of Blender and PLATO-2 are provided in Figure 3. Although both models are able to produce high-quality engaging conversations, they exhibit distinct discourse styles. Blender tends to switch topics quickly within a short conversation, covering alcohol, hobbies, movies and work. The emergence of this style might be related to the BST fine-tuning data. For instance, ConvAI2 involves the exchange of personal information between two partners, where topics need to switch quickly so the partners can learn more about each other. Due to the task settings of data collection, some human annotated conversations might be a little unnatural. Nevertheless, fine-tuning with BST conversations is essential to mitigate undesirable toxic traits of large corpora and emphasize desirable skills of human conversations.

In contrast to Blender, PLATO-2 can stick to the start topic and conduct in-depth discussions. The reasons might be two-fold. On the one hand, our model is able to generate diverse and informative responses thanks to the accurate modeling of the one-to-many relationship. On the other hand, the evaluation model helps select coherent responses that stay on the current topic. We also asked crowd-sourcing workers to compare the responses generated by these two models via pairwise ranking. The comparison result is shown in Table 5, which also supports our analysis of the discourse styles.

Besides, we also provide two human-bot chat examples of XiaoIce and PLATO-2 in Figure 4, with the original interactive logs shown on the left and the translated logs on the right. It can be observed that some responses produced by XiaoIce are not coherent with the context and there are some abrupt topic changes. By contrast, the interaction with PLATO-2 is more coherent and engaging.

Table 5: Thoroughness with regard to the start topic.
Table 6: Comparison of different score functions in response selection, with best value written in bold.

3.5.2 Response Selection Comparison

We carry out further experiments to compare the performance of different score functions in response selection. Firstly, a Chinese response selection dataset is constructed: 100 dialogue contexts are selected from the test set, and 10 candidate responses are retrieved for each context with a commercial chatbot. Secondly, crowd-sourcing workers annotate whether each candidate response is coherent with its context. Thirdly, we train three 333M parameter models as score functions: the forward response generation probability $p(r \mid c)$, the backward context recovery probability $p(c \mid r)$ and the bi-directional coherence probability $p(l_r = 1 \mid c, r)$. Their results on the annotated response selection dataset are summarized in Table 6, using the metrics of mean average precision (MAP) Baeza-Yates et al. (1999), mean reciprocal rank (MRR) Voorhees and others (1999) and precision at position 1 (P@1). These results indicate that PLATO-2's evaluation model is better at selecting appropriate responses.
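As a sketch of this evaluation, MRR and P@1 can be computed as below, assuming every annotated context has at least one coherent candidate; the data layout and function names are illustrative.

```python
def rank_metrics(score_fn, eval_set):
    """score_fn(context, response) -> float, higher meaning more appropriate.
    eval_set: list of (context, [(response, label), ...]) with label 1 for
    coherent candidates and 0 otherwise. Returns (MRR, P@1)."""
    mrr, p_at_1 = 0.0, 0.0
    for context, candidates in eval_set:
        ranked = sorted(candidates, key=lambda rl: score_fn(context, rl[0]), reverse=True)
        p_at_1 += ranked[0][1]
        first_hit = next(rank for rank, (_, label) in enumerate(ranked, 1) if label == 1)
        mrr += 1.0 / first_hit
    n = len(eval_set)
    return mrr / n, p_at_1 / n
```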

4 Related Work

Related works include large-scale language models and open-domain dialogue generation.

Large-scale Language Models. Pre-trained large-scale language models have brought many breakthroughs on various NLP tasks. GPT Radford et al. (2018) and BERT Devlin et al. (2019) are representative uni-directional and bi-directional language models, trained on general text corpora. By introducing pre-normalization and modifying the weight initialization, GPT-2 Radford et al. (2019) successfully extends the model scale from 117M to 1.5B parameters. To cope with memory constraints, Megatron-LM Shoeybi et al. (2019) exploits model parallelism to train an 8.3B parameter model on 512 GPUs. GPT-3 Brown et al. (2020) further trains a 175B parameter autoregressive language model, demonstrating strong performance on many NLP tasks. The development of large-scale language models also benefits the task of dialogue generation.

Open-domain Dialogue Generation. On the basis of GPT-2, DialoGPT Zhang et al. (2020) is trained for response generation using Reddit comments. To obtain a human-like open-domain chatbot, Meena Adiwardana et al. (2020) scales up the network parameters to 2.6B and utilizes more social media conversations in the training process. To emphasize desirable conversational skills of engagingness, knowledge, empathy and personality, Blender Roller et al. (2020) further fine-tunes the pre-trained model with human annotated conversations. Besides the above attempts on model scale and data selection, PLATO Bao et al. (2020) introduces a discrete latent variable to tackle the inherent one-to-many mapping problem and improve response quality. In this work, we further scale up PLATO to PLATO-2 and discuss its effective training via curriculum learning.

5 Conclusion

In this work, we discuss the effective training of open-domain chatbot PLATO-2 via curriculum learning, where two stages are involved. In the first stage, one coarse-grained model is trained for general response generation. In the second stage, two models of fine-grained generation and evaluation are trained for diverse response generation and response coherence estimation. Experimental results demonstrate that PLATO-2 achieves substantial improvements over the state-of-the-art methods in both Chinese and English evaluations.

Acknowledgments

We would like to thank Jingzhou He and Tingting Li for the help with resource coordination; Daxiang Dong and Pingshuo Ma for the support on PaddlePaddle; and Yu Sun, Yukun Li, and Han Zhang for the assistance with infrastructure and implementation. This work was supported by the National Key Research and Development Project of China (No. 2018AAA0101900).

References

  • D. Adiwardana, M. Luong, D. R. So, J. Hall, N. Fiedel, R. Thoppilan, Z. Yang, A. Kulshreshtha, G. Nemade, Y. Lu, et al. (2020) Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977. Cited by: §1, §1, §2.2.3, 3rd item, §4.
  • R. Baeza-Yates, B. Ribeiro-Neto, et al. (1999) Modern information retrieval. Vol. 463, ACM press New York. Cited by: §3.5.2.
  • S. Bao, H. He, F. Wang, R. Lian, and H. Wu (2019) Know more about each other: evolving dialogue strategy via compound assessment. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5382–5391. Cited by: §3.4.1, §4.
  • S. Bao, H. He, F. Wang, H. Wu, and H. Wang (2020) PLATO: pre-trained dialogue generation model with discrete latent variable. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 85–96. Cited by: §1, §1, §2.1, §2.2.2, §2.2.3.
  • J. Baumgartner, S. Zannettou, B. Keegan, M. Squire, and J. Blackburn (2020) The pushshift reddit dataset. In Proceedings of the International AAAI Conference on Web and Social Media, Vol. 14, pp. 830–839. Cited by: §3.1.1.
  • Y. Bengio, J. Louradour, R. Collobert, and J. Weston (2009) Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41–48. Cited by: §1.
  • T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. (2020) Language models are few-shot learners. arXiv preprint arXiv:2005.14165. Cited by: §4.
  • C. Chen, J. Peng, F. Wang, J. Xu, and H. Wu (2019) Generating multiple diverse responses with multi-mapping and posterior mapping selection. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp. 4918–4924. Cited by: §1.
  • T. Chen, B. Xu, C. Zhang, and C. Guestrin (2016) Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174. Cited by: §3.2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171–4186. Cited by: §1, §2.1, §2.2.3, §4.
  • E. Dinan, V. Logacheva, V. Malykh, A. Miller, K. Shuster, J. Urbanek, D. Kiela, A. Szlam, I. Serban, R. Lowe, et al. (2020) The second conversational intelligence challenge (convai2). In The NeurIPS’18 Competition, pp. 187–208. Cited by: 2nd item.
  • E. Dinan, S. Roller, K. Shuster, A. Fan, M. Auli, and J. Weston (2019) Wizard of wikipedia: knowledge-powered conversational agents. International Conference on Learning Representations. Cited by: 2nd item.
  • L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y. Wang, J. Gao, M. Zhou, and H. Hon (2019) Unified language model pre-training for natural language understanding and generation. arXiv preprint arXiv:1905.03197. Cited by: §2.1.
  • D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations, Cited by: §3.2.
  • J. Li, M. Galley, C. Brockett, J. Gao, and B. Dolan (2016a) A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 110–119. Cited by: §3.3.2.
  • J. Li, W. Monroe, A. Ritter, D. Jurafsky, M. Galley, and J. Gao (2016b) Deep reinforcement learning for dialogue generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1192–1202. Cited by: §3.4.1.
  • M. Li, J. Weston, and S. Roller (2019) Acute-eval: improved dialogue evaluation with optimized questions and multi-turn comparisons. arXiv preprint arXiv:1909.03087. Cited by: §3.4.1.
  • Y. Li, H. Su, X. Shen, W. Li, Z. Cao, and S. Niu (2017) DailyDialog: a manually labelled multi-turn dialogue dataset. In Proceedings of the 8th International Joint Conference on Natural Language Processing, pp. 986–995. Cited by: §3.4.3.
  • A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training. Technical report, OpenAI. Cited by: §4.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. Technical report, OpenAI. Cited by: §1, §2.1, 1st item, §4.
  • H. Rashkin, E. M. Smith, M. Li, and Y. Boureau (2019) Towards empathetic open-domain conversation models: a new benchmark and dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5370–5381. Cited by: 2nd item.
  • S. Roller, E. Dinan, N. Goyal, D. Ju, M. Williamson, Y. Liu, J. Xu, M. Ott, K. Shuster, E. M. Smith, et al. (2020) Recipes for building an open-domain chatbot. arXiv preprint arXiv:2004.13637. Cited by: §1, 2nd item, §3.4.1, §4.
  • R. Sennrich, B. Haddow, and A. Birch (2016) Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pp. 1715–1725. Cited by: §3.1.1.
  • M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro (2019) Megatron-lm: training multi-billion parameter language models using gpu model parallelism. arXiv preprint arXiv:1909.08053. Cited by: §2.1, §4.
  • E. M. Smith, M. Williamson, K. Shuster, J. Weston, and Y. Boureau (2020) Can you put it all together: evaluating conversational agents’ ability to blend skills. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 2021–2030. Cited by: 2nd item.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008. Cited by: §1, §3.2.
  • O. Vinyals and Q. Le (2015) A neural conversational model. arXiv preprint arXiv:1506.05869. Cited by: §3.4.1.
  • E. M. Voorhees et al. (1999) The trec-8 question answering track report. In Trec, Vol. 99, pp. 77–82. Cited by: §3.5.2.
  • S. Zhang, E. Dinan, J. Urbanek, A. Szlam, D. Kiela, and J. Weston (2018) Personalizing dialogue agents: i have a dog, do you have pets too?. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pp. 2204–2213. Cited by: 2nd item.
  • Y. Zhang, S. Sun, M. Galley, Y. Chen, C. Brockett, X. Gao, J. Gao, J. Liu, and B. Dolan (2020) DialoGPT: large-scale generative pre-training for conversational response generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 270–278. Cited by: §1, §1, §2.2.3, 1st item, §4.
  • T. Zhao, R. Zhao, and M. Eskenazi (2017) Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pp. 654–664. Cited by: §1, §2.2.2.
  • L. Zhou, J. Gao, D. Li, and H. Shum (2020) The design and implementation of xiaoice, an empathetic social chatbot. Computational Linguistics 46 (1), pp. 53–93. Cited by: 4th item.

Appendix A Scoring Criteria

Table 7: Score details of four metrics in human evaluation.