PLATO-XL: Exploring the Large-scale Pre-training of Dialogue Generation

09/20/2021 · by Siqi Bao, et al. · Baidu, Inc.

To explore the limit of dialogue generation pre-training, we present the models of PLATO-XL with up to 11 billion parameters, trained on both Chinese and English social media conversations. To train such large models, we adopt the architecture of the unified transformer with high computation and parameter efficiency. In addition, we carry out multi-party aware pre-training to better distinguish the characteristic information in social media conversations. With such designs, PLATO-XL successfully achieves superior performance as compared to other approaches in both Chinese and English chitchat. We further explore the capacity of PLATO-XL on other conversational tasks, such as knowledge grounded dialogue and task-oriented conversation. The experimental results indicate that PLATO-XL obtains state-of-the-art results across multiple conversational tasks, verifying its potential as a foundation model of conversational AI.


1 Introduction

The efficacy of the pre-training paradigm, where large-scale transformer models are trained with massive plain texts, has been widely recognized in natural language processing (Devlin et al., 2019; Radford et al., 2018). To further boost the performance of these language models, there is a trend to enlarge the model size, dataset size, and the amount of compute used for training (Raffel et al., 2020; Kaplan et al., 2020). Particularly, the GPT-3 model with 175 billion parameters demonstrates strong zero-shot and few-shot learning capacities without task-specific fine-tuning on downstream tasks (Brown et al., 2020).

Distinct from general language models, dialogue generation models are usually pre-trained with human-like conversations collected from social media. DialoGPT (Zhang et al., 2020b) attempts to train dialogue models with Reddit comments on the basis of pre-trained language models. More recently developed models, like Meena (Adiwardana et al., 2020), Blender (Roller et al., 2021), and PLATO-2 (Bao et al., 2021), achieve substantial performance improvements on multi-turn conversations. These models have been scaled up to billions of parameters and take advantage of much larger volumes of social media conversations for pre-training. Nevertheless, in dialogue generation, there is still no clear conclusion about the correlation between model scale and conversation quality. For instance, DialoGPT has three model sizes: 117M, 345M and 762M, where the 345M one obtains the best performance in their evaluations. Meanwhile, the human evaluations of Blender reveal that the 2.7B model achieves better performance than the one with 9.4B parameters.

In this paper, we argue that the conversation quality may keep benefiting from an enlarged model scale with appropriate pre-training designs. To this end, we explore the large-scale pre-training of dialogue generation models with up to 11B parameters, namely PLATO-XL. To train such a large model, we adopt the architecture of the unified transformer with high computation and parameter efficiency. In addition, we carry out multi-party aware pre-training to better distinguish the characteristic information in social media conversations. With such designs, PLATO-XL achieves superior performance as compared to other approaches in both Chinese and English chitchat. More specifically, PLATO-XL shows a strong capability to absorb common knowledge within its huge number of parameters; therefore, it significantly alleviates the well-known hallucination problem, where generation models tend to produce plausible statements with factual errors (Marcus, 2020; Roller et al., 2021). This problem can be alleviated by expanding the model parameters (Roberts et al., 2020) or incorporating external non-parametric memories (Lewis et al., 2020). Besides, thanks to the multi-party aware pre-training, PLATO-XL effectively reduces the inconsistency phenomenon in multi-turn conversations.

In addition to the open-domain chitchat discussed above, there are two other common conversational tasks (Gao et al., 2018): knowledge grounded dialogue and task-oriented conversation. In the experiments, we also explore the ability of PLATO-XL as a foundation model of conversational AI. Our experimental results indicate that PLATO-XL is able to outperform other dialogue generation models across multiple conversational tasks. We will release our source code together with the English model on GitHub (expected at https://github.com/PaddlePaddle/Knover/tree/develop/projects/PLATO-XL before the end of November 2021), hoping to facilitate frontier research in dialogue generation.

2 Related Work

2.1 Large-scale Pre-trained Language Models

The pre-training paradigm has brought substantial performance improvements in natural language processing, where large-scale transformer models are pre-trained with massive plain texts. BERT (Devlin et al., 2019) learns to capture deep bi-directional representations of the input context and achieves remarkable breakthroughs in natural language understanding. GPT (Radford et al., 2018) and GPT-2 (Radford et al., 2019) are typical models in natural language generation, which extract uni-directional representations and perform auto-regressive generation. To further boost the performance of language models, there is a trend to enlarge the model size, dataset size, and the amount of compute used for training (Raffel et al., 2020; Kaplan et al., 2020). Particularly, GPT-3 (Brown et al., 2020) scales up to 175 billion parameters and demonstrates superior performance in zero- and few-shot settings.

Besides the above English models, there are some large-scale Chinese language models. CPM (Zhang et al., 2020c) adopts a model architecture similar to GPT, with 2.6 billion parameters. CPM-2 (Zhang et al., 2021) scales up to 11 billion parameters and employs knowledge inheritance from existing models to accelerate the pre-training process. PanGu-α (Zeng et al., 2021) is a huge model with up to 200 billion parameters; its training is carried out on a cluster of 2048 Ascend 910 AI processors with multi-dimensional parallelism and topology-aware scheduling. ERNIE 3.0 (Sun et al., 2021) proposes a unified framework that integrates both auto-encoding and auto-regressive networks, where knowledge graphs are also encoded into pre-training for enhanced representation. Empirical results show that this 10 billion parameter model achieves superior performance on 54 Chinese NLP tasks.

2.2 Pre-trained Dialogue Models

Unlike the plain texts used for general language models, for dialogue generation pre-training, human-like conversations are collected from social media, such as Twitter, Reddit, Sina Weibo, Baidu Tieba, etc. DialoGPT (Zhang et al., 2020b) attempts to train dialogue models with Reddit comments on the basis of pre-trained language models. Meena (Adiwardana et al., 2020) carries out the pre-training of dialogue generation directly with more social media conversations, and this 2.6 billion parameter model achieves significant improvements in multi-turn conversation quality. Blender (Roller et al., 2021) proposes to fine-tune the pre-trained dialogue model with human annotated datasets to emphasize the conversational skills of engagingness, knowledge, empathy, and personality. In addition, to mitigate the safe response problem, PLATO (Bao et al., 2020) and PLATO-2 (Bao et al., 2021) propose to encode a discrete latent variable into the transformer for diverse response generation. The DSTC9 challenge (Gunasekara et al., 2020) reveals that the 1.6 billion parameter PLATO-2 obtains superior performance on multiple conversational tasks.

Besides the English version, PLATO-2 has a Chinese dialogue model with 336 million parameters, exhibiting prominent improvements over the classical chatbot XiaoIce (Zhou et al., 2020). There are some other Chinese dialogue models of a similarly modest scale, including CDial-GPT (Wang et al., 2020) and ProphetNet-X (Qi et al., 2021). Recently, the Chinese dialogue model EVA (Zhou et al., 2021) was developed with a Seq2Seq architecture and up to 2.8 billion parameters. In this paper, we introduce the 11 billion parameter model of PLATO-XL, trained on both Chinese and English social media conversations.

3 PLATO-XL

3.1 Network Overview

The network overview of PLATO-XL is shown in Figure 1, with transformer blocks as the backbone. For the sake of efficient training at a large scale, PLATO-XL keeps the adoption of the unified transformer (also known as prefix LM) instead of the typical encoder-decoder architecture for dialogue generation (Bao et al., 2020, 2021). The advantages brought by the unified transformer architecture are two-fold: computation efficiency and parameter efficiency. Firstly, given conversation samples of variable lengths, it is necessary to pad them to a certain length during training, which inevitably incurs massive invalid computation. As suggested in fairseq (Ott et al., 2019), the amount of padding can be minimized by grouping inputs with similar lengths. By sorting the concatenated context-response inputs by length, invalid computation caused by padding can be reduced significantly with the unified transformer. Secondly, through the flexible mechanism of the self-attention mask, the two tasks of dialogue context understanding and response generation are modeled simultaneously with shared parameters. As such, the unified transformer is more parameter-efficient than the encoder-decoder network (Bao et al., 2021; Du et al., 2021).
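To make the masking scheme concrete, the following NumPy sketch constructs the kind of self-attention mask a unified transformer (prefix LM) relies on: bidirectional attention within the dialogue context and causal attention over the target response. It is a minimal illustration under our own naming assumptions, not the released PaddlePaddle implementation.

```python
import numpy as np

def prefix_lm_attention_mask(context_len: int, response_len: int) -> np.ndarray:
    """Boolean mask where mask[i, j] = True means token i may attend to token j."""
    total = context_len + response_len
    mask = np.zeros((total, total), dtype=bool)
    # Every token (context or response) attends to the full context bidirectionally.
    mask[:, :context_len] = True
    # Response tokens additionally attend to earlier response tokens (causal part).
    mask[context_len:, context_len:] = np.tril(
        np.ones((response_len, response_len), dtype=bool))
    return mask

# Example: a 4-token dialogue context followed by a 3-token target response.
print(prefix_lm_attention_mask(4, 3).astype(int))
```

With this mask, context understanding (bidirectional) and response generation (left-to-right) share one set of transformer parameters, which is the source of the parameter efficiency described above.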

Figure 1: Network overview of PLATO-XL.

In PLATO-XL, the pre-training objective is to minimize the negative log-likelihood (NLL) loss:

\mathcal{L}_{\mathrm{NLL}} = -\,\mathbb{E}_{(c,\,r)\sim\mathcal{D}} \sum_{t=1}^{T} \log p_\theta\left(r_t \mid c, r_{<t}\right) \qquad (1)

where \theta refers to the trainable parameters of the dialogue generation model and \mathcal{D} stands for the pre-training data. The input to the network is a pair of dialogue context c and target response r = (r_1, \dots, r_T); T is the length of the target response and r_{<t} denotes the previously generated words. As shown in Figure 1, the input representation is calculated as the sum of the corresponding token, position, type and role embeddings. The token and position embeddings are commonly used in pre-training models. The type embedding is employed to differentiate the segments of dialogue context and target response, and is also extensible to other input sources, such as persona profiles or grounded knowledge used in conversations. The role embedding is used to distinguish the characters in multi-turn conversations, which is explained in detail in the next subsection.
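As a concrete illustration of this input representation, the toy NumPy sketch below sums token, position, type and role embedding tables for one (context, response) pair. The embedding dimension (3072) and maximum length (896 + 128 = 1024) follow the paper; the table sizes, random initialization and example ids are assumptions chosen only for the sketch, since the real tables are learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB, MAX_LEN, N_TYPES, N_ROLES, DIM = 8000, 1024, 2, 8, 3072
# Illustrative embedding tables (randomly initialized here; learned in the model).
tok_emb  = rng.normal(scale=0.02, size=(VOCAB, DIM))
pos_emb  = rng.normal(scale=0.02, size=(MAX_LEN, DIM))
type_emb = rng.normal(scale=0.02, size=(N_TYPES, DIM))   # 0: context, 1: response
role_emb = rng.normal(scale=0.02, size=(N_ROLES, DIM))   # E_A, E_B, E_C, ...

def input_representation(token_ids, type_ids, role_ids):
    """Sum of token, position, type and role embeddings, as in Figure 1."""
    positions = np.arange(len(token_ids))
    return (tok_emb[token_ids] + pos_emb[positions]
            + type_emb[type_ids] + role_emb[role_ids])

# Toy (context, response) pair: 5 context tokens from two speakers (A, A, B, B, A),
# followed by a 3-token target response written by speaker A.
token_ids = np.array([11, 52, 7, 99, 3, 101, 45, 2])
type_ids  = np.array([0, 0, 0, 0, 0, 1, 1, 1])
role_ids  = np.array([0, 0, 1, 1, 0, 0, 0, 0])
h = input_representation(token_ids, type_ids, role_ids)
print(h.shape)   # (8, 3072)
```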

3.2 Multi-Party Aware Pre-training

As discussed in the related work, general language models are pre-trained with massive plain texts, where each training sample is usually created by a single author or user. In comparison, dialogue models are commonly pre-trained with human-like conversations collected from public social media, and one toy example is provided in Figure 2 for illustration. Several properties of social media conversations can be observed from this example: 1) multi-level comments are appended in response to preceding messages; 2) multiple users are actively involved in the discussion. The corresponding message tree of these comments is shown on the right-hand side. The comments along the path from the root node to any tree node can be formulated as one training sample of dialogue context and target response. However, with these social media conversations, the learned models tend to mix information from multiple characters in the context and have difficulty generating consistent responses.

Figure 2: Left: one toy example to illustrate social media conversations. Right: corresponding message tree.
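As a supplement to Figure 2, the sketch below shows one straightforward way to unroll such a message tree into (context, response) training samples, one per root-to-node path. The tree fields and helper names are illustrative assumptions, not the authors' data pipeline.

```python
from typing import Dict, List, Tuple

def extract_samples(tree: Dict) -> List[Tuple[List[str], str]]:
    """tree = {"text": str, "user": str, "children": [subtrees...]} (assumed schema)."""
    samples = []

    def walk(node, path):
        path = path + [node["text"]]
        if len(path) > 1:
            # Context: all utterances above this node; response: this node's text.
            samples.append((path[:-1], path[-1]))
        for child in node.get("children", []):
            walk(child, path)

    walk(tree, [])
    return samples

toy_tree = {
    "text": "Just watched the new sci-fi movie.", "user": "u1",
    "children": [
        {"text": "Was it any good?", "user": "u2",
         "children": [{"text": "Loved it, the ending surprised me.", "user": "u1",
                       "children": []}]},
        {"text": "I prefer the book.", "user": "u3", "children": []},
    ],
}
for context, response in extract_samples(toy_tree):
    print(context, "->", response)
```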

To tackle this consistency problem, PLATO (Bao et al., 2020) first introduces the role embedding into the transformer to distinguish the characters in the dialogue context. However, there is an underlying assumption in PLATO that the conversation is carried out between two characters, with the role embedding assigned alternately. Although this is generally tenable in human-annotated conversations, things get more complicated with social media conversations. As suggested in former works on RNN-based response selection (Ouchi and Tsuboi, 2016; Zhang et al., 2018), user embedding is an effective technique for speaker and addressee identification in multi-party conversation. In PLATO-XL, we further encode a multi-party aware role embedding in the pre-training of dialogue generation. The target response and the utterances in the context written by the same user are assigned the role embedding E_A. The remaining utterances are assigned the role embeddings E_B, E_C, and so on, in a relative order according to the user ids. This multi-party aware pre-training helps the model distinguish the information in the context and maintain consistency in dialogue generation.
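The snippet below sketches one possible realization of this assignment: the author of the target response (and all of that user's context utterances) gets role id 0 for E_A, while the other users receive ids 1, 2, ... in relative order. The exact ordering convention (here, scanning the context from the most recent utterance backwards) is an assumption for illustration, not necessarily the authors' implementation.

```python
from typing import List, Tuple

def assign_role_ids(context: List[Tuple[str, str]], response_user: str) -> List[int]:
    """Multi-party aware role assignment (illustrative sketch).

    context: list of (user_id, utterance) pairs, oldest first.
    Returns one role id per context utterance; the response user maps to 0 (E_A).
    """
    mapping = {response_user: 0}
    next_id = 1
    role_ids = [0] * len(context)
    # Walk from the most recent utterance back to the oldest one.
    for i in range(len(context) - 1, -1, -1):
        user, _ = context[i]
        if user not in mapping:
            mapping[user] = next_id
            next_id += 1
        role_ids[i] = mapping[user]
    return role_ids

context = [("u7", "Anyone tried the new ramen place?"),
           ("u2", "Yes, the broth is great."),
           ("u7", "Worth the queue?"),
           ("u5", "I waited 40 minutes...")]
print(assign_role_ids(context, response_user="u2"))   # [2, 0, 2, 1]
```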

3.3 Pre-training Settings

For the pre-training corpora, the English conversation samples are extracted from Reddit comments, which are collected by a third party and made publicly available at pushshift.io (Baumgartner et al., 2020). To guarantee the data quality, we follow the same elaborate cleaning process as PLATO-2 (Bao et al., 2021). After filtering, the data is split into training and validation sets in chronological order. The training set contains 811M (context, response) samples, ranging from December 2005 to December 2019. For the validation set, 0.2M samples are selected from the remaining data after December 2019. The English vocabulary contains 8K BPE tokens (Sennrich et al., 2016), constructed with the SentencePiece library. The Chinese pre-training data is collected from publicly accessible social media. After filtering, there are 1.2B (context, response) samples in the training set. The Chinese vocabulary contains 30K BPE tokens.
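For reference, a BPE vocabulary of this kind can be built with the SentencePiece library roughly as follows. The file names and the minimal set of options are assumptions for illustration, not the authors' exact recipe.

```python
import sentencepiece as spm

# Train an 8K BPE vocabulary on a cleaned text corpus (one sample per line).
spm.SentencePieceTrainer.Train(
    "--input=reddit_cleaned.txt "
    "--model_prefix=plato_xl_en "
    "--vocab_size=8000 "
    "--model_type=bpe"
)

# Load the resulting model and tokenize a sample sentence.
sp = spm.SentencePieceProcessor()
sp.Load("plato_xl_en.model")
print(sp.EncodeAsPieces("Large-scale dialogue pre-training is fun."))
```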

PLATO-XL uses the same network architecture for the Chinese and English models, with up to 11 billion parameters. There are 72 transformer blocks and 32 attention heads, with an embedding dimension of 3072. The hidden dimension of the feedforward layer is set to 18432. Pre-normalization connections and scaled initialization (Radford et al., 2019) are adopted for the sake of stable training. The main hyper-parameters used in pre-training are as follows. The maximum sequence lengths for the dialogue context and the target response are set to 896 and 128, respectively. We use Adam (Kingma and Ba, 2015) as the optimizer, with a learning rate scheduler of linear warmup and decay. The warmup stage covers the first 200 steps and the peak learning rate is 8e-5.
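The learning-rate schedule can be reproduced conceptually as below. The peak rate and warmup length follow the values stated above; the total number of steps is our own rough estimate (150B training tokens at a 2M-token batch size corresponds to about 75K steps), not a figure reported by the authors.

```python
def lr_schedule(step: int, peak_lr: float = 8e-5,
                warmup_steps: int = 200, total_steps: int = 75_000) -> float:
    """Linear warmup to peak_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)

for s in (0, 100, 200, 40_000, 75_000):
    print(s, f"{lr_schedule(s):.2e}")
```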

The implementation of PLATO-XL is based on the PaddlePaddle platform, and the training was carried out on 256 Nvidia Tesla V100 32GB GPUs. Given the limited memory of each device, vanilla data parallelism cannot support the training of such a model with up to 11 billion parameters. As such, we adopt sharded data parallelism (Rajbhandari et al., 2020) to eliminate memory redundancies, partitioning the optimizer states, gradients and parameters across multiple devices. This kind of distributed training helps maintain a low communication volume and high computational granularity. In addition, to train the model with a relatively large batch size, we further employ gradient checkpointing (Chen et al., 2016) to trade computation for memory. In PLATO-XL, each model was trained for a total of 150B tokens, with a batch size of 2M tokens.
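The following toy sketch conveys the core idea behind sharded (ZeRO-style) data parallelism used here: instead of replicating optimizer states on every device, each rank owns the states for a subset of parameter tensors. It is a conceptual illustration of Rajbhandari et al. (2020) with made-up tensor sizes and a simple greedy balancer, not the actual distributed training code.

```python
import numpy as np

def shard_parameters(param_sizes, world_size):
    """Assign each parameter tensor to one rank so optimizer-state memory is balanced."""
    order = np.argsort(param_sizes)[::-1]          # place the largest tensors first
    loads = [0] * world_size                       # per-rank optimizer-state load
    owner = [None] * len(param_sizes)
    for idx in order:                              # greedy: give tensor to lightest rank
        rank = loads.index(min(loads))
        owner[idx] = rank
        loads[rank] += param_sizes[idx]
    return owner, loads

# Toy model with a few parameter tensors (element counts) and 4 workers.
sizes = [3072 * 3072, 3072 * 18432, 18432 * 3072, 3072, 18432]
owner, loads = shard_parameters(sizes, world_size=4)
print(owner)   # which rank keeps the Adam moments for each tensor
print(loads)   # per-rank optimizer-state load (number of elements)
```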

4 Experiments

4.1 Evaluation Settings

4.1.1 Compared Approaches

To evaluate the performance of PLATO-XL, the following English and Chinese dialogue generation models have been compared in the experiments.


  • DialoGPT (Zhang et al., 2020b) is trained on the basis of GPT-2 (Radford et al., 2019) using Reddit comments. There are three model sizes: 117M, 345M and 762M. Since the 345M parameter model obtains the best performance in their evaluations, we compare with this version.

  • Blender (Roller et al., 2021) is first trained using Reddit comments and then fine-tuned with human annotated conversations – BST (Smith et al., 2020), to help emphasize desirable conversational skills of engagingness, knowledge, empathy and personality. Blender has three model sizes: 90M, 2.7B and 9.4B. Since the 2.7B parameter model obtains the best performance in their evaluations, we compare with this version.

  • PLATO-2 (Bao et al., 2021) is trained via curriculum learning, where a coarse-grained model is first learned for general response generation and a fine-grained model is further learned for diverse response generation. The English model of PLATO-2 is pre-trained with Reddit comments and then fine-tuned with BST conversations. There are 1.6B parameters in this model. PLATO-2 also has one Chinese model of 336M parameters, trained with 1.2B social media conversation samples.

  • CDial-GPT (Wang et al., 2020) is trained on the basis of a Chinese GPT model using LCCC conversations. There are 95.5M parameters in this model.

  • ProphetNet-X (Qi et al., 2021) is a family of pre-trained models on various languages and domains. ProphetNet-X includes one Chinese dialogue generation model trained on social media conversations collected from the Douban group (https://www.douban.com/group/). There are 379M parameters in this model.

  • EVA (Zhou et al., 2021) is a 2.8B parameter Chinese dialogue generation model trained with the WDC-Dialogue dataset, which includes 1.4B conversation samples collected from social media.

In addition to the above models, PLATO-XL is also compared with the following commercial chatbots in Chinese: Microsoft XiaoIce (Zhou et al., 2020), Turing Robot (http://www.turingapi.com/), Tmall Genie (https://bot.tmall.com/), and Xiao AI (https://xiaoai.mi.com/). The official platform/API is used in the interactions with XiaoIce and Turing Robot. As there is no public API for Tmall Genie or Xiao AI, voice interactions are carried out with these smart speakers instead.

4.1.2 Evaluation Metrics

As suggested in the empirical study (Liu et al., 2016), the correlation between automatic metrics and human judgments is weak in open-domain dialogue generation. Therefore, we mainly rely on human evaluations in the experiments of open-domain conversation. Crowd-sourcing workers are asked to evaluate the conversation quality on the following aspects.


  • Coherence is an utterance-level metric, measuring whether the response is relevant and consistent with the context.

  • Informativeness is also an utterance-level metric, evaluating whether the response is informative or not given the context.

  • Engagingness is a dialogue-level metric, assessing whether the annotator would like to talk with the speaker for a long conversation.

The scale of the above metrics is [0, 1, 2]. The higher the score, the better. To further analyze the conversation quality, two more fine-grained metrics are included in the evaluation.


  • Inconsistency is one fine-grained metric for coherence evaluation, checking whether the response has conflicts with the context.

  • Hallucination is one fine-grained metric for informativeness evaluation, checking whether the response contains any factual errors.

The scale of inconsistency and hallucination is [0, 1]. The lower the score, the better. Scoring details for these metrics are provided in the Appendix.

4.2 Experimental Results

4.2.1 Self-Chat Evaluation

Self-chats have been widely used in the evaluation of dialogue systems (Li et al., 2016; Bao et al., 2019; Roller et al., 2021), where a model plays the role of both partners in the conversation. Following the experimental settings in PLATO-2, each interactive conversation is started with a randomly selected topic and the model performs self-chats for 5 rounds. Then 50 conversations are selected and distributed to crowd-sourcing workers for evaluation. Each conversation is evaluated by three annotators and the final score is determined through majority voting. The English and Chinese self-chat evaluation results are summarized in Tables 1 and 2, respectively. These results indicate that PLATO-XL is able to produce coherent, informative, and engaging conversations. Particularly, both the inconsistency and hallucination problems of dialogue generation are alleviated remarkably with PLATO-XL. As compared to other approaches, the 11B parameter model achieves superior performance in both Chinese and English chitchat.

Table 1: English self-chat evaluation results, with best value written in bold.
Table 2: Chinese self-chat evaluation results, with best value written in bold.
Table 3: Chinese human-bot chat evaluation results, with best value written in bold.
Figure 3: Cherry-picked English self-chat examples by PLATO-XL.
Figure 4: Cherry-picked Chinese human-bot chat example by PLATO-XL.

4.2.2 Human-Bot Chat Evaluation

Besides the above public models, PLATO-XL is also compared with the following commercial chatbots in Chinese: Microsoft XiaoIce, Turing Robot, Tmall Genie, and Xiao AI. As most of them do not have publicly available APIs, we ask our in-house annotation team to collect the human-bot conversations. The interactive conversation also starts with a pre-selected topic and continues for 7-14 rounds. 20 diverse topics are extracted from the high-frequency topics of a commercial chatbot, covering travel, movies, hobbies, and so on. The collected human-bot conversations are distributed to crowd-sourcing workers for evaluation. The human-bot chat evaluation results are summarized in Table 3. These results indicate that PLATO-XL achieves significant improvements over the other commercial chatbots across all the human evaluation metrics.

4.2.3 Case Analysis

To further analyze the model's features, two English self-chat examples by PLATO-XL are provided in Figure 3. These examples demonstrate that PLATO-XL is able to conduct coherent, informative, and engaging conversations. The in-depth discussions on nuclear energy and the Mariana Trench indicate that massive knowledge has been absorbed implicitly in the tremendous number of parameters. Moreover, from the self-chat example on the left-hand side, it can be observed that the model maintains the characteristics of each participant well. P2 seems like a curious learner, tending to ask a lot of questions. P1 is a knowledgeable expert, providing the answers in detail but with a little impatience. The model is capable of generating responses with good consistency in content and style, thanks to the multi-party aware pre-training.

One Chinese human-bot chat example by PLATO-XL is provided in Figure 4, with the original interactive logs shown on the left and translated logs on the right. In this example, PLATO-XL even exhibits advanced conversational skills, such as paying compliments and arguing eloquently. The model replies to its partner with sweet words from romantic lyrics and provides reasonable explanations in response to the queries.

4.3 Explorations on Other Conversational Tasks

In addition to open-domain chitchat, there are two other common conversational tasks (Gao et al., 2018): knowledge grounded dialogue, and task-oriented conversation. As such, in the experiments, we also explore the ability of PLATO-XL on these conversational tasks.

Table 4: Automatic evaluation results on knowledge grounded and task-oriented conversations, with best value written in bold.

4.3.1 Task Descriptions

The experiments are carried out on the following conversational tasks:


  • DuConv (Wu et al., 2019) is a Chinese knowledge grounded conversation dataset collected in LUGE (Language Understanding and Generation Evaluation Benchmarks, https://www.luge.ai/). DuConv focuses on proactive conversations towards pre-defined goals and includes 30K dialogues based on movie knowledge graphs.

  • DSTC9-Track1 (Kim et al., 2020) aims to incorporate external knowledge resources to reply to users' out-of-API-coverage queries, and augments the MultiWOZ 2.1 dataset (Eric et al., 2020) with 22K knowledge grounded conversation turns. There are three tasks in DSTC9-Track1: knowledge-seeking turn detection, knowledge selection, and knowledge-grounded response generation. In the experiments, we consider the task of knowledge-grounded response generation.

  • MultiWOZ 2.2 (Zang et al., 2020) is a polished version of MultiWOZ 2.1, including 10K task-oriented conversations across multiple domains. In the experiments, we consider the classical task of dialog state tracking (DST).

4.3.2 Automatic Evaluation

The fine-tuning experiments of PLATO-XL are carried out on these conversational tasks, with automatic evaluation results summarized in Table 4.


  • In DuConv, the model needs to generate the response given related knowledge triplets and lead the conversation to a pre-defined goal. By expanding the network input of PLATO-XL, the conversational goal and knowledge triplets can be easily encoded and grounded for response generation. As compared to the previous state-of-the-art approach – GOKC (Bai et al., 2021), PLATO-XL improves the F1 value by 2.05 points.

  • In DSTC9-Track1, we focus on the evaluation of knowledge grounded response generation. In the experiments, we train and test the models with golden retrieved knowledge snippets. The winning approach in DSTC9-Track1, Knover (He et al., 2021), is also built on pre-trained dialogue models. The comparison reveals that PLATO-XL further improves the performance by 1.62 points.

  • In MultiWOZ 2.2, PLATO-XL learns to generate the dialog state directly given the context. The state-of-the-art result on the MultiWOZ 2.2 leaderboard is produced by DS-DST (Zhang et al., 2020a). In comparison, PLATO-XL improves the performance by a substantial 5.46 points.

The superior performance of PLATO-XL on multiple conversational tasks verifies its potential as a foundation model of conversational AI.

5 Conclusion

In this paper, we explore the large-scale pre-training of dialogue generation and present the 11 billion parameter model PLATO-XL. Experimental results demonstrate that PLATO-XL achieves superior performance as compared with other approaches in both Chinese and English chitchat. Particularly, it is shown that the problems of hallucination and inconsistency are alleviated remarkably in PLATO-XL, which is mainly attributed to the implicit knowledge absorbed in its tremendous number of parameters and to the multi-party aware pre-training. Besides open-domain conversation, PLATO-XL also obtains state-of-the-art results on multiple knowledge grounded and task-oriented conversation tasks, verifying its capacity as a foundation model of conversational AI.

Acknowledgments

We would like to thank Jingzhou He, Tingting Li, and Shiwei Huang for their help with resource coordination; Jianzhong Liang and Long Li for their support on the PaddlePaddle implementation; and Baotong Luo and Dou Hong for their assistance with infrastructure. This work was supported by the National Key Research and Development Project of China (No. 2018AAA0101900).

References

  • D. Adiwardana, M. Luong, D. R. So, J. Hall, N. Fiedel, R. Thoppilan, Z. Yang, A. Kulshreshtha, G. Nemade, Y. Lu, and Q. V. Le (2020) Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977. External Links: Link Cited by: §1, §2.2.
  • J. Bai, Z. Yang, X. Liang, W. Wang, and Z. Li (2021) Learning to copy coherent knowledge for response generation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, pp. 12535–12543. External Links: Link Cited by: 1st item.
  • S. Bao, H. He, F. Wang, R. Lian, and H. Wu (2019) Know more about each other: evolving dialogue strategy via compound assessment. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5382–5391. External Links: Link, Document Cited by: §4.2.1.
  • S. Bao, H. He, F. Wang, H. Wu, H. Wang, W. Wu, Z. Guo, Z. Liu, and X. Xu (2021) PLATO-2: towards building an open-domain chatbot via curriculum learning. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 2513–2525. External Links: Link Cited by: §1, §2.2, §3.1, §3.3, 3rd item.
  • S. Bao, H. He, F. Wang, H. Wu, and H. Wang (2020) PLATO: pre-trained dialogue generation model with discrete latent variable. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 85–96. External Links: Link Cited by: §2.2, §3.1, §3.2.
  • J. Baumgartner, S. Zannettou, B. Keegan, M. Squire, and J. Blackburn (2020) The pushshift reddit dataset. In Proceedings of the International AAAI Conference on Web and Social Media, Vol. 14, pp. 830–839. External Links: Link Cited by: §3.3.
  • T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei (2020) Language models are few-shot learners. In Advances in Neural Information Processing Systems, pp. 1877–1901. External Links: Link Cited by: §1, §2.1.
  • T. Chen, B. Xu, C. Zhang, and C. Guestrin (2016) Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174. External Links: Link Cited by: §3.3.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171–4186. External Links: Link, Document Cited by: §1, §2.1.
  • Z. Du, Y. Qian, X. Liu, M. Ding, J. Qiu, Z. Yang, and J. Tang (2021) All nlp tasks are generation tasks: a general pretraining framework. arXiv preprint arXiv:2103.10360. External Links: Link Cited by: §3.1.
  • M. Eric, R. Goel, S. Paul, A. Sethi, S. Agarwal, S. Gao, A. Kumar, A. Goyal, P. Ku, and D. Hakkani-Tur (2020) MultiWOZ 2.1: a consolidated multi-domain dialogue dataset with state corrections and state tracking baselines. In Proceedings of the 12th Language Resources and Evaluation Conference, pp. 422–428. External Links: Link Cited by: 2nd item.
  • J. Gao, M. Galley, and L. Li (2018) Neural approaches to conversational AI. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts, pp. 2–7. External Links: Link, Document Cited by: §1, §4.3.
  • C. Gunasekara, S. Kim, L. F. D’Haro, A. Rastogi, Y. Chen, M. Eric, B. Hedayatnia, K. Gopalakrishnan, Y. Liu, C. Huang, D. Hakkani-Tür, J. Li, Q. Zhu, L. Luo, L. Liden, K. Huang, S. Shayandeh, R. Liang, B. Peng, Z. Zhang, S. Shukla, M. Huang, J. Gao, S. Mehri, Y. Feng, C. Gordon, S. H. Alavi, D. Traum, M. Eskenazi, A. Beirami, Eunjoon, Cho, P. A. Crook, A. De, A. Geramifard, S. Kottur, S. Moon, S. Poddar, and R. Subba (2020) Overview of the ninth dialog system technology challenge: dstc9. arXiv preprint arXiv:2011.06486. External Links: Link Cited by: §2.2.
  • H. He, H. Lu, S. Bao, F. Wang, H. Wu, Z. Niu, and H. Wang (2021) Learning to select external knowledge with multi-scale negative sampling. arXiv preprint arXiv:2102.02096. External Links: Link Cited by: 2nd item.
  • J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei (2020) Scaling laws for neural language models. arXiv preprint arXiv:2001.08361. External Links: Link Cited by: §1, §2.1.
  • S. Kim, M. Eric, K. Gopalakrishnan, B. Hedayatnia, Y. Liu, and D. Hakkani-Tur (2020) Beyond domain APIs: task-oriented conversational modeling with unstructured knowledge access. In Proceedings of the 21th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pp. 278–289. External Links: Link Cited by: 2nd item.
  • D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations, External Links: Link Cited by: §3.3.
  • P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, S. Riedel, and D. Kiela (2020) Retrieval-augmented generation for knowledge-intensive nlp tasks. In Advances in Neural Information Processing Systems, pp. 9459–9474. External Links: Link Cited by: footnote 1.
  • J. Li, W. Monroe, A. Ritter, D. Jurafsky, M. Galley, and J. Gao (2016) Deep reinforcement learning for dialogue generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1192–1202. External Links: Link, Document Cited by: §4.2.1.
  • C. Liu, R. Lowe, I. V. Serban, M. Noseworthy, L. Charlin, and J. Pineau (2016) How NOT to evaluate your dialogue system: an empirical study of unsupervised evaluation metrics for dialogue response generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2122–2132. External Links: Link, Document Cited by: §4.1.2.
  • G. Marcus (2020) The next decade in AI: four steps towards robust artificial intelligence. arXiv preprint arXiv:2002.06177. External Links: Link Cited by: footnote 1.
  • M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, and M. Auli (2019) Fairseq: a fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pp. 48–53. External Links: Link, Document Cited by: §3.1.
  • H. Ouchi and Y. Tsuboi (2016) Addressee and response selection for multi-party conversation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2133–2143. External Links: Link, Document Cited by: §3.2.
  • W. Qi, Y. Gong, Y. Yan, C. Xu, B. Yao, B. Zhou, B. Cheng, D. Jiang, J. Chen, R. Zhang, H. Li, and N. Duan (2021) ProphetNet-X: large-scale pre-training models for english, chinese, multi-lingual, dialog, and code generation. arXiv preprint arXiv:2104.08006. External Links: Link Cited by: §2.2, 5th item.
  • A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training. Technical report, OpenAI. External Links: Link Cited by: §1, §2.1.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. Technical report, OpenAI. External Links: Link Cited by: §2.1, §3.3, 1st item.
  • C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21 (140), pp. 1–67. External Links: Link Cited by: §1, §2.1.
  • S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He (2020) Zero: memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–16. External Links: Link Cited by: §3.3.
  • A. Roberts, C. Raffel, and N. Shazeer (2020) How much knowledge can you pack into the parameters of a language model? In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pp. 5418–5426. External Links: Link, Document Cited by: footnote 1.
  • S. Roller, E. Dinan, N. Goyal, D. Ju, M. Williamson, Y. Liu, J. Xu, M. Ott, K. Shuster, E. M. Smith, Y. Boureau, and J. Weston (2021) Recipes for building an open-domain chatbot. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics, External Links: Link Cited by: §1, §2.2, 2nd item, §4.2.1, footnote 1.
  • R. Sennrich, B. Haddow, and A. Birch (2016) Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pp. 1715–1725. External Links: Link, Document Cited by: §3.3.
  • E. M. Smith, M. Williamson, K. Shuster, J. Weston, and Y. Boureau (2020) Can you put it all together: evaluating conversational agents’ ability to blend skills. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 2021–2030. External Links: Link, Document Cited by: 2nd item.
  • Y. Sun, S. Wang, S. Feng, S. Ding, C. Pang, J. Shang, J. Liu, X. Chen, Y. Zhao, Y. Lu, W. Liu, Z. Wu, W. Gong, J. Liang, Z. Shang, P. Sun, W. Liu, X. Ouyang, D. Yu, H. Tian, H. Wu, and H. Wang (2021) ERNIE 3.0: large-scale knowledge enhanced pre-training for language understanding and generation. arXiv preprint arXiv:2107.02137. External Links: Link Cited by: §2.1.
  • Y. Wang, P. Ke, Y. Zheng, K. Huang, Y. Jiang, X. Zhu, and M. Huang (2020) A large-scale chinese short-text conversation dataset. In CCF International Conference on Natural Language Processing and Chinese Computing, pp. 91–103. External Links: Link Cited by: §2.2, 4th item.
  • W. Wu, Z. Guo, X. Zhou, H. Wu, X. Zhang, R. Lian, and H. Wang (2019) Proactive human-machine conversation with explicit conversation goal. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3794–3804. External Links: Link, Document Cited by: 1st item.
  • X. Zang, A. Rastogi, S. Sunkara, R. Gupta, J. Zhang, and J. Chen (2020) MultiWOZ 2.2 : a dialogue dataset with additional annotation corrections and state tracking baselines. In Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI, pp. 109–117. External Links: Link, Document Cited by: 3rd item.
  • W. Zeng, X. Ren, T. Su, H. Wang, Y. Liao, Z. Wang, X. Jiang, Z. Yang, K. Wang, X. Zhang, C. Li, Z. Gong, Y. Yao, X. Huang, J. Wang, J. Yu, Q. Guo, Y. Yu, Y. Zhang, J. Wang, H. Tao, D. Yan, Z. Yi, F. Peng, F. Jiang, H. Zhang, L. Deng, Y. Zhang, Z. Lin, C. Zhang, S. Zhang, M. Guo, S. Gu, G. Fan, Y. Wang, X. Jin, Q. Liu, and Y. Tian (2021) PanGu-α: large-scale autoregressive pretrained chinese language models with auto-parallel computation. arXiv preprint arXiv:2104.12369. External Links: Link Cited by: §2.1.
  • J. Zhang, K. Hashimoto, C. Wu, Y. Wang, S. Y. Philip, R. Socher, and C. Xiong (2020a) Find or classify? dual strategy for slot-value predictions on multi-domain dialog state tracking. In Proceedings of the Ninth Joint Conference on Lexical and Computational Semantics, pp. 154–167. External Links: Link Cited by: 3rd item.
  • R. Zhang, H. Lee, L. Polymenakos, and D. Radev (2018) Addressee and response selection in multi-party conversations with speaker interaction rnns. In Thirty-Second AAAI Conference on Artificial Intelligence, External Links: Link Cited by: §3.2.
  • Y. Zhang, S. Sun, M. Galley, Y. Chen, C. Brockett, X. Gao, J. Gao, J. Liu, and B. Dolan (2020b) DialoGPT: large-scale generative pre-training for conversational response generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 270–278. External Links: Link, Document Cited by: §1, §2.2, 1st item.
  • Z. Zhang, Y. Gu, X. Han, S. Chen, C. Xiao, Z. Sun, Y. Yao, F. Qi, J. Guan, P. Ke, Y. Cai, G. Zeng, Z. Tan, Z. Liu, M. Huang, W. Han, Y. Liu, X. Zhu, and M. Sun (2021) CPM-2: large-scale cost-effective pre-trained language models. arXiv preprint arXiv:2106.10715. External Links: Link Cited by: §2.1.
  • Z. Zhang, X. Han, H. Zhou, P. Ke, Y. Gu, D. Ye, Y. Qin, Y. Su, H. Ji, J. Guan, F. Qi, X. Wang, Y. Zheng, G. Zeng, H. Cao, S. Chen, D. Li, Z. Sun, Z. Liu, M. Huang, W. Han, J. Tang, J. Li, and M. Sun (2020c) CPM: a large-scale generative chinese pre-trained language model. arXiv preprint arXiv:2012.00413. External Links: Link Cited by: §2.1.
  • H. Zhou, P. Ke, Z. Zhang, Y. Gu, Y. Zheng, C. Zheng, Y. Wang, C. H. Wu, H. Sun, X. Yang, B. Wen, X. Zhu, M. Huang, and J. Tang (2021) EVA: an open-domain chinese dialogue system with large-scale generative pre-training. arXiv preprint arXiv:2108.01547. External Links: Link Cited by: §2.2, 6th item.
  • L. Zhou, J. Gao, D. Li, and H. Shum (2020) The design and implementation of XiaoIce, an empathetic social chatbot. Computational Linguistics 46 (1), pp. 53–93. External Links: Link, Document Cited by: §2.2, §4.1.1.

Appendix A Scoring Criteria in Human Evaluation

The criteria used in human evaluation are provided in Table 5.

Table 5: Score details of metrics used in human evaluation.