In recent years, considerable effort has been devoted to building open-domain dialogue systems, which must generate responses to users’ input posts in open domains. Early works on open-domain dialogue systems mainly depended on RNN-based sequence-to-sequence (Seq2Seq) models Vinyals and Le (2015); Shang et al. (2015). With the development of pre-trained language models such as GPT Radford et al. (2018), BART Lewis et al. (2020) and T5 Raffel et al. (2020), the latest works in this area build open-domain dialogue systems on large-scale generative pre-trained models, including DialoGPT Zhang et al. (2020a), Meena Adiwardana et al. (2020) and Blender Roller et al. (2021). Equipped with large amounts of dialogue data collected from social media, these models can generate human-like responses and improve the engagingness of human-bot conversations.
However, most of the dialogue models based on large-scale pre-training are built for English. We argue that existing works on open-domain Chinese dialogue systems are limited in model and data sizes. For example, CDial-GPT Wang et al. (2020) (with 104M parameters) is pre-trained on 12M Chinese dialogues from Weibo (https://weibo.com/). PLATO-2 Bao et al. (2020) (with 336M parameters) is pre-trained on 1.2B Chinese dialogues from social media. The limited scale of the publicly available dialogue data hinders us from building Chinese pre-trained dialogue models that can generate high-quality responses on open-domain topics.
|Dataset|Sessions|Utterances|Tokens|Avg. utterances per session|Avg. tokens per utterance|Storage size|
|---|---|---|---|---|---|---|
|LCCC-base Wang et al. (2020)|6.8M|20.0M|232.3M|2.9|11.6|911MB|
|LCCC-large Wang et al. (2020)|12.0M|32.9M|380.1M|2.7|11.6|1.5GB|
|PLATO-2 Bao et al. (2020)|1.2B|-|-|-|-|-|
|STC Shang et al. (2015)|4.4M|8.9M|158.1M|2|25.2|642MB|
|Douban Conversation Wu et al. (2017)|1.0M|7.1M|131.7M|6.7|18.6|535MB|
|PersonalDialog Zheng et al. (2019)|20.8M|56.2M|525.9M|2.7|9.4|2.1GB|
|PchatbotW Qian et al. (2021)|139.0M|278.9M|8.5B|2|30.5|50GB|
|PchatbotL Qian et al. (2021)|59.4M|118.9M|3.0B|2|25.5|19GB|
In this paper, we build an open-domain Chinese dialogue system called EVA, which contains the largest Chinese dialogue model with 2.8B parameters and is pre-trained on WDC-Dialogue, a dataset of 1.4B Chinese context-response pairs from different domains. First, we construct the WDC-Dialogue dataset by collecting repost, comment, and Q&A data from various social media platforms and refactoring them into dialogue sessions. Strict filtering rules are also devised to ensure the quality of the WDC-Dialogue dataset. Second, we train a large-scale Transformer-based encoder-decoder model on the Chinese dialogue data. To verify the effectiveness of our model, we conduct extensive automatic and human evaluation. In the automatic evaluation, we test our model on four datasets to show its generation ability when dealing with different categories of contexts. Moreover, observational and interactive human evaluations are adopted to evaluate our model in real human-bot conversation scenarios. Finally, we provide an interactive demonstration system for users to converse with EVA.
Our contributions are mainly as follows:
- We collect the largest Chinese dialogue dataset called WDC-Dialogue from different domains, which contains 1.4B context-response pairs. The data quality is controlled by strict rules.
- We build an open-domain dialogue system called EVA, which contains the largest Chinese pre-trained dialogue model with 2.8B parameters. Extensive experiments on automatic and human evaluation show the effectiveness of our model.
- We release an interactive demonstration system for users to converse with EVA on open-domain topics.
We construct a dataset named WDC-Dialogue from Chinese social media to train EVA. Specifically, conversations from various sources are gathered and a rigorous data cleaning pipeline is designed to ensure the quality of WDC-Dialogue. This section details the data collection and cleaning process used in our study.
2.1 Data Collection
Dialogues in the WDC-Dialogue dataset originate from the textual interaction among different users on the Internet. Generally, these interactions can be classified into three categories: 1) The interactions exhibited through the repost behaviour on social media; 2) The interactions established through the comment / reply action on various online forums; 3) The interactions about online question and answer (Q&A) exchanges. Each round of these textual interactions yields a dialogue session. We design specific parsing rules to extract dialogues from these three kinds of interactions.
Repost is a common feature provided by most social media platforms, which allows users to broadcast the posts created by others and add their own comments to these original posts (such as the Quote Tweet feature on Twitter). Each repost can be further broadcast by other users, thereby forming a chain of user replies. This chain can be refactored into a dialogue session.
In practice, we observe that such an interaction pattern yields a reply tree, in which the root node is the original post, and the other nodes consist of the comments added in the broadcasting process. Each node may have multiple child nodes, which denote the comments left when this node is reposted. Once the reply tree is constructed, each path from the root to a leaf can be regarded as a dialogue session.
In this study, we target several Chinese social platforms. Specifically, the raw data of reposts are first collected and further parsed to construct the reply trees. Dialogues are obtained using a Depth-First-Search algorithm to traverse all the paths from each root node to their leaf nodes. This process helps to collect dialogues containing multiple turns of interaction.
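The path extraction described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the adjacency-dictionary representation of the reply tree, the function name `root_to_leaf_paths`, and the toy node identifiers are all assumptions made for the example.

```python
from typing import Dict, List


def root_to_leaf_paths(children: Dict[str, List[str]], root: str) -> List[List[str]]:
    """Enumerate every root-to-leaf path of a reply tree with an iterative DFS.

    `children` maps a node id to the ids of the reposts made on top of it
    (a hypothetical representation chosen for this sketch).
    """
    paths: List[List[str]] = []
    stack = [(root, [root])]
    while stack:
        node, path = stack.pop()
        kids = children.get(node, [])
        if not kids:
            # A leaf closes one dialogue session.
            paths.append(path)
            continue
        # Push in reverse so that children are visited in their original order.
        for kid in reversed(kids):
            stack.append((kid, path + [kid]))
    return paths


# Toy reply tree: the original post has two reposts, one of which is reposted again.
reply_tree = {"post": ["c1", "c2"], "c1": ["c3"]}
sessions = root_to_leaf_paths(reply_tree, "post")
```

In this toy tree, the two extracted sessions are `post → c1 → c3` and `post → c2`; the deeper branch yields the multi-turn dialogue.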
Comment is another common feature that facilitates textual interactions among different users who surf the Internet. It allows users to share their opinion by leaving textual comments, which can be further replied to by others. Such an interaction pattern can be regarded as a form of conversation among users.
In this study, we target various Chinese forums. The raw data of posts and their following comments are collected. Ideally, these raw data can also be parsed to form a reply tree because each comment may have multiple replies. However, compared with the repost data, the collection of comment data is less flexible because some of the HTML pages from the front-end interface of forums do not provide the detailed reply information of each comment. As a consequence, the depth of the reply trees is limited, and the dialogues obtained from comment data have fewer turns than the dialogues originating from repost data.
Q&A is a special kind of interaction among users on the Internet. In online Q&A platforms such as Quora (https://www.quora.com/) or Zhihu (https://www.zhihu.com/), users post questions on various topics together with a detailed description. Other people tend to provide answers rich in background, opinions, experiences, and knowledge. We regard a post and each of its corresponding answers as a single-turn conversation.
2.2 Data Quality Control
Textual data from online social media carry various kinds of noise such as advertisements, hate speech, profanity and informal internet slang. Some of the contents even carry sensitive information such as user privacy. Models trained on these data can easily become biased toward such noisy contents. To improve the quality of the WDC-Dialogue dataset, we design a rigorous process to clean the dialogues.
We follow a process similar to that of Wang et al. (2020) to filter out noisy contents with a series of rules: (1) delete the platform-related tags in the dialogues, such as "Reply to @***" and "Repost//@***"; (2) remove URL strings from the text; (3) split conversations with more than 30 turns into multiple conversations of fewer than 30 turns Shang et al. (2015); (4) keep only one copy of phrases or words that repeat more than 6 times in one sentence; (5) remove dialogues that contain responses that are too long or too short; (6) remove dialogues if the response is identified as an advertisement by the method introduced in Wang et al. (2013); (7) remove dialogues if 90% of tri-grams in the response are high-frequency tri-grams Zhang et al. (2020a); (8) remove dialogues if the response has some specific forms of generic responses; (9) remove dialogues in which the response is the same as the post.
We also manually construct a word list containing the following noise: (1) dirty words, sensitive words, and dialect; (2) special topic words such as the name of some rare virus or compound; (3) name, appellation and unknown abbreviation; (4) special symbols and emojis; (5) platform signs such as the words which are related to ads, pictures, and videos. A dialogue will be removed from our dataset if it contains words in this word list.
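A few of the filters above can be sketched as follows. The sketch covers only rules (1), (2), (4), (5), (9) and the word-list check. It is an illustration under stated assumptions, not the pipeline used to build WDC-Dialogue: the regular expressions, the length thresholds `min_len` and `max_len`, the example blocklist, and the function names are all hypothetical.

```python
import re
from typing import Iterable, Optional, Tuple

# Hypothetical patterns; the real platform tags are platform- and language-specific.
TAG_RE = re.compile(r"Reply to\s*@\S+|Repost\s*//\s*@\S+")  # rule (1)
URL_RE = re.compile(r"https?://\S+")                         # rule (2)
REPEAT_RE = re.compile(r"(.+?)\1{6,}")                       # rule (4): 7 or more copies


def clean_utterance(text: str) -> str:
    """Strip platform tags and URLs, then collapse heavily repeated phrases."""
    text = TAG_RE.sub("", text)
    text = URL_RE.sub("", text)
    text = REPEAT_RE.sub(r"\1", text)
    return text.strip()


def clean_pair(
    post: str,
    response: str,
    blocklist: Iterable[str] = (),
    min_len: int = 2,     # assumed threshold for rule (5)
    max_len: int = 128,   # assumed threshold for rule (5)
) -> Optional[Tuple[str, str]]:
    """Return the cleaned (post, response) pair, or None if the pair is dropped."""
    post, response = clean_utterance(post), clean_utterance(response)
    if not (min_len <= len(response) <= max_len):
        return None  # rule (5): response too short or too long
    if response == post:
        return None  # rule (9): response copies the post
    if any(word in post or word in response for word in blocklist):
        return None  # word-list filter
    return post, response
```

For instance, `clean_utterance("Reply to @user42 nice photo http://t.cn/abc")` yields `"nice photo"`, and a response that merely repeats its post is discarded by `clean_pair`.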
2.3 Data Statistics
Table 1 shows the statistics of the filtered WDC-Dialogue dataset and other Chinese dialogue datasets. To the best of our knowledge, WDC-Dialogue is the largest Chinese dialogue dataset, with 1.4B context-response pairs and the largest number of utterances, tokens and storage size.
EVA is a Transformer-based dialogue model with a bi-directional encoder and a uni-directional decoder Vaswani et al. (2017). We present EVA’s model details and a comparison with previous large-scale Chinese pre-trained dialogue models in Table 2. EVA is roughly 8 times the size of the previous largest Chinese dialogue model, PLATO-2.
Since Chinese words, which carry specific meanings, are usually composed of several characters, a traditional character-level vocabulary loses the important semantics of Chinese words or phrases. Thus, we construct a sub-word vocabulary, containing both Chinese characters and Chinese words, based on the word-segmented corpus using a unigram language model Kudo and Richardson (2018). The sub-word vocabulary contains 30,000 tokens.
3.3 Pre-Training Details
We use the sequence-to-sequence language modeling task Sutskever et al. (2014) to train our model. Specifically, for a dialogue session with $n$ utterances, we train the model to generate the $n$-th utterance from the decoder conditioned on the previous $n-1$ utterances, which are fed to the encoder. The model is trained with the teacher-forcing paradigm.
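The objective can be written out as a standard teacher-forced negative log-likelihood. The notation below is ours and is introduced only for illustration: a session is $(u_1, \dots, u_n)$, the target utterance $u_n$ consists of tokens $u_{n,1}, \dots, u_{n,T}$, and $\theta$ denotes the model parameters.

```latex
\mathcal{L}(\theta)
  = -\sum_{t=1}^{T} \log P_{\theta}\!\left(u_{n,t} \,\middle|\, u_{n,<t},\; u_1, \dots, u_{n-1}\right)
```

Here the context $u_1, \dots, u_{n-1}$ enters through the encoder, while the gold prefix $u_{n,<t}$ is fed to the decoder at each step, which is what teacher forcing refers to.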
To reduce the GPU memory consumption, we adopt mixed-precision training Micikevicius et al. (2018) and ZeRO (stage-1) Rajbhandari et al. (2020), which partitions the optimizer states across multiple data-parallel processes.
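A rough back-of-the-envelope estimate shows why partitioning the optimizer states matters at this scale. The sketch below follows the per-parameter accounting for mixed-precision Adam given in Rajbhandari et al. (2020): 2 bytes for fp16 weights, 2 bytes for fp16 gradients, and 12 bytes for the fp32 master weights, momentum and variance. It assumes Adam is the optimizer, it ignores activations and buffers, and the choice of 8 data-parallel processes is purely hypothetical, since the number used for EVA is not stated here.

```python
def model_state_gb(num_params: int, num_dp_ranks: int = 1, zero_stage1: bool = False) -> float:
    """Per-GPU memory (in GB, 1 GB = 1e9 bytes) for weights, gradients and Adam states."""
    weights_and_grads = 4 * num_params   # fp16 weights (2 B) + fp16 gradients (2 B)
    optimizer_states = 12 * num_params   # fp32 master weights + momentum + variance
    if zero_stage1:
        # ZeRO stage 1 shards only the optimizer states across data-parallel ranks.
        optimizer_states = optimizer_states / num_dp_ranks
    return (weights_and_grads + optimizer_states) / 1e9


EVA_PARAMS = 2_800_000_000  # 2.8B parameters, as reported for EVA
baseline_gb = model_state_gb(EVA_PARAMS)
zero1_gb = model_state_gb(EVA_PARAMS, num_dp_ranks=8, zero_stage1=True)  # 8 ranks is an assumption
```

Under these assumptions the model states shrink from about 44.8 GB per GPU to about 15.4 GB per GPU.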
We set the maximum encoder length and maximum decoder length to 128 to ensure that most utterances are not truncated during training. However, short utterances are heavily padded if we view each context-response pair as a data sample; such heavy padding is a bottleneck in pre-training efficiency. To address this challenge, we propose a data sampling strategy that allows a data sample to contain multiple context-response pairs, as illustrated in Figure 1. Specifically, we concatenate multiple context-response pairs as a data sample and distinguish different pairs with attention masks for the encoder self-attention, decoder self-attention, and cross attention. Note that EVA adopts relative position embeddings Raffel et al. (2020), which are compatible with our data sampling strategy.
We collect four datasets which have no overlap with our pre-training corpus to test pre-trained dialogue models in a zero-shot setting. These test sets cover the following dialogue scenarios: 1) Single: This test set contains dialogues with only one utterance as the context. 2) Multi: This test set includes dialogues with multiple utterances as the context. 3) Long: This test set contains dialogues where the response is longer than the context. 4) QA: This test set includes dialogues where the last utterance of the context is a question. The statistics of these four test sets are shown in Table 3.
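The masking idea behind this strategy can be sketched with segment ids. This is a simplified illustration, not the training code: the function names, the use of segment id 0 for padding, and the toy lengths are assumptions. A token may attend only to tokens of the same context-response pair, and the decoder self-attention is additionally causal.

```python
from typing import List


def segment_ids(lengths: List[int], max_len: int) -> List[int]:
    """Give every token of the k-th packed sequence the id k (1-based); 0 marks padding."""
    ids: List[int] = []
    for k, n in enumerate(lengths, start=1):
        ids.extend([k] * n)
    if len(ids) > max_len:
        raise ValueError("packed sample exceeds max_len")
    return ids + [0] * (max_len - len(ids))


def build_mask(q_seg: List[int], k_seg: List[int], causal: bool = False) -> List[List[int]]:
    """mask[i][j] is 1 if query position i may attend to key position j, else 0."""
    mask = []
    for i, qs in enumerate(q_seg):
        row = []
        for j, ks in enumerate(k_seg):
            same_pair = qs != 0 and qs == ks
            allowed = same_pair and (not causal or j <= i)
            row.append(1 if allowed else 0)
        mask.append(row)
    return mask


# Two packed pairs: contexts of 3 and 2 tokens, responses of 2 and 2 tokens.
enc_seg = segment_ids([3, 2], max_len=6)   # [1, 1, 1, 2, 2, 0]
dec_seg = segment_ids([2, 2], max_len=5)   # [1, 1, 2, 2, 0]

enc_self_mask = build_mask(enc_seg, enc_seg)               # encoder self-attention
dec_self_mask = build_mask(dec_seg, dec_seg, causal=True)  # decoder self-attention
cross_mask = build_mask(dec_seg, enc_seg)                  # decoder-to-encoder attention
```

All three masks are block-diagonal over the packed pairs, so no pair can see another. Because relative position embeddings depend only on the offset between two positions inside the same pair, shifting a pair to a later slot in the packed sample leaves those offsets unchanged.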
We adopt several Chinese pre-trained models as our baselines:
CDial-GPT: This Chinese pre-trained dialogue model with 104M parameters is pre-trained on LCCC, which contains 12M dialogue sessions Wang et al. (2020).
CPM: This model is a general Chinese pre-trained model with 2.6B parameters, which is pre-trained on 100GB Chinese data including encyclopedia, news, novels, and Q&A Zhang et al. (2020b). Since CPM cannot be directly applied to generating responses for dialogue contexts, we follow the original paper to condition the language model on a prompt of several example context-response pairs.
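Conditioning a general language model on example pairs can be sketched as simple prompt construction. The template below, including the `A:` / `B:` speaker labels, the newline separator and the function name, is a hypothetical format chosen for illustration; the exact prompt used with CPM is not specified here.

```python
from typing import List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]], context: str, sep: str = "\n") -> str:
    """Concatenate example context-response pairs, then the test context with an open response slot."""
    lines = []
    for example_context, example_response in examples:
        lines.append(f"A: {example_context}")
        lines.append(f"B: {example_response}")
    lines.append(f"A: {context}")
    lines.append("B:")  # the language model continues from here
    return sep.join(lines)


prompt = build_few_shot_prompt([("hi", "hello")], "how are you")
```

The text generated after the final `B:` would then be taken as the model's response.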
Note that we do not choose PLATO-2 Bao et al. (2020) as our baseline because the authors have not released the Chinese pre-trained dialogue model.
4.3 Automatic Evaluation
We adopt F1 Dinan et al. (2019), ROUGE Lin (2004), BLEU Papineni et al. (2002), and Distinct n-grams (Dist-n) Li et al. (2016) as automatic metrics. The former three metrics evaluate the relevance between the generated responses and the references, while the last one measures the diversity of generated responses.
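As a concrete illustration of the diversity metric, Dist-n is commonly computed at the corpus level as the number of unique n-grams divided by the total number of n-grams in the generated responses. The sketch below follows that common convention, which is an assumption and may differ in detail from the evaluation script used here; it also assumes the responses are already tokenized.

```python
from typing import List, Sequence


def distinct_n(responses: Sequence[List[str]], n: int) -> float:
    """Corpus-level Dist-n: unique n-grams divided by total n-grams over all responses."""
    total = 0
    unique = set()
    for tokens in responses:
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0


generated = [["i", "am", "fine"], ["i", "am", "ok"]]
dist_1 = distinct_n(generated, 1)  # 4 unique unigrams out of 6
dist_2 = distinct_n(generated, 2)  # 3 unique bigrams out of 4
```

A model that repeats the same generic reply for every context drives both ratios toward zero, which is why the metric is read as a proxy for diversity.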
The results on the four test sets and the overall results are provided in Table 4. We can see that EVA outperforms both competitors on the relevance metrics, which shows that our model can generate high-quality responses that have more overlap with human references. EVA also surpasses CDial-GPT in terms of diversity, while performing worse than CPM. We conjecture that the high diversity of CPM may result from more diverse pre-training corpora (rather than the mere dialogue corpus adopted by EVA). We also observe that EVA achieves relatively stable performance on four test sets, indicating that EVA can deal with different kinds of contexts.
4.4 Observational Human Evaluation
We follow Adiwardana et al. (2020) to adopt sensibleness, specificity, and the average of sensibleness and specificity (SSA) as our evaluation metrics. Specifically, sensibleness measures whether the response is fluent, readable, and coherent with the context. Specificity measures whether the response is specific and informative. Note that the specificity score will be set to 0 if the sensibleness score is 0. We randomly sampled 200 dialogues from the test set, where each dialogue is judged by 3 annotators.
The results are shown in Figure 2. We can see that EVA achieves remarkably higher specificity and SSA scores than the baselines, and obtains a sensibleness score comparable to that of CDial-GPT. This demonstrates that EVA can generate more specific and informative responses while maintaining good fluency and contextual coherence. By contrast, CDial-GPT tends to generate safe and generic responses, which are fluent in most cases but convey little specific meaning. Thus, it obtains the lowest specificity among the three models.
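The scoring rule can be sketched as follows. This is an illustration under assumptions: each judgment is taken to be a binary (sensible, specific) pair, and all judgments are pooled with equal weight, since the way the three annotators' labels are aggregated is not described here.

```python
from typing import List, Tuple


def ssa_scores(labels: List[Tuple[int, int]]) -> Tuple[float, float, float]:
    """Return (sensibleness, specificity, SSA) from binary (sensible, specific) judgments."""
    n = len(labels)
    sensibleness = sum(sensible for sensible, _ in labels) / n
    # A response judged not sensible cannot count as specific.
    specificity = sum(specific if sensible else 0 for sensible, specific in labels) / n
    return sensibleness, specificity, (sensibleness + specificity) / 2


judgments = [(1, 1), (1, 0), (0, 1), (1, 1)]
sensibleness, specificity, ssa = ssa_scores(judgments)
```

In this toy example the third judgment is marked specific but not sensible, so it contributes nothing to specificity.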
4.5 Interactive Human Evaluation
We also follow the existing work Adiwardana et al. (2020) to conduct interactive human evaluation, which simulates real scenarios of human-bot conversations. We adopt the same metrics as in the observational human evaluation. For each system, we asked participants to converse with it for at least 10 turns (5 from users and 5 from systems), and to score every utterance from the system based on sensibleness and specificity. In total, we evaluated 60 sessions for each system.
The results are shown in Figure 3. We can observe that EVA obtains the highest sensibleness, specificity and SSA scores, showing the strongest ability of response generation in multi-turn human-bot interaction.
5 Case Study
To intuitively show the generation ability of our model, we provide some generated cases in Table 5 and Table 6. We can observe from Table 5 that EVA can generate more specific and relevant responses compared with the baselines. Table 6 shows a challenging case in interactive human evaluation, which demonstrates the strong ability of EVA in multi-turn human-bot interactions.
6 Interactive Demo
In addition to the above dialogue system, we also release an interactive demonstration system that enables researchers to converse with our system conveniently. It is a front-end interactive toolkit which can be easily set up through a JSON configuration file to communicate with the back-end dialogue system. In addition to the EVA model, developers can also deploy any other dialogue system through the configuration file. As shown on the left side of Figure 4, there are 8 different dialogue models. By using the interactive demonstration system, users can communicate with the dialogue system and rate the system responses. Our main concern is the sensibleness and specificity of the system responses. As shown on the right side of Figure 4, the response of the EVA model in the first turn is both sensible and specific, and the response in the second turn is sensible but not that specific.
We propose an open-domain Chinese dialogue system called EVA, which contains the largest Chinese pre-trained dialogue model with 2.8B parameters. To train EVA, we collect the largest Chinese dialogue dataset called WDC-Dialogue containing 1.4B context-response pairs, which is filtered by strict and effective rules. We conduct extensive experiments on automatic and human evaluation to show the effectiveness of our model.
- Adiwardana et al. (2020) Daniel Adiwardana, Minh-Thang Luong, David R So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, et al. 2020. Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977.
- Bao et al. (2020) Siqi Bao, Huang He, Fan Wang, Hua Wu, Haifeng Wang, Wenquan Wu, Zhen Guo, Zhibin Liu, and Xinchao Xu. 2020. Plato-2: Towards building an open-domain chatbot via curriculum learning. arXiv preprint arXiv:2006.16779.
- Dinan et al. (2019) Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. 2019. Wizard of wikipedia: Knowledge-powered conversational agents. In 7th International Conference on Learning Representations.
- Kudo and Richardson (2018) Taku Kudo and John Richardson. 2018. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of EMNLP.
- Lewis et al. (2020) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880.
- Li et al. (2016) Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 110–119, San Diego, California. Association for Computational Linguistics.
- Lin (2004) Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
- Liu et al. (2016) Chia-Wei Liu, Ryan Lowe, Iulian Serban, Mike Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.
- Micikevicius et al. (2018) Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. 2018. Mixed precision training. In Proceedings of ICLR.
- Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
- Qian et al. (2021) Hongjin Qian, Xiaohe Li, Hanxun Zhong, Yu Guo, Yueyuan Ma, Yutao Zhu, Zhanliang Liu, Zhicheng Dou, and Ji-Rong Wen. 2021. Pchatbot: A large-scale dataset for personalized chatbot. In Proceedings of the SIGIR 2021. ACM.
- Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. In OpenAI Technical Report.
- Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67.
- Rajbhandari et al. (2020) Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020. Zero: Memory optimizations toward training trillion parameter models. In Proceedings of SC.
- Roller et al. (2021) Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Eric Michael Smith, Y-Lan Boureau, and Jason Weston. 2021. Recipes for building an open-domain chatbot. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 300–325.
- Shang et al. (2015) Lifeng Shang, Zhengdong Lu, and Hang Li. 2015. Neural responding machine for short-text conversation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers).
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 3104–3112.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS 2017, 4-9 December 2017, Long Beach, CA, USA, pages 5998–6008.
- Vinyals and Le (2015) Oriol Vinyals and Quoc Le. 2015. A neural conversational model. arXiv preprint arXiv:1506.05869.
- Wang et al. (2013) Hao Wang, Zhengdong Lu, Hang Li, and Enhong Chen. 2013. A dataset for research on short-text conversations. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 935–945, Seattle, Washington, USA. Association for Computational Linguistics.
- Wang et al. (2020) Yida Wang, Pei Ke, Yinhe Zheng, Kaili Huang, Yong Jiang, Xiaoyan Zhu, and Minlie Huang. 2020. A large-scale chinese short-text conversation dataset. In Natural Language Processing and Chinese Computing - 9th CCF International Conference, volume 12430, pages 91–103.
- Wu et al. (2017) Yu Wu, Wei Wu, Chen Xing, Ming Zhou, and Zhoujun Li. 2017. Sequential matching network: A new architecture for multi-turn response selection in retrieval-based chatbots. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 496–505, Vancouver, Canada. Association for Computational Linguistics.
- Zhang et al. (2020a) Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and William B Dolan. 2020a. Dialogpt: Large-scale generative pre-training for conversational response generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 270–278.
- Zhang et al. (2020b) Zhengyan Zhang, Xu Han, Hao Zhou, Pei Ke, Yuxian Gu, Deming Ye, Yujia Qin, YuSheng Su, Haozhe Ji, Jian Guan, Fanchao Qi, Xiaozhi Wang, Yanan Zheng, Guoyang Zeng, Huanqi Cao, Shengqi Chen, Daixuan Li, Zhenbo Sun, Zhiyuan Liu, Minlie Huang, Wentao Han, Jie Tang, Juanzi Li, Xiaoyan Zhu, and Maosong Sun. 2020b. CPM: A large-scale generative chinese pre-trained language model. arXiv preprint arXiv: 2012.00413.
- Zheng et al. (2019) Yinhe Zheng, Guanyi Chen, Minlie Huang, Song Liu, and Xuan Zhu. 2019. Personalized dialogue generation with diversified traits. arXiv preprint arXiv:1901.09672.