In recent years, there has been a significant increase in research on conversational recommendation due to the rise of voice-based bots Kang et al. (2019); Li et al. (2018); Sun and Zhang (2018); Christakopoulou et al. (2016); Warnestal (2005). These works focus on how to provide recommendation services in a more user-friendly manner through dialog-based interactions. They fall into two categories: (1) task-oriented dialog-modeling approaches that require pre-defined user intents and slots Warnestal (2005); Christakopoulou et al. (2016); Sun and Zhang (2018); (2) non-task dialog-modeling approaches that can conduct more free-form interactions for recommendation, without pre-defined user intents and slots Li et al. (2018); Kang et al. (2019). Recently, increasing effort has been devoted to the second category, and many datasets have been created, including English dialog datasets Dodge et al. (2016); Li et al. (2018); Kang et al. (2019); Moon et al. (2019); Hayati et al. (2020) and Chinese dialog datasets Liu et al. (2020b); Zhou et al. (2020).
However, to the best of our knowledge, almost all of these datasets are constructed in a single-language setting, and there is no publicly available multilingual dataset for conversational recommendation. Previous work on other NLP tasks has shown that multilingual corpora can bring performance improvements over a monolingual task setting, for example in task-oriented dialog Schuster et al. (2019b), semantic parsing Li et al. (2021), QA and reading comprehension Jing et al. (2019); Lewis et al. (2020); Artetxe et al. (2020); Clark et al. (2020); Hu et al. (2020); Hardalov et al. (2020), machine translation Johnson et al. (2017), document classification Lewis et al. (2004); Klementiev et al. (2012); Schwenk and Li (2018), semantic role labeling Akbik et al. (2015), and NLI Conneau et al. (2018). It is therefore worthwhile to create a multilingual conversational recommendation dataset: it might enhance model performance compared with the monolingual training setting, and it provides a new benchmark for the study of multilingual modeling techniques.
To facilitate the study of this challenge, we present a bilingual parallel recommendation dialog dataset, DuRecDial 2.0, for multilingual and cross-lingual conversational recommendation. DuRecDial 2.0 consists of 8.2K dialogs aligned across two languages, English and Chinese (16.5K dialogs and 255K utterances in total). Table 1 shows the differences between DuRecDial 2.0 and existing conversational recommendation datasets. We also analyze DuRecDial 2.0 in depth and find that it offers more diversified utterance prefixes and thus a more flexible language style, as shown in Figures 2(a) and 2(b).
We define five tasks on this dataset, as shown in Figure 1. The first two tasks are English and Chinese monolingual conversational recommendation, where dialog context, knowledge, dialog goal, and response are all in the same language. They allow us to investigate how the performance of the same model varies across two different languages. The third task is multilingual conversational recommendation: we directly mix training instances of the two languages into a single training set and train a single model to handle both English and Chinese conversational recommendation at the same time. The last two tasks are cross-lingual conversational recommendation, where model input and output are in different languages, e.g., the dialog context is in English (or Chinese) and the generated response is in Chinese (or English).
To address these tasks, we build baselines using XNLG Chi et al. (2020) (https://github.com/CZWin32768/XNLG) and mBART Liu et al. (2020a) (https://github.com/pytorch/fairseq/). We conduct an empirical study of these baselines on DuRecDial 2.0, and the experimental results indicate that the use of additional English data can bring performance improvements for Chinese conversational recommendation.
In summary, this work makes the following contributions:
- To facilitate the study of multilingual and cross-lingual conversational recommendation, we create a novel dataset, DuRecDial 2.0, the first publicly available bilingual parallel dataset for conversational recommendation.
- We define five tasks, including monolingual, multilingual, and cross-lingual conversational recommendation, based on DuRecDial 2.0.
- We establish monolingual, multilingual, and cross-lingual conversational recommendation baselines on DuRecDial 2.0. The results of automatic and human evaluation confirm the benefits of this bilingual dataset for Chinese conversational recommendation.
| Dataset | Language | Parallel | #Dialogs | #Utterances | Dialog types | Domains |
|---|---|---|---|---|---|---|
| Facebook_Rec Dodge et al. (2016) | EN | ✗ | 1M | 6M | Rec. | Movie |
| REDIAL Li et al. (2018) | EN | ✗ | 10k | 163k | Rec., chitchat | Movie |
| GoRecDial Kang et al. (2019) | EN | ✗ | 9k | 170k | Rec. | Movie |
| OpenDialKG Moon et al. (2019) | EN | ✗ | 12k | 143k | Rec. | Movie, book |
| DuRecDial Liu et al. (2020b) | ZH | ✗ | 10.2k | 156k | Rec., chitchat, QA, task | Movie, music, star, food, restaurant, news, weather |
| TG-ReDial Zhou et al. (2020) | ZH | ✗ | 10k | 129k | Rec. | Movie |
| INSPIRED Hayati et al. (2020) | EN | ✗ | 1k | 35k | Rec. | Movie |
| DuRecDial 2.0 (Ours) | EN-ZH | ✓ | 16.5k | 255k | Rec., chitchat, QA, task | Movie, music, star, food, restaurant, weather |
2 Related Work
Datasets for Conversational Recommendation To facilitate the study of conversational recommendation, multiple datasets have been created in previous work, as shown in Table 1. The first recommendation dialog dataset was released by Dodge et al. (2016); it is a synthetic dialog dataset built with the classic MovieLens ratings dataset and natural language templates. Li et al. (2018) create a human-to-human multi-turn recommendation dialog dataset that combines elements of social chitchat and recommendation dialogs. Kang et al. (2019) provide a recommendation dialog dataset with clear goals, and Moon et al. (2019) collect a parallel dialog-KG corpus for recommendation. Liu et al. (2020b) construct a human-to-human conversational recommendation dataset that contains 4 dialog types and 7 domains, with clear goals to achieve during each conversation and user profiles for personalized conversation. Zhou et al. (2020) automatically collect a conversational recommendation dataset built from movie data. Hayati et al. (2020) provide a conversational recommendation dataset with additional annotations for sociable recommendation strategies. Compared with these datasets, each dialog in DuRecDial 2.0, attached with seeker profiles, knowledge triples, and a goal sequence, is parallel in English and Chinese.
Multilingual and Cross-lingual Datasets for Dialog Modeling Dialogue systems are commonly categorized as task-oriented or chit-chat systems. Several multilingual task-oriented dialogue datasets have been published Mrkšić et al. (2017b); Schuster et al. (2019a), enabling the evaluation of approaches for cross-lingual dialogue systems. Mrkšić et al. (2017b) annotated two languages (German and Italian) for the dialogue state tracking dataset WOZ 2.0 Mrkšić et al. (2017a) and trained a unified framework to cope with multiple languages. Meanwhile, Schuster et al. (2019a) introduced a multilingual NLU dataset and highlighted the need for more sophisticated cross-lingual methods. These datasets mainly focus on multilingual NLU and DST for task-oriented dialogue and are not parallel. In comparison, DuRecDial 2.0 is a bilingual parallel dataset for conversational recommendation. Multilingual chit-chat datasets are relatively scarce. Lin et al. (2020) propose a multilingual Persona-Chat dataset, XPersona, by extending the Persona-Chat corpora Dinan et al. (2019) to six languages: Chinese, French, Indonesian, Italian, Korean, and Japanese. In XPersona, the training sets are automatically translated using translation APIs, while the validation and test sets are annotated by humans. XPersona focuses on personalized cross-lingual chit-chat generation, while DuRecDial 2.0 focuses on multilingual and cross-lingual conversational recommendation.
3 Dataset Collection
DuRecDial 2.0 is designed to collect highly parallel data to facilitate the study of monolingual, multilingual and cross-lingual conversational recommendation.
In this section, we describe the three steps for dataset construction: (1) Constructing the parallel data item; (2) Collecting conversation utterances by crowdsourcing; (3) Collecting knowledge triples by crowdsourcing.
3.1 Parallel Data Item Construction
To collect parallel data, we follow the task design of previous work Liu et al. (2020b) and use the same annotation rules, so parallel data items (e.g., knowledge graph, user profile, task templates, and conversation situation) are essential.
Parallel knowledge graph The domains covered in DuRecDial Liu et al. (2020b) include star, movie, music, news, food, POI, and weather. As the quality of automatically translated news texts is poor, we remove the news domain and keep the others. For the weather domain, we construct the parallel knowledge as follows: 1) decompose the Chinese weather information into several aspects (e.g., the highest temperature, the lowest temperature, wind direction, etc.); 2) multiple crowdsourced annotators translate and combine these aspects to generate the parallel English weather information. For the other domains, the edges of the knowledge graph are translated by multiple crowdsourced annotators, and the nodes are constructed as follows:
We crawl the English names of movies, stars, music, food, and restaurants from several related websites for each domain (e.g., https://baike.baidu.com/, http://www.mtime.com, and https://maoyan.com/ for the movie and star domains; https://music.163.com/ and https://y.qq.com/ for music; https://www.meituan.com and https://wenku.baidu.com/ for the food and POI domains). If at least two websites give the same English name, it is used to construct the parallel knowledge graph.
If the English names are different, crowdsourced annotators choose one of the candidate English names crawled above to construct the parallel knowledge graph.
Otherwise, multiple crowdsourced annotators translate the Chinese nodes into English.
Following these rules, we finally obtain 16,556 bilingual parallel nodes and 254 parallel edges, resulting in about 123,298 parallel knowledge triples, with an accuracy of over 97% (we randomly sampled 100 triples and manually evaluated them). Table 2 provides the statistics of DuRecDial 2.0.
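The node-construction rules above can be sketched as a small decision function (a minimal illustration under our own assumptions; the function name, return labels, and tie-breaking policy are hypothetical, not the authors' actual tooling):

```python
from collections import Counter

def choose_english_name(candidates):
    """Resolve the English name of a knowledge-graph node from the names
    crawled across several websites (illustrative sketch).

    Returns (name, action):
      - 'auto'      : at least two websites agree, use that name directly
      - 'annotate'  : crawled names differ, an annotator must choose one
      - 'translate' : nothing crawled, annotators translate the Chinese node
    """
    counts = Counter(n for n in candidates if n)
    if not counts:
        return None, "translate"
    name, freq = counts.most_common(1)[0]
    if freq >= 2:
        return name, "auto"
    return None, "annotate"
```

In this sketch the crowdsourced choice and translation steps are only represented by the returned action labels; the actual decisions are made by annotators.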
Parallel user profiles A user profile contains personal information (e.g., name, gender, age, residence city, occupation, etc.) and the user's preferences for domains and entities. The personal information is translated directly by multiple crowdsourced annotators. The preferences for domains and entities are replaced based on the parallel knowledge graph constructed above and then revised by crowdsourced annotators.
Parallel task templates The task templates contain: 1) a goal sequence, where each goal consists of two elements, a dialog type and a dialog topic, corresponding to a sub-dialog; 2) a detailed description of each goal. We create parallel task templates by 1) replacing the dialog type and topic based on the parallel knowledge graph constructed above, and 2) translating the goal descriptions.
Parallel conversation situation The construction of the parallel conversation situation also includes two steps: 1) decompose the situation into chat time, place, and topic; 2) multiple crowdsourced annotators translate the chat time, place, and topic to construct the parallel conversation situation.
3.2 Conversation Utterance Collection
To guarantee the quality of translation, we use a strict quality control procedure.
First, before translation, all entities in all utterances are replaced based on the parallel knowledge graph constructed above to ensure knowledge accuracy.
Then, we randomly sample 100 conversations (about 1,500 utterances) and assign them to more than 100 professional translators. After translation, all translation results are assessed 1-3 times by 3 data specialists with translation experience. Specifically, the data specialists randomly select 20% of each translator's results for assessment at the word, utterance, and session levels. At the word level, they assess whether entities are consistent with the knowledge graph, whether the choice of words is appropriate, and whether there are typos. At the utterance level, they assess whether the utterance is accurate, colloquial, and free of redundancy. At the session level, they assess whether the session is coherent and parallel to DuRecDial Liu et al. (2020b). Translators whose error rate exceeds 10% are no longer allowed to translate; translators whose error rate exceeds 3% are asked to fix the errors. After this second-round translation, we conduct another assessment: translators with an error rate below 2% pass directly, otherwise they are assessed a third time, where only an error rate below 1% can pass. Finally, we pick 23 translators.
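The multi-round assessment thresholds described above can be summarized as a small decision function (a sketch of the stated rules only; the actual workflow is manual and the function name and labels are hypothetical):

```python
def assessment_outcome(error_rate, round_no):
    """Map a translator's error rate in a given assessment round to an
    outcome, following the thresholds described in the text.

    Round 1: >10% -> rejected; >3% -> must fix errors and be reassessed;
             otherwise pass.
    Round 2: <2% -> pass; otherwise a third assessment.
    Round 3: <1% -> pass; otherwise rejected.
    """
    if round_no == 1:
        if error_rate > 0.10:
            return "rejected"
        if error_rate > 0.03:
            return "fix_and_reassess"
        return "pass"
    if round_no == 2:
        return "pass" if error_rate < 0.02 else "third_assessment"
    return "pass" if error_rate < 0.01 else "rejected"
```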
Finally, the 23 translators translate about 1,000 utterances at a time based on the parallel user profiles, knowledge graph, task templates, and conversation situations. After data translation, the data specialists randomly select 10-20% of each translator's results for assessment in the same way as above. A translator can continue to translate only after passing the assessment.
| DuRecDial 2.0 | Count |
|---|---|
| #Parallel dialogs | 16,482 |
| #Parallel sub-dialogs for QA/Rec/task/chitchat | 11,326/13,640/5,198/16,482 |
| #Parallel entities recommended/accepted/rejected | 17,354/13,476/3,878 |
3.3 Related Knowledge Triples Annotation
Due to the complexity of this task and the massive number of knowledge triples attached to each dialog, knowledge selection and goal planning are very challenging. In addition to translating dialog utterances, the annotators were therefore also required to record the related knowledge triples whenever an utterance was generated according to some triples.
4 Dataset Analysis
4.1 Data statistics and quality
Table 2 provides statistics of DuRecDial 2.0 and its knowledge graph, indicating a rich variability of dialog types and domains. Following the evaluation method in previous work Liu et al. (2020b), we conduct human evaluations of data quality: a dialog is rated "1" if it wholly follows the instructions in the task templates and its utterances are grammatically correct and fluent, and "0" otherwise. We ask three persons to judge the quality of 200 randomly sampled dialogs and obtain an average score of 0.93 on this evaluation set.
4.2 Prefixes of utterances
As human-bot conversations are very diversified in real-world applications, we expect a richer variability of utterances to mimic real-world application scenarios. Figures 2(a) and 2(b) show the distributions of frequent trigram prefixes. We find that nearly all prefixes of utterances in REDIAL Li et al. (2018) are Hello, Hi, and Hey, while the prefixes of utterances in DuRecDial 2.0 are more diversified. For example, several sectors indicated by the prefixes Do, What, Who, How, Please, Play, and I are frequent in DuRecDial 2.0 but completely absent in REDIAL Li et al. (2018), indicating that DuRecDial 2.0 has a more flexible language style.
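A trigram-prefix distribution of the kind plotted in Figure 2 can be computed along these lines (a sketch; the tokenization actually used for the figure may differ):

```python
from collections import Counter

def prefix_distribution(utterances, n=3):
    """Count the n-gram (here trigram) prefixes of a list of utterances
    and return them sorted by frequency, most common first."""
    counts = Counter()
    for utt in utterances:
        tokens = utt.split()
        if len(tokens) >= n:
            counts[tuple(tokens[:n])] += 1
    return counts.most_common()
```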
5 Task Formulation on DuRecDial 2.0
Let $\mathcal{D}_u = \{d_1, \dots, d_{n_u}\}$ denote the set of dialogs by a seeker $u$, where $n_u$ is the number of dialogs by the seeker $u$, and $m$ is the number of seekers. Recall that we attach each dialog (say $d$) with an updated seeker profile (denoted as $\mathcal{P}$), a knowledge graph $\mathcal{K}$, and a goal sequence $\mathcal{G} = \{g_1, \dots, g_T\}$, where each goal $g_t = (k_t, ty_t, tp_t)$: $k_t$ is several knowledge triples, $ty_t$ is a candidate dialog type, and $tp_t$ is a candidate dialog topic. Given a context $X$ with utterances from the dialog $d$, $\mathcal{K}$, and $\mathcal{G}$, the aim is to produce a proper response $Y$ for completion of the goal $g_t$.
Monolingual conversational recommendation:
Task 1: $(X_{en}, \mathcal{K}_{en}, \mathcal{G}_{en}) \rightarrow Y_{en}$ or Task 2: $(X_{zh}, \mathcal{K}_{zh}, \mathcal{G}_{zh}) \rightarrow Y_{zh}$. With these two monolingual conversational recommendation forms, we can investigate the performance variation of the same model trained on two separate datasets in different languages. In our experiments, we train two conversational recommendation models, one for each monolingual task. We can then evaluate their performance variation across English and Chinese to see how the change of language affects model performance.
Multilingual conversational recommendation:
Task 3: $(X_{en}, \mathcal{K}_{en}, \mathcal{G}_{en}) \rightarrow Y_{en}$ and $(X_{zh}, \mathcal{K}_{zh}, \mathcal{G}_{zh}) \rightarrow Y_{zh}$. Similar to multilingual neural machine translation Johnson et al. (2017) and multilingual reading comprehension Jing et al. (2019), we directly mix training instances of the two languages into a single training set and train a single model to handle both English and Chinese conversational recommendation at the same time. This task setting helps us investigate whether the use of additional training data in another language can bring performance benefits for a model of the current language.
Cross-lingual conversational recommendation:
The two forms of cross-lingual conversational recommendation are Task 4: $(X_{zh}, \mathcal{K}_{en}, \mathcal{G}_{en}) \rightarrow Y_{en}$ and Task 5: $(X_{en}, \mathcal{K}_{zh}, \mathcal{G}_{zh}) \rightarrow Y_{zh}$, where, given related goals and knowledge (e.g., in English), the model takes the dialog context in one language (e.g., Chinese) as input and then produces responses in another language (e.g., English) as output. Understanding a mixed-language dialog context is a desirable skill for end-to-end dialog systems. This task setting helps evaluate whether a model has the capability to perform this kind of cross-lingual task.
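The language configuration of the five tasks above can be summarized as follows (an illustrative helper; the tuple layout — context language, knowledge-and-goal language, response language — is our own convention, not notation from the paper):

```python
def task_languages(task):
    """Return the (context, knowledge+goal, response) languages for each of
    the five tasks. Task 3 mixes both monolingual settings into one
    training set, so it is listed as two tuples."""
    table = {
        1: [("en", "en", "en")],                      # monolingual English
        2: [("zh", "zh", "zh")],                      # monolingual Chinese
        3: [("en", "en", "en"), ("zh", "zh", "zh")],  # multilingual mix
        4: [("zh", "en", "en")],                      # cross-lingual ZH->EN
        5: [("en", "zh", "zh")],                      # cross-lingual EN->ZH
    }
    return table[task]
```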
6 Experiments and Results
6.1 Experiment Setting
Dataset For the train/development/test sets, we follow the split of Liu et al. (2020b), with the one notable difference that we discard the dialogs that include news.
Automatic Evaluation Metrics: For automatic evaluation from the viewpoint of conversation, we follow the setting in previous work Liu et al. (2020b) and use several common metrics such as F1, BLEU (BLEU1 and BLEU2) Papineni et al. (2002), and DISTINCT (DIST-1 and DIST-2) Li et al. (2016) to measure the relevance, fluency, and diversity of generated responses. Moreover, we also evaluate the knowledge-selection capability of each model by calculating knowledge precision/recall/F1 scores as done in Wu et al. (2019); Liu et al. (2020b), comparing the generated results with the correct knowledge. In addition, to evaluate recommendation effectiveness, we design two automatic metrics. First, to measure how well a model can lead the whole dialog toward a recommendation target, we design a dialog-Leading Success rate (LS), which calculates the percentage of times a dialog successfully reaches or mentions the target after a few dialog turns. (We convert each multi-turn dialog into multiple (context, response) pairs, generate a response for each context, and then evaluate LS and UTC on the generated conversation at the dialog level.) Second, to measure how well a model can respond to new topics raised by users, we design a User-Topic Consistency rate (UTC), which calculates the percentage of times the model successfully follows new topics mentioned by users; a topic counts as successfully followed if the generated response is coherent with the new topic mentioned by the user.
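As a concrete illustration, knowledge precision/recall/F1 over sets of triples can be computed as below (a simplified sketch; the paper's exact matching granularity between generated results and the correct knowledge may differ):

```python
def knowledge_prf(pred_triples, gold_triples):
    """Knowledge precision/recall/F1 between the triples expressed in a
    generated response and the gold (correct) knowledge triples."""
    pred, gold = set(pred_triples), set(gold_triples)
    tp = len(pred & gold)                       # correctly produced triples
    p = tp / len(pred) if pred else 0.0         # precision over predictions
    r = tp / len(gold) if gold else 0.0         # recall over gold triples
    f1 = 2 * p * r / (p + r) if p + r else 0.0  # harmonic mean
    return p, r, f1
```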
Human Evaluation Metrics: The human evaluation is conducted at the level of both turns and dialogs.
For turn-level human evaluation, we ask each model to produce a response conditioned on a given context, goal, and related knowledge. The generated responses are evaluated by three persons in terms of fluency, appropriateness, informativeness, proactivity, and knowledge accuracy (please see the supplemental material for more details).
For dialogue-level human evaluation, we let each model converse with humans and proactively make recommendations when given goals and reference knowledge. For each model, we collect 30 dialogs. These dialogs are then evaluated by three persons in terms of two metrics: (1) coherence, which examines the fluency, relevancy, and logical consistency of each response given the current goal and context, and (2) recommendation success rate, which measures the percentage of times users finally accept the recommendation at the end of a dialog.
The evaluators rate the dialogs on a scale of 0 (poor) to 2 (good) for each human metric except recommendation success rate (please see the supplemental material for more details).
| Tasks | Methods | F1 | BLEU1/BLEU2 | DIST-1/DIST-2 | Knowledge P/R/F1 | LS | UTC |
|---|---|---|---|---|---|---|---|
| 1 (EN->EN) | XNLG | 43.78% | 0.202/0.123 | 0.016/0.053 | 0.173/0.211/0.179 | 15.19% | 11.44% |
| 3 (EN->EN) | XNLG | 42.03% | 0.199/0.131 | 0.008/0.021 | 0.171/0.207/0.173 | 10.03% | 11.31% |
| 3 (ZH->ZH) | XNLG | 36.61% | 0.322/0.208 | 0.006/0.020 | 0.324/0.393/0.351 | 22.03% | 20.31% |
| 4 (ZH->EN) | XNLG | 41.98% | 0.201/0.129 | 0.019/0.075 | 0.123/0.162/0.139 | 14.81% | 13.61% |
| 5 (EN->ZH) | XNLG | 36.53% | 0.323/0.202 | 0.014/0.052 | 0.308/0.394/0.318 | 21.43% | 20.06% |
| 1 (EN->EN) | mBART | 66.96% | 0.285/0.195 | 0.018/0.057 | 0.276/0.313/0.285 | 21.03% | 20.26% |
| 3 (EN->EN) | mBART | 63.69% | 0.254/0.168 | 0.008/0.023 | 0.253/0.300/0.266 | 13.11% | 18.55% |
| 3 (ZH->ZH) | mBART | 46.23% | 0.368/0.237 | 0.006/0.024 | 0.432/0.499/0.451 | 35.81% | 31.99% |
| 4 (ZH->EN) | mBART | 64.31% | 0.267/0.185 | 0.027/0.084 | 0.229/0.261/0.236 | 21.37% | 22.79% |
| 5 (EN->ZH) | mBART | 53.55% | 0.392/0.304 | 0.026/0.097 | 0.421/0.514/0.439 | 37.04% | 30.86% |
| Tasks | Methods | F1 | BLEU1/BLEU2 | DIST-1/DIST-2 | Knowledge P/R/F1 | LS | UTC |
|---|---|---|---|---|---|---|---|
| 1 (EN->EN) | XNLG | 49.66% | 0.265/0.173 | 0.018/0.050 | 0.244/0.291/0.260 | 16.31% | 15.18% |
| 3 (EN->EN) | XNLG | 44.15% | 0.202/0.142 | 0.009/0.021 | 0.173/0.211/0.185 | 13.01% | 12.31% |
| 3 (ZH->ZH) | XNLG | 36.62% | 0.329/0.182 | 0.008/0.023 | 0.328/0.405/0.359 | 22.29% | 20.62% |
| 4 (ZH->EN) | XNLG | 45.75% | 0.239/0.171 | 0.013/0.036 | 0.217/0.259/0.211 | 14.55% | 15.03% |
| 5 (EN->ZH) | XNLG | 36.77% | 0.330/0.203 | 0.011/0.053 | 0.331/0.393/0.355 | 21.57% | 20.17% |
| 1 (EN->EN) | mBART | 68.38% | 0.325/0.245 | 0.017/0.054 | 0.350/0.396/0.362 | 28.90% | 24.77% |
| 3 (EN->EN) | mBART | 64.38% | 0.268/0.192 | 0.007/0.024 | 0.307/0.367/0.325 | 16.11% | 20.55% |
| 3 (ZH->ZH) | mBART | 46.37% | 0.366/0.241 | 0.006/0.025 | 0.412/0.493/0.436 | 36.31% | 32.59% |
| 4 (ZH->EN) | mBART | 67.43% | 0.314/0.231 | 0.013/0.040 | 0.328/0.379/0.343 | 24.26% | 23.83% |
| 5 (EN->ZH) | mBART | 55.69% | 0.430/0.325 | 0.019/0.077 | 0.455/0.536/0.476 | 38.67% | 32.11% |
Turn-level results cover Fluency, Appro., Infor., Proactivity, and Know. Acc.; dialog-level results cover Coherence and Rec. success rate.

| Tasks | Methods | Fluency | Appro. | Infor. | Proactivity | Know. Acc. | Coherence | Rec. success rate |
|---|---|---|---|---|---|---|---|---|
6.2 Baselines
XNLG Chi et al. (2020) is a cross-lingual pre-trained model with both monolingual and cross-lingual objectives; it updates the parameters of the encoder and decoder through auto-encoding and auto-regressive tasks to transfer monolingual NLG supervision to other pre-trained languages. When the target language is the same as the language of the training data, we fine-tune the parameters of both the encoder and the decoder. When the target language is different from the language of the training data, we fine-tune the parameters of the encoder only. The objective of fine-tuning the encoder is to minimize:

$$\mathcal{L}_{\text{enc}} = \mathcal{L}_{\text{XAE}}(\mathcal{D}_p) + \mathcal{L}_{\text{DAE}}(\mathcal{D}_m)$$

where $\mathcal{L}_{\text{XAE}}$ and $\mathcal{L}_{\text{DAE}}$ are the same as in XNLG, $\mathcal{D}_p$ indicates the parallel corpus, and $\mathcal{D}_m$ is the monolingual corpus.
The objective of fine-tuning the decoder is to minimize:

$$\mathcal{L}_{\text{dec}} = \mathcal{L}_{\text{XAE}} + \mathcal{L}_{\text{DAE}}$$

where $\mathcal{L}_{\text{XAE}}$ and $\mathcal{L}_{\text{DAE}}$ are the same as in XNLG.
mBART Liu et al. (2020a) is a multilingual sequence-to-sequence (Seq2Seq) denoising auto-encoder pre-trained on a subset of 25 languages – CC25 – extracted from the Common Crawl (CC) Wenzek et al. (2020); Conneau et al. (2020). It provides a set of parameters that can be fine-tuned for any of the language pairs in CC25 including English and Chinese. Loading mBART initialization can provide performance gains for monolingual/multilingual/cross-lingual tasks and serves as a strong baseline.
We treat our five tasks as machine translation (MT) tasks. Specifically, the context, knowledge, and goals are concatenated as the source-language input, which can be monolingual, multilingual, or cross-lingual text; the corresponding response is then generated as the target-language output. Since the response can be in either language, we also concatenate a language identifier for the response to the source input. Concretely, if the response is in English, the identifier is EN, otherwise ZH, no matter what language the source input is in. We finally fine-tune the mBART model on our five tasks respectively.
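The input construction described above can be sketched as follows (only the leading language identifier is described in the text; the separator tokens, field order, and function name are illustrative assumptions):

```python
def build_source(context_utts, knowledge_triples, goal, resp_lang):
    """Concatenate goal, knowledge, and dialog context into one source
    sequence, prefixed with the identifier of the response language."""
    lang_id = "EN" if resp_lang == "en" else "ZH"
    # Flatten each (subject, relation, object) triple into a token string.
    knowledge = " ".join(" ".join(t) for t in knowledge_triples)
    # Join the dialog turns with a separator token.
    context = " [SEP] ".join(context_utts)
    return f"{lang_id} [GOAL] {goal} [KNOWLEDGE] {knowledge} [CONTEXT] {context}"
```

The resulting string is fed to the seq2seq model as the source side; the response in the identified language is the target side.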
6.3 Experiment Results
Tables 3 and 4 present automatic evaluation results on an automatically translated parallel corpus (translated with https://fanyi.baidu.com/) and on the human-translated parallel corpus (DuRecDial 2.0), respectively. Table 5 provides human evaluation results on DuRecDial 2.0.
Automatic Translation vs. Human Translation: As shown in Tables 3 and 4, the XNLG Chi et al. (2020) and mBART Liu et al. (2020a) models trained with the human-translated parallel corpus (DuRecDial 2.0) are both better than those trained with the machine-translated parallel corpus across almost all the tasks. A possible reason is that automatic translation may contain many translation errors, which makes effective learning harder for the models.
English vs. Chinese: As shown in Tables 4 and 5, the results on the Chinese-related tasks (Tasks 2, 3(ZH->ZH), 5) are better than those on the English-related tasks (Tasks 1, 3(EN->EN), 4) in terms of almost all the metrics, except F1 and DIST-1/DIST-2. A possible reason is that: (1) most entities in this dataset come from the domain of Chinese movies and famous Chinese entertainers, which differ considerably from the entities in the English pretraining corpora used for XNLG or mBART; (2) the pretrained models therefore perform poorly at modeling these entities in utterances, resulting in knowledge errors in responses (e.g., the agent might mention incorrect entities that are not relevant to the current topic), since some entities might never appear in the English pretraining corpora. The accuracy of generated entities in responses is crucial to model performance in terms of Knowledge P/R/F1, LS, UTC, Know. Acc., Coherence, and Rec. success rate; incorrect entities in generated responses therefore deteriorate these metrics for the English-related tasks.
Monolingual vs. Multilingual: Based on the results in Tables 4 and 5, the model for the multilingual Chinese task (Task 3(ZH->ZH)) is better than the monolingual Chinese model (Task 2) in terms of almost all the metrics (except DISTINCT and knowledge accuracy). This indicates that the use of additional English corpora can slightly improve model performance for Chinese conversational recommendation. A possible reason is that the additional English data implicitly expands the training data size for the Chinese-related tasks through the bilingual training paradigm of XNLG or mBART, which strengthens the capability of generating correct entities for a given dialog context; the Chinese-related models can then generate correct entities in responses more frequently, leading to better performance.
However, the model for the multilingual English task (Task 3(EN->EN)) cannot outperform the monolingual English model (Task 1). A possible reason is that the pretrained models do not model the entities in dialog utterances well, resulting in poor performance.
Monolingual vs. Cross-lingual: According to the results in Tables 4 and 5, the model for the EN->ZH cross-lingual task (Task 5) performs surprisingly better than the monolingual Chinese model (Task 2) in terms of all the automatic and human metrics (except Fluency) (sign test, p-value < 0.05). This indicates that the use of bilingual corpora can consistently bring performance improvements for Chinese conversational recommendation. One possible reason is that XNLG or mBART can fully exploit the bilingual dataset, which strengthens the capability of generating correct entities in responses for the Chinese-related tasks. Moreover, we notice that model performance improves further from the multilingual setting to the cross-lingual setting; the reason for this result will be investigated in future work.
However, the ZH->EN cross-lingual model (Task 4) cannot outperform the monolingual English model (Task 1), which is consistent with the results in the multilingual setting.
XNLG vs. mBART: According to the evaluation results in Tables 3, 4, and 5, mBART Liu et al. (2020a) outperforms XNLG Chi et al. (2020) across almost all tasks and metrics. The main reason is that mBART has more model parameters and uses more parallel corpora for training than XNLG.
7 Conclusion
To facilitate the study of multilingual and cross-lingual conversational recommendation, we create a bilingual parallel dataset, DuRecDial 2.0, and define five tasks on it. We further establish baselines for monolingual, multilingual, and cross-lingual conversational recommendation. Automatic and human evaluation results show that our bilingual dataset, DuRecDial 2.0, can bring performance improvements for Chinese conversational recommendation. Besides, DuRecDial 2.0 provides a challenging testbed for future studies of monolingual, multilingual, and cross-lingual conversational recommendation. In future work, we will investigate combining multilinguality with few-shot (or zero-shot) learning to see if it can help dialog tasks in low-resource languages.
8 Ethical Considerations
Acknowledgments
Thanks for the insightful comments from reviewers and the support of dataset construction from Ying Chen. This work is supported by the National Key Research and Development Project of China (No.2018AAA0101900) and the Natural Science Foundation of China (No. 61976072).
References
- Akbik et al. (2015) Alan Akbik, Laura Chiticariu, Marina Danilevsky, Yunyao Li, Shivakumar Vaithyanathan, and Huaiyu Zhu. 2015. Generating high quality proposition banks for multilingual semantic role labeling. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 397–407, Beijing, China. Association for Computational Linguistics.
- Artetxe et al. (2020) Mikel Artetxe, Sebastian Ruder, and Dani Yogatama. 2020. On the cross-lingual transferability of monolingual representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4623–4637, Online. Association for Computational Linguistics.
- Chi et al. (2020) Zewen Chi, Li Dong, Furu Wei, Wenhui Wang, Xian-Ling Mao, and Heyan Huang. 2020. Cross-lingual natural language generation via pre-training. ArXiv, abs/1909.10481.
- Christakopoulou et al. (2016) Konstantina Christakopoulou, Katja Hofmann, and Filip Radlinski. 2016. Towards conversational recommender systems. In KDD.
- Clark et al. (2020) Jonathan H. Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki. 2020. TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages. Transactions of the Association for Computational Linguistics, 8:454–470.
- Conneau et al. (2020) Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics.
- Conneau et al. (2018) Alexis Conneau, Guillaume Lample, Ruty Rinott, Adina Williams, Samuel R. Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. Xnli: Evaluating cross-lingual sentence representations. In EMNLP.
- Dinan et al. (2019) Emily Dinan, V. Logacheva, Valentin Malykh, Alexander H. Miller, Kurt Shuster, Jack Urbanek, Douwe Kiela, Arthur D. Szlam, Iulian Serban, Ryan Lowe, Shrimai Prabhumoye, A. Black, Alexander I. Rudnicky, J. Williams, Joelle Pineau, M. Burtsev, and J. Weston. 2019. The second conversational intelligence challenge (convai2). ArXiv, abs/1902.00098.
- Dodge et al. (2016) Jesse Dodge, Andreea Gane, Xiang Zhang, Antoine Bordes, Sumit Chopra, Alexander H. Miller, Arthur Szlam, and Jason Weston. 2016. Evaluating prerequisite qualities for learning end-to-end dialog systems. In ICLR.
- Hardalov et al. (2020) Momchil Hardalov, Todor Mihaylov, Dimitrina Zlatkova, Yoan Dinkov, Ivan Koychev, and Preslav Nakov. 2020. EXAMS: A multi-subject high school examinations dataset for cross-lingual and multilingual question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5427–5444, Online. Association for Computational Linguistics.
- Hayati et al. (2020) Shirley Anugrah Hayati, Dongyeop Kang, Qingxiaoyang Zhu, Weiyan Shi, and Zhou Yu. 2020. Inspired: Toward sociable recommendation dialog systems. In EMNLP.
- Hu et al. (2020) J. Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and M. Johnson. 2020. Xtreme: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization. ArXiv, abs/2003.11080.
- Jing et al. (2019) Yimin Jing, Deyi Xiong, and Zhen Yan. 2019. BiPaR: A bilingual parallel dataset for multilingual and cross-lingual reading comprehension on novels. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2452–2462, Hong Kong, China. Association for Computational Linguistics.
- Johnson et al. (2017) Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google’s multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.
- Kang et al. (2019) Dongyeop Kang, Anusha Balakrishnan, Pararth Shah, Paul Crook, Y-Lan Boureau, and Jason Weston. 2019. Recommendation as a communication game: Self-supervised bot-play for goal-oriented dialogue. In EMNLP.
- Klementiev et al. (2012) A. Klementiev, Ivan Titov, and Binod Bhattarai. 2012. Inducing crosslingual distributed representations of words. In COLING.
- Lewis et al. (2004) David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. 2004. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361–397.
- Lewis et al. (2020) Patrick Lewis, Barlas Oğuz, Ruty Rinott, S. Riedel, and Holger Schwenk. 2020. Mlqa: Evaluating cross-lingual extractive question answering. ArXiv, abs/1910.07475.
- Li et al. (2021) Haoran Li, Abhinav Arora, Shuohui Chen, Anchit Gupta, Sonal Gupta, and Yashar Mehdad. 2021. MTOP: A comprehensive multilingual task-oriented semantic parsing benchmark. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 2950–2962, Online. Association for Computational Linguistics.
- Li et al. (2016) Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A diversity-promoting objective function for neural conversation models. In NAACL-HLT, pages 110–119.
- Li et al. (2018) Raymond Li, Samira Ebrahimi Kahou, Hannes Schulz, Vincent Michalski, Laurent Charlin, and Chris Pal. 2018. Towards deep conversational recommendations. In NIPS.
- Lin et al. (2020) Zhaojiang Lin, Zihan Liu, Genta Indra Winata, Samuel Cahyawijaya, Andrea Madotto, Yejin Bang, Etsuko Ishii, and Pascale Fung. 2020. Xpersona: Evaluating multilingual personalized chatbot. ArXiv, abs/2003.07568.
- Liu et al. (2020a) Yinhan Liu, Jiatao Gu, Naman Goyal, X. Li, Sergey Edunov, Marjan Ghazvininejad, M. Lewis, and Luke Zettlemoyer. 2020a. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics, 8:726–742.
- Liu et al. (2020b) Zeming Liu, Haifeng Wang, Zheng-Yu Niu, Hua Wu, Wanxiang Che, and Ting Liu. 2020b. Towards conversational recommendation over multi-type dialogs. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
- Moon et al. (2019) Seungwhan Moon, Pararth Shah, Anuj Kumar, and Rajen Subba. 2019. Opendialkg: Explainable conversational reasoning with attention-based walks over knowledge graphs. In ACL.
- Mrkšić et al. (2017a) Nikola Mrkšić, Diarmuid Ó Séaghdha, Tsung-Hsien Wen, Blaise Thomson, and Steve Young. 2017a. Neural belief tracker: Data-driven dialogue state tracking. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1777–1788, Vancouver, Canada. Association for Computational Linguistics.
- Mrkšić et al. (2017b) Nikola Mrkšić, Ivan Vulić, Diarmuid Ó Séaghdha, Ira Leviant, Roi Reichart, Milica Gašić, Anna Korhonen, and Steve Young. 2017b. Semantic specialization of distributional word vector spaces using monolingual and cross-lingual constraints. Transactions of the Association for Computational Linguistics, 5:309–324.
- Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In ACL, pages 311–318.
- Schuster et al. (2019a) Sebastian Schuster, S. Gupta, Rushin Shah, and M. Lewis. 2019a. Cross-lingual transfer learning for multilingual task oriented dialog. ArXiv, abs/1810.13327.
- Schuster et al. (2019b) Sebastian Schuster, Sonal Gupta, Rushin Shah, and Mike Lewis. 2019b. Cross-lingual transfer learning for multilingual task oriented dialog. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3795–3805, Minneapolis, Minnesota. Association for Computational Linguistics.
- Schwenk and Li (2018) Holger Schwenk and Xian Li. 2018. A corpus for multilingual document classification in eight languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).
- Sun and Zhang (2018) Yueming Sun and Yi Zhang. 2018. Conversational recommender system. In SIGIR.
- Warnestal (2005) Pontus Warnestal. 2005. Modeling a dialogue strategy for personalized movie recommendations. In The Beyond Personalization Workshop.
- Wenzek et al. (2020) Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Edouard Grave. 2020. CCNet: Extracting high quality monolingual datasets from web crawl data. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4003–4012, Marseille, France. European Language Resources Association.
- Wu et al. (2019) Wenquan Wu, Zhen Guo, Xiangyang Zhou, Hua Wu, Xiyuan Zhang, Rongzhong Lian, and Haifeng Wang. 2019. Proactive human-machine conversation with explicit conversation goal. In ACL.
- Zhou et al. (2020) Kun Zhou, Y. Zhou, Wayne Xin Zhao, X. Wang, and Jirong Wen. 2020. Towards topic-guided conversational recommender system. ArXiv, abs/2010.04125.
1. Turn-level Human Evaluation Guideline
Fluency measures the fluency of each response:
score 0 (bad): not fluent and difficult to understand.
score 1 (fair): the response contains some errors but can still be understood.
score 2 (good): fluent and easy to understand.
Appropriateness examines the relevance of each response given the current goal and local context:
score 0 (bad): not relevant to the current goal and context.
score 1 (fair): relevant to the current goal and context, but uses some irrelevant knowledge.
score 2 (good): otherwise.
Informativeness examines how much knowledge (goal topics and topic attributes) is provided in responses:
score 0 (bad): no knowledge is mentioned at all.
score 1 (fair): only one knowledge triple is mentioned in the response.
score 2 (good): more than one knowledge triple is mentioned in the response.
Proactivity measures how well the model can introduce new topics with good fluency and relevance:
score 0 (bad): new topics are introduced but are irrelevant to the context.
score 1 (fair): no new topics/knowledge are introduced.
score 2 (good): new topics relevant to the context are introduced.
Knowledge accuracy evaluates correctness of the knowledge in responses:
score 0 (bad): all knowledge used is wrong, or no knowledge is used.
score 1 (fair): part of the knowledge used is correct.
score 2 (good): all knowledge used is correct.
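Each turn-level metric above yields a score in {0, 1, 2} per annotated response. A minimal sketch of how such annotations might be averaged per metric (the function and field names are illustrative, not from the paper):

```python
# Hypothetical aggregation of turn-level human evaluation scores.
# Each annotation is a dict mapping a metric name to a score in {0, 1, 2}.
from collections import defaultdict

def average_turn_scores(annotations):
    """Return the mean score per metric over all annotated turns."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for ann in annotations:
        for metric, score in ann.items():
            totals[metric] += score
            counts[metric] += 1
    return {metric: totals[metric] / counts[metric] for metric in totals}

scores = average_turn_scores([
    {"fluency": 2, "informativeness": 1},
    {"fluency": 1, "informativeness": 2},
])
# scores["fluency"] == 1.5
```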
2. Dialogue-level Human Evaluation Guideline
Coherence measures the fluency, relevance, and logical consistency of each response given the current goal and global context:
score 0 (bad): more than two-thirds of the responses are irrelevant or logically contradictory to the given current goal and global context.
score 1 (fair): more than one-third (but no more than two-thirds) of the responses are irrelevant or logically contradictory to the given current goal and global context.
score 2 (good): otherwise.
Recommendation success rate measures the percentage of times users finally accept the recommendation at the end of a dialog:
score 0 (bad): the user does not accept the recommendation.
score 1 (good): the user finally accepts the recommendation.
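The success metric above reduces each dialog to a 0/1 acceptance flag, and the rate is the fraction of accepted recommendations. A hypothetical sketch, assuming outcomes are collected as such flags:

```python
def recommendation_success_rate(dialog_outcomes):
    """dialog_outcomes: list of 0/1 flags, where 1 means the user
    finally accepted the recommendation at the end of the dialog.
    Returns the fraction of successful dialogs."""
    if not dialog_outcomes:
        return 0.0
    return sum(dialog_outcomes) / len(dialog_outcomes)

rate = recommendation_success_rate([1, 0, 1, 1])
# rate == 0.75
```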
3. Case Study
Figure 3 shows conversations generated by mBART when conversing with humans, given the conversation goal and the related knowledge. As the examples show, using additional English data improves Chinese conversational recommendation, especially in terms of Knowledge P/R/F1.
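The Knowledge P/R/F1 metric is referenced but not defined in this appendix. A sketch under the standard set-overlap definition, assuming knowledge is represented as (subject, relation, object) triples; the paper's exact computation may differ:

```python
def knowledge_prf1(predicted, gold):
    """Set-overlap precision/recall/F1 over knowledge triples.
    predicted, gold: sets of (subject, relation, object) tuples."""
    overlap = len(predicted & gold)
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

p, r, f1 = knowledge_prf1(
    {("TitanicMovie", "director", "JamesCameron"), ("TitanicMovie", "year", "1998")},
    {("TitanicMovie", "director", "JamesCameron")},
)
# p == 0.5, r == 1.0
```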