Personalized dialogue generation is a popular research direction within natural language processing. For engagement, diverse and consistent responses (Song et al., 2020, 2021) are important factors, and persona information (Zhang et al., 2018) contributes to both. There are two types of persona information, namely the self persona and the partner persona. The former is a profile consisting of several sentences that represent the dialogue agent. Such a persona allows the agent to produce consistent responses rather than relying solely on persona information (Kim et al., 2020) that is randomly learned and embedded in the model parameters. The latter is also a profile, but one that represents the user. Leveraging such partner personas is helpful for dialogue generation (Gu et al., 2021). Therefore, we exploit partner personas for diverse dialogue generation.
Unfortunately, the user profile is commonly missing due to the cold-start problem (Li et al., 2020) when deploying online dialogue agents or serving newly registered users. Most prior works, if not all (Mazaré et al., 2018; Song et al., 2019; Gu et al., 2019; Zhao et al., 2019), have either overlooked the value of partner personas or focused on the impractical situation where partner personas are guaranteed to exist in both the training and inference stages. Our work demonstrates the importance of diverse partner personas generation, and we particularly investigate the practical situation where partner personas are missing at the inference stage. Such an investigation is essential, as there is no guarantee that ground truth partner personas always exist. Ultimately, our proposed framework produces even more diverse and engaging responses than our baseline that conditions on the ground truth partner personas. We present a case study in Section 5.2 that illustrates this advantage.
To the best of our knowledge, this is the first attempt to formulate partner personas prediction in a generative manner that boosts the performance of the succeeding dialogue generation. Our work is motivated by three underlying hypotheses: i) partner personas generation is plausible given the self personas and the dialogue context; ii) generated personas are more diverse and interesting than the retrieved ground truth; iii) such diverse generated personas help to produce diverse succeeding dialogue responses. Our automatic and human evaluation results support these hypotheses, and this paper paves the way for exploiting generative partner personas for diverse dialogue generation.
We develop a novel framework composed of three major components: a partner personas generator, a dialogue response generator and a critic network. The partner personas generator generates partner personas, which the dialogue response generator uses for succeeding dialogue response generation. We employ reinforcement learning with a dedicatedly designed critic network that propagates the reward back to both generators.
Prior works have investigated retrieval-based partner personas (Zhang et al., 2018; Song et al., 2019). The human-constructed ground truth personas serve as the upper bound for such retrieval-based systems, and we argue that the ground truth personas are not diverse enough. We observe that the generative counterpart produces relevant, informative and coherent partner personas, which further diversify the succeeding dialogue response generation. A further advantage is that our framework does not need an external database to retrieve from (Madotto et al., 2020; Xu et al., 2021).
One work close to ours is a multi-task framework for meta-learning (Lee et al., 2021) that uses personas reconstruction as an auxiliary task to improve response consistency. The differences are that their model does not differentiate between self personas and partner personas, while ours does; their model focuses on meta-learning, while ours sets no such constraint; and theirs reports an improvement in personality consistency only, while ours focuses on diverse dialogue generation. We conduct an empirical comparison with their model, reconstructing the partner personas. Experimental results indicate that such a multi-task model does not work well in our problem setting.
We compare our proposed framework with several competitive baselines. The automatic and human evaluation results indicate that our framework can generate even more diverse and engaging responses than the baseline conditioned on ground truth partner personas. This leads to the conclusions that i) partner personas generation is plausible; ii) the generated partner personas are more diverse than the ground truth partner personas; iii) our framework produces even more diverse and engaging responses than our competitive baselines that condition on the ground truth partner personas.
2 Related Work
2.1 Personalized Dialogue Generation
Conditioning on personas helps produce informative and engaging responses. The largest multi-turn dialogue dataset conditioned on personal profiles is PersonaChat, in which two crowdworkers converse and get to know each other. To better utilise the self personas in generating consistent responses, the community has proposed numerous methods. Mazaré et al. (2018) employs a pre-training stage based on dedicatedly extracted large-scale persona-based dialogues and fine-tunes the model on PersonaChat. Zhao et al. (2019) fuses information in personas and dialogue context into individual contextualised representations by attending to different parts of both. Gu et al. (2019) exploits the interaction between personas, dialogue context and response to improve retrieval-based dialogue agents. Lee et al. (2021) utilises multi-task learning for improved personality consistency in the meta-learning scenario. Gu et al. (2021) employs four different strategies for persona fusing, which learns to use the self persona and the partner persona more effectively. There have also been several works based on GPT (Wolf et al., 2019). However, most of these prior works focus on exploiting self personas rather than partner personas, and they have assumed that the ground truth partner personas are guaranteed to exist.
2.2 Reinforcement Learning
Reinforcement learning (RL), or specifically, policy gradient methods (Williams, 1992), have been frequently applied to both task-oriented dialogue agents (Roman Roman et al., 2020; Deng et al., 2021) and open-domain chitchat agents (Li et al., 2016; Saleh et al., 2019). RL can either propagate a non-differentiable loss (Cai et al., 2019) or optimize an expert-defined reward such as ease of answering (Li et al., 2016). It has also been adopted in a scenario where a user simulator and a dialogue agent interact, with an expert reward function defined to score each generated response (Roman Roman et al., 2020).
3 Proposed Framework
We develop a novel framework composed of three major components: a partner personas generator, a dialogue response generator, and a critic network used for reinforcement learning. Figure 1 depicts the inference flow of our setting. The input dialogue context with the self personas is first fed into the partner personas generator. The generated partner personas are then concatenated with the dialogue context and the self personas as the input to the dialogue response generator. We first train the partner personas generator and the dialogue response generator separately under supervised learning. In this training stage, we use the ground truth partner personas to train the dialogue response generator, and we replace them with generated partner personas at the inference stage. After the supervised learning stage, the second stage is a reinforcement learning stage which jointly optimizes both the partner personas generator and the dialogue response generator, as depicted in Figure 2. Such a framework has two advantages: i) the partner personas generator can be trained directly with a reward signal that is relevant to dialogue response generation; ii) the dialogue response generator trained on ground truth partner personas can be further fine-tuned on the generated partner personas. (Section 5.4 presents an ablation study on reinforcement learning that supports this claim.) In particular, we employ a dedicatedly designed critic network that receives the generated partner personas and the generated dialogue response as input, and outputs a reward that measures the relevance between the generated personas and response, which is propagated back to the generators.
3.1 Partner Personas Generation
A Seq2Seq neural network (Sutskever et al., 2014) is adopted as our partner personas generator for the task of partner personas generation (PPG). The concatenation of the dialogue context c and the self personas s is fed as input into the partner personas generator. The generator then outputs an approximated partner personas \hat{p} conditioned on the input, which maximises the following conditional likelihood:

P(\hat{p} \mid c, s) = \prod_{t=1}^{|\hat{p}|} P(\hat{p}_t \mid \hat{p}_{<t}, c, s)

For training, the ground truth partner personas p is used, and we train our generator under maximum likelihood estimation:

\mathcal{L}_{\mathrm{PPG}} = - \sum_{t=1}^{|p|} \log P(p_t \mid p_{<t}, c, s)
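The teacher-forced MLE objective above reduces to summing the negative log-probabilities of the ground-truth partner-persona tokens. A minimal stdlib-only sketch, where the per-token probabilities are illustrative toy values rather than outputs of the actual model:

```python
import math

def sequence_nll(token_probs):
    """Negative log-likelihood of a target sequence under teacher forcing:
    -sum_t log P(p_t | p_<t, c, s), given the model's probability
    for each ground-truth token."""
    return -sum(math.log(p) for p in token_probs)

# Hypothetical probabilities the PPG assigns to three ground-truth
# partner-persona tokens, conditioned on dialogue context c and self personas s.
loss = sequence_nll([0.9, 0.5, 0.8])
```

In a real implementation these probabilities come from the decoder's softmax at each step; the loss is then minimized with gradient descent.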
3.2 Dialogue Response Generation
We also adopt a Seq2Seq neural network for the task of dialogue response generation (DRG). The concatenation of the dialogue context c, the self personas s, and the partner personas p is fed as input into the dialogue response generator. The response generator then outputs an approximated dialogue response \hat{r} conditioned on the input, which maximises the following conditional likelihood:

P(\hat{r} \mid c, s, p) = \prod_{t=1}^{|\hat{r}|} P(\hat{r}_t \mid \hat{r}_{<t}, c, s, p)

For training, the ground truth partner personas p and the ground truth dialogue response r are used, and we train our generator under maximum likelihood estimation:

\mathcal{L}_{\mathrm{DRG}} = - \sum_{t=1}^{|r|} \log P(r_t \mid r_{<t}, c, s, p)

We use the ground truth partner personas p for training and the generated partner personas \hat{p} for inference.
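The only difference between the DRG's training and inference inputs is which partner personas are concatenated. A sketch of the input construction; the separator token and the example sentences are illustrative assumptions, not the paper's exact serialization format:

```python
def build_drg_input(context, self_personas, partner_personas, sep=" <sep> "):
    """Concatenate dialogue context, self personas, and partner personas into
    a single DRG input string. At training time `partner_personas` holds the
    ground truth p; at inference time, the generated p-hat."""
    return sep.join([context,
                     " ".join(self_personas),
                     " ".join(partner_personas)])

# Training: ground truth partner personas.
train_input = build_drg_input("hi , how are you ?",
                              ["i like to watch tv ."],
                              ["i live in an apartment ."])
# Inference: generated partner personas replace the ground truth.
infer_input = build_drg_input("hi , how are you ?",
                              ["i like to watch tv ."],
                              ["i love the outdoors ."])
```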
3.3 Reinforcement Learning
We employ a critic network, the core of our reinforcement learning (RL) algorithm, to reward our RL agents. We train a binary classifier as our critic network by extracting sub-training-instances (p_i, r_i), where each pair of ground truth partner personas and ground truth response represents a positive training sample. We then randomly sample two distinct positive sub-instances (p_i, r_i) and (p_j, r_j), i \neq j. Two negative samples can then be derived by mismatching the pairs: (p_i, r_j) and (p_j, r_i). Thereafter, we fine-tune a binary classifier as our critic on this training set by minimizing the following binary cross-entropy loss:

\mathcal{L}_{\mathrm{critic}} = - \left[ y \log P_{\mathrm{critic}}(y = 1 \mid p, r) + (1 - y) \log \left(1 - P_{\mathrm{critic}}(y = 1 \mid p, r)\right) \right]

In the equation above, the binary label y indicates whether the given response is relevant to the given personas. During the reinforcement learning stage, this classifier acts as a critic network that outputs a predicted label \hat{y}, conditioned on the generated partner personas \hat{p} and the generated response \hat{r}. The predicted binary label is then converted to a reward R: R is a positive reward when \hat{y} = 1, and a negative reward when \hat{y} = 0. We then update our RL agents with the following gradients:
\nabla_{\theta} J \approx R \, \nabla_{\theta} \log P_{\theta}(\hat{p} \mid c, s)

for the partner personas generator (PPG), and, for the dialogue response generator (DRG):

\nabla_{\phi} J \approx R \, \nabla_{\phi} \log P_{\phi}(\hat{r} \mid c, s, \hat{p})
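The reward conversion and the resulting REINFORCE-style updates can be sketched as follows; the log-probabilities and critic output here are toy stand-ins for the actual model quantities, and the reward magnitude is an assumption:

```python
def reward_from_critic(y_hat, magnitude=1.0):
    """Convert the critic's predicted binary label into a scalar reward:
    positive when y_hat == 1 (personas and response judged relevant),
    negative when y_hat == 0."""
    return magnitude if y_hat == 1 else -magnitude

def rl_losses(reward, logp_personas, logp_response):
    """REINFORCE-style surrogate losses: minimizing -R * log P increases the
    likelihood of samples with positive reward (and decreases it for negative
    reward) for both the PPG and the DRG."""
    return -reward * logp_personas, -reward * logp_response

# Toy example: critic judges the generated pair relevant (y_hat = 1).
ppg_loss, drg_loss = rl_losses(reward_from_critic(1), -2.3, -4.1)
```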
We particularly want to give positive rewards to our RL agents when they give high-quality responses that differ from the ground truth. Since it is not straightforward to understand the underlying motivation for such a critic network, we divide it into two cases and conquer each of them:
The critic network outputs \hat{y} = 0: the generated personas and response are judged irrelevant to each other, and we assign a negative reward.
The critic network outputs \hat{y} = 1: the generated personas and response are judged relevant to each other, and we assign a positive reward.
The first case is trivial, as it is reasonable to assign a negative reward when at least one of our RL agents generates an output far away from the ground truth. For the second case, in addition to the trivial sub-case in which both agents output ground-truth-like generations, it also covers the case in which both the partner personas generator and the dialogue response generator generate relevant outputs that are not the exact ground truth. Maximum likelihood estimation might fail to capture this reward, as there could still be a certain distance to the ground truth. In contrast, our critic network captures it by outputting \hat{y} = 1 and assigns both of our RL agents a positive reward. We design such a dedicated reward mechanism to encourage the generator to produce a diverse and engaging response with the diverse partner personas generated. We present a case study in Section 5.2 that illustrates this advantage.
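The critic's training-set construction described in Section 3.3 (deriving negatives by mismatching two distinct positive persona–response pairs) can be sketched with a stdlib-only function; the example persona and response strings are illustrative:

```python
import random

def build_critic_dataset(positive_pairs, seed=0):
    """From positive (partner_personas, response) pairs, derive negatives by
    swapping responses between two randomly chosen distinct positives."""
    rng = random.Random(seed)
    dataset = [(p, r, 1) for p, r in positive_pairs]      # label 1: relevant
    i, j = rng.sample(range(len(positive_pairs)), 2)      # two distinct positives
    (p_i, r_i), (p_j, r_j) = positive_pairs[i], positive_pairs[j]
    dataset += [(p_i, r_j, 0), (p_j, r_i, 0)]             # label 0: mismatched
    return dataset

pairs = [("i have a dog .", "my dog loves walks ."),
         ("i am a chef .", "i cook italian food .")]
data = build_critic_dataset(pairs)
```

Repeating this over the training partition yields a balanced binary classification set like the roughly 130,000 instances described in Section 4.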
Previous work (Cai et al., 2019) employed a critic network for reinforcement loss backpropagation. At first glance, our usage of the critic network shares some resemblance to theirs, but the underlying motivation vastly differs. The major difference is that their critic is trained in an adversarial manner (Li et al., 2018) to pick out the ground truth response among other negative candidates. Also, their critic network conditions only on the dialogue response, not on the generated skeleton. In contrast, we further diversify the response generation with a classifier conditioning on both the generated personas and the generated response.
[Table 1 rows: E2E w/o Partner Personas; E2E w/ Training Partner Personas; Multi-task Learning (Lee et al., 2021); Our Framework w/ RL PPG; Our Framework w/ RL DRG; Our Framework w/o RL; Our Framework w/ RL PPG&DRG]
4 Experimental Setup
4.1 Dataset
We conduct all our experiments on PersonaChat (Zhang et al., 2018), the largest multi-turn dialogue dataset conditioned on persona profiles. We follow the training/validation/testing split from the ParlAI platform (Miller et al., 2017), which contains about 65,000 training instances, about 7,800 validation instances and about 7,500 testing instances. For the reinforcement learning described in Section 3.3, we collect about 130,000 training instances from the training partition, with equally distributed positive and negative samples, to train our critic network.
4.2 Baselines and Comparison Models
End-to-end Baseline without Partner Personas
Our first baseline is an end-to-end response generator without any partner persona information throughout the experiments. We note that it could be unfair to directly compare our proposed framework with this baseline, as our framework uses ground truth partner personas in the training stage. Since this baseline does not use the same amount of training information as our proposed framework, we offer a second baseline trained with the ground truth partner personas.
End-to-end Baseline with Training Partner Personas
Our second baseline is an end-to-end response generator that is trained using the ground truth partner personas. At the inference stage, we feed the concatenation of the self personas and the dialogue context as the input. For the sake of fairness, it uses the same amount of training and inference information as our proposed framework.
Multi-task Learning Comparison Model
Following prior work (Lee et al., 2021), we build a multi-task learning comparison model with partner personas generation as an auxiliary task. The model is trained to maximise the sum of the log-likelihood of the partner personas generation labels, \log P(p \mid c, s), and the log-likelihood of the dialogue response generation labels, \log P(r \mid c, s). Both tasks are conditioned on the dialogue context and the self personas, sharing the same model parameters. We maximise the objective:

\mathcal{L}_{\mathrm{MTL}} = \log P(r \mid c, s) + \lambda \log P(p \mid c, s)

where \lambda is a loss weighting parameter which we tune over the validation set.
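A stdlib-only sketch of such a weighted multi-task objective (expressed over negative log-likelihoods, so it is minimized) and of tuning the weighting parameter on a validation criterion; the candidate grid and the toy criterion are illustrative assumptions:

```python
def multitask_loss(nll_response, nll_personas, lam=0.5):
    """Combined negative log-likelihood for a multi-task model:
    response generation plus a lambda-weighted personas auxiliary task.
    Minimizing this is equivalent to maximizing the weighted log-likelihoods."""
    return nll_response + lam * nll_personas

def tune_lambda(validation_fn, candidates=(0.1, 0.5, 1.0)):
    """Pick the lambda with the best (lowest) validation score."""
    return min(candidates, key=validation_fn)

# Toy validation criterion standing in for a real validation-set metric.
best_lam = tune_lambda(lambda lam: abs(lam - 0.5))
```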
4.3 Implementation Details
For all of our baselines, the comparison model, the partner personas generator and the dialogue response generator, we use pre-trained GPT-2 (Radford et al., 2019) to initialize the model parameters. For the supervised phase, we use Adam (Kingma and Ba, 2014) as our optimizer, with hyperparameters , , . We fine-tune for 2 epochs on all the baselines, comparison models and our proposed framework modules, and select the models with the lowest validation perplexity. For the RL phase, we use Adam as our optimizer, with hyperparameters , , . We update the model parameters every  training instances and validate the model performance every  updates. For our critic network for reward judgement in the RL phase, we use DistilBERT (Sanh et al., 2019) to initialize the model parameters. We use Adam as our optimizer, with hyperparameters , , , . We fine-tune the critic for 1 epoch on the original training split of PersonaChat. We conduct all our experiments with the Transformers library from Huggingface (Wolf et al., 2020).
4.4 Evaluation Metrics
We report intrinsic perplexity, which evaluates the model against the ground truth response (Roller et al., 2020). We report distinct-1 and distinct-2 (Li et al., 2015) to evaluate model diversity; these calculate the ratio of distinct unigrams/bigrams to total unigrams/bigrams generated. We report ROUGE (Lin, 2004) and METEOR (Banerjee and Lavie, 2005) as extrinsic evaluations.
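The distinct-n metric described above can be computed with a short stdlib-only function; whitespace tokenization is a simplifying assumption:

```python
def distinct_n(texts, n):
    """Ratio of distinct n-grams to total n-grams over a set of generated texts."""
    total, distinct = 0, set()
    for text in texts:
        tokens = text.split()
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        distinct.update(ngrams)
    return len(distinct) / total if total else 0.0

# 5 distinct unigrams out of 8 total -> 0.625
d1 = distinct_n(["i like dogs .", "i like cats ."], 1)
```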
5 Results
Recall our four claims: a) our framework generates more diverse partner personas than the ground truth partner personas; b) partner personas generation benefits the succeeding response generation, thus surpassing our baselines; c) our framework generates more diverse and engaging responses than our competitive baseline that uses ground truth training and inference partner personas; d) we employ reinforcement learning with a dedicatedly designed critic network that effectively boosts our framework. Also, recall the three hypotheses that motivate our work: i) partner personas generation is plausible; ii) generated personas are more diverse and interesting than the retrieved ground truth; iii) generated personas boost the succeeding dialogue generation. In the remainder of this section, we verify these claims and hypotheses.
5.1 Main Results
The main results are presented in Table 1.
Our Framework w/o RL
Even when trained without the reinforcement learning algorithm, our framework surpasses all our baselines and the comparison model. This phenomenon serves as evidence for our claim b) and hypothesis iii). Therefore, our framework parameterises the training partner personas more efficiently than our E2E baseline trained with ground truth partner personas. Also, our framework relaxes the constraint that requires the main and the auxiliary task to have a similar nature, which is unrealistic in our case.
[Table: distinct scores — Ground Truth Label: 0.003267 / 0.03856; E2E w/ Full]
Our Framework w/ RL PPG&DRG
Our proposed framework performs the best when trained with reinforcement learning. Again, this phenomenon strengthens our claim b) and hypothesis iii). In addition, it serves as direct evidence for our claim d). It also implicitly verifies our hypothesis i), because we train our critic network using ground truth partner personas: since the reinforcement learning algorithm is effective, our critic network produces valid rewards, which indicates that the generated partner personas are reasonable. Furthermore, we present the change in validation performance for dialogue generation and partner personas generation in Figure 3 (the results are scaled for the sake of space and clarity). We observe that both performances improve during the RL stage. Therefore, the improvement in partner personas generation is intimately related to the improvement in the succeeding dialogue generation, supporting our claim b) and hypothesis iii).
[Table: human evaluation criteria, comparing E2E w/ Full and Ours]
End-to-end (E2E) Baseline Models
Our E2E baseline with training partner personas has a better perplexity and worse extrinsic scores than the E2E baseline without partner personas. This might be due to the training-inference discrepancy, which could significantly impact the extrinsic evaluations.
Multi-task Learning Comparison Model
Our multi-task learning comparison model produces inferior results. This is, however, predictable. First, the prior work (Lee et al., 2021) constrained itself to a meta-learning framework. Second, the nature of partner personas generation and dialogue response generation largely differs: partner personas always begin with first-person sentence starters, while dialogue responses are more general, ranging from greetings to goodbyes.
5.2 Case Study on Dialogue Response Generation
Table 4 depicts the case study for response generation. In the first case, our partner personas generator plausibly imagines that a person who likes shopping could be rich and drive a luxury car, which is not in the ground truth personas. This is followed by the engaging response ‘I am a bit envious’. In the second case, our personas generator successfully identifies that the partner lives in an apartment. Subsequently, the response generator gives a relevant response and reduces the undesired hallucination from the baseline model (‘I like to watch TV’ is in the self personas, but ‘I live in California’ is not; the latter is thus a hallucination). In the third case, our partner personas generator generates a partner persona ‘I love the outdoors’, which is not even in the ground truth personas. The response generator then produces a relevant response which also expresses empathy. These observations support our underlying hypotheses i), ii), and iii). Furthermore, as in Table 3, we compare our proposed framework with an end-to-end dialogue agent that uses both training and inference ground truth partner personas. For these cherry-picked examples, our framework generates more informative and engaging responses than this competitive baseline. This verifies our claim c) and hypothesis iii).
5.3 Human Evaluation
We hired experienced annotators holding degrees relevant to English Linguistics or Applied Linguistics. We present a questionnaire composed of 280 questions over 70 randomly sampled testing instances. Three annotators compare model outputs in an A/B setting. Following previous work (Zou et al., 2021) and ACUTE-Eval (Li et al., 2019), annotators use the criteria:
(Appropriateness): "Who is more appropriate given the previous dialogue context?"
(Informativeness): "Who is more diverse instead of null answers such as I do not know?"
(Engagingness): "Who would you prefer to talk with for a long conversation?"
(Human-likeness): "Which speaker do you think sounds more like a real person?"
Table 5 presents the human evaluation results. Our framework trained under reinforcement learning significantly surpasses the end-to-end model that leverages both training and inference ground truth partner personas on all four aspects. This supports our claims c) and d).
5.4 Ablation Study on Reinforcement Learning
Table 1 presents an ablation study on the framework when only one of the modules, namely the partner personas generator (PPG) or the dialogue response generator (DRG), is trained under reinforcement learning. Our full framework exceeds these two variants in all the metrics except for perplexity, which aligns with prior work (Roller et al., 2020) observing that perplexity does not always correlate well with other metrics.
5.5 Case Study on Partner Personas Generation
As depicted in Table 2, we observe that our partner personas generator generates more diverse partner personas than the ground truth partner personas label, which is essentially the upper bound for any retrieval-based partner personas predictor. This phenomenon verifies our claim a) and hypothesis i), indicating that our generator produces even more informative and interesting partner personas than the ground truth partner personas.
As depicted in Table 6, our partner personas generator can generate plausible partner personas that are relevant to the ground truth partner personas. It can also produce fascinating yet reasonable imagined personas that appear neither in the dialogue context nor in the ground truth partner personas. In the first case, the generator successfully identifies the partner as a student studying biology. In the second case, the generator recognizes the partner as being married, which is not even mentioned in the dialogue context. This phenomenon may stem from the fact that personas can be semantically closer to each other when they frequently co-occur in the training set. In the third case, the generator produces diverse personas, saying that the partner would drink beer and eat food while watching football, which appears in neither the dialogue context nor the ground truth partner personas. This verifies claim a) and hypotheses i) and ii). We present more case studies in Appendix B to show that our personas generator generates informative and coherent partner personas. (Our partner personas generator is even capable of producing unseen personas; an offensiveness check is thus necessary for actual usage, as in prior works (Baheti et al., 2021).)
6 Conclusion
Our novel framework incorporates partner personas generation into succeeding dialogue response generation. First, our proposed framework mitigates the cold-start problem in practical applications where ground truth partner personas may be missing during inference. The experimental results, with both automatic and human evaluation, demonstrate that our framework generates partner personas that are more informative and coherent than even the ground truth partner personas, yet still reasonable and relevant. This enhances the succeeding response generation, which surpasses our baselines and produces responses that are more diverse and engaging than our baseline conditioned on the ground truth partner personas. We employ reinforcement learning with a dedicatedly designed critic network that boosts our framework. Extensive case studies demonstrate that our framework can generate satisfying dialogue responses and partner personas. Finally, our framework offers better explainability and reduces the demand for external databases of partner personas.
- Baheti et al. (2021) Ashutosh Baheti, Maarten Sap, Alan Ritter, and Mark Riedl. 2021. Just Say No: Analyzing the Stance of Neural Dialogue Generation in Offensive Contexts. arXiv e-prints, page arXiv:2108.11830.
- Banerjee and Lavie (2005) Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan. Association for Computational Linguistics.
- Bao et al. (2021) Siqi Bao, Huang He, Fan Wang, Hua Wu, Haifeng Wang, Wenquan Wu, Zhihua Wu, Zhen Guo, Hua Lu, Xinxian Huang, Xin Tian, Xinchao Xu, Yingzhan Lin, and Zhengyu Niu. 2021. PLATO-XL: Exploring the Large-scale Pre-training of Dialogue Generation. arXiv e-prints, page arXiv:2109.09519.
- Cai et al. (2019) Deng Cai, Yan Wang, Wei Bi, Zhaopeng Tu, Xiaojiang Liu, Wai Lam, and Shuming Shi. 2019. Skeleton-to-response: Dialogue generation guided by retrieval memory. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1219–1228, Minneapolis, Minnesota. Association for Computational Linguistics.
- Deng et al. (2021) Yang Deng, Yaliang Li, Fei Sun, Bolin Ding, and Wai Lam. 2021. Unified Conversational Recommendation Policy Learning via Graph-based Reinforcement Learning. arXiv e-prints, page arXiv:2105.09710.
- Gu et al. (2019) Jia-Chen Gu, Zhen-Hua Ling, Xiaodan Zhu, and Quan Liu. 2019. Dually interactive matching network for personalized response selection in retrieval-based chatbots. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1845–1854, Hong Kong, China. Association for Computational Linguistics.
- Gu et al. (2021) Jia-Chen Gu, Hui Liu, Zhen-Hua Ling, Quan Liu, Zhigang Chen, and Xiaodan Zhu. 2021. Partner Matters! An Empirical Study on Fusing Personas for Personalized Response Selection in Retrieval-Based Chatbots. arXiv e-prints, page arXiv:2105.09050.
- Kim et al. (2020) Hyunwoo Kim, Byeongchang Kim, and Gunhee Kim. 2020. Will I sound like me? improving persona consistency in dialogues through pragmatic self-consciousness. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 904–916, Online. Association for Computational Linguistics.
- Kingma and Ba (2014) Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. arXiv e-prints, page arXiv:1412.6980.
- Lee et al. (2021) Jing Yang Lee, Kong Aik Lee, and Woon Seng Gan. 2021. Generating Personalized Dialogue via Multi-Task Meta-Learning. arXiv e-prints, page arXiv:2108.03377.
- Li et al. (2018) Dianqi Li, Qiuyuan Huang, Xiaodong He, Lei Zhang, and Ming-Ting Sun. 2018. Generating Diverse and Accurate Visual Captions by Comparative Adversarial Learning. arXiv e-prints, page arXiv:1804.00861.
- Li et al. (2015) Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2015. A Diversity-Promoting Objective Function for Neural Conversation Models. arXiv e-prints, page arXiv:1510.03055.
- Li et al. (2016) Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, and Dan Jurafsky. 2016. Deep Reinforcement Learning for Dialogue Generation. arXiv e-prints, page arXiv:1606.01541.
- Li et al. (2019) Margaret Li, Jason Weston, and Stephen Roller. 2019. ACUTE-EVAL: Improved Dialogue Evaluation with Optimized Questions and Multi-turn Comparisons. arXiv e-prints, page arXiv:1909.03087.
- Li et al. (2020) Shijun Li, Wenqiang Lei, Qingyun Wu, Xiangnan He, Peng Jiang, and Tat-Seng Chua. 2020. Seamlessly Unifying Attributes and Items: Conversational Recommendation for Cold-Start Users. arXiv e-prints, page arXiv:2005.12979.
- Lin (2004) Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
- Madotto et al. (2020) Andrea Madotto, Samuel Cahyawijaya, Genta Indra Winata, Yan Xu, Zihan Liu, Zhaojiang Lin, and Pascale Fung. 2020. Learning knowledge bases with parameters for task-oriented dialogue systems. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2372–2394, Online. Association for Computational Linguistics.
- Mazaré et al. (2018) Pierre-Emmanuel Mazaré, Samuel Humeau, Martin Raison, and Antoine Bordes. 2018. Training millions of personalized dialogue agents. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2775–2779, Brussels, Belgium. Association for Computational Linguistics.
- Miller et al. (2017) Alexander H. Miller, Will Feng, Adam Fisch, Jiasen Lu, Dhruv Batra, Antoine Bordes, Devi Parikh, and Jason Weston. 2017. ParlAI: A Dialog Research Software Platform. arXiv e-prints, page arXiv:1705.06476.
- Radford et al. (2019) Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.
- Roller et al. (2020) Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, and Jason Weston. 2020. Recipes for building an open-domain chatbot. arXiv e-prints, page arXiv:2004.13637.
- Roman Roman et al. (2020) Homero Roman Roman, Yonatan Bisk, Jesse Thomason, Asli Celikyilmaz, and Jianfeng Gao. 2020. RMM: A recursive mental model for dialogue navigation. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1732–1745, Online. Association for Computational Linguistics.
- Saleh et al. (2019) Abdelrhman Saleh, Natasha Jaques, Asma Ghandeharioun, Judy Hanwen Shen, and Rosalind Picard. 2019. Hierarchical Reinforcement Learning for Open-Domain Dialog. arXiv e-prints, page arXiv:1909.07547.
- Sanh et al. (2019) Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv e-prints, page arXiv:1910.01108.
- Song et al. (2021) Haoyu Song, Yan Wang, Kaiyan Zhang, Wei-Nan Zhang, and Ting Liu. 2021. BoB: BERT over BERT for training persona-based dialogue models from limited personalized data. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 167–177, Online. Association for Computational Linguistics.
- Song et al. (2020) Haoyu Song, Yan Wang, Wei-Nan Zhang, Xiaojiang Liu, and Ting Liu. 2020. Generate, delete and rewrite: A three-stage framework for improving persona consistency of dialogue generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5821–5831, Online. Association for Computational Linguistics.
- Song et al. (2019) Haoyu Song, Wei-Nan Zhang, Yiming Cui, Dong Wang, and Ting Liu. 2019. Exploiting Persona Information for Diverse Generation of Conversational Responses. arXiv e-prints, page arXiv:1905.12188.
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks. arXiv e-prints, page arXiv:1409.3215.
- Williams (1992) Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn., 8(3–4):229–256.
- Wolf et al. (2020) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.
- Wolf et al. (2019) Thomas Wolf, Victor Sanh, Julien Chaumond, and Clement Delangue. 2019. TransferTransfo: A Transfer Learning Approach for Neural Network Based Conversational Agents. arXiv e-prints, page arXiv:1901.08149.
- Xu et al. (2021) Yan Xu, Etsuko Ishii, Zihan Liu, Genta Indra Winata, Dan Su, Andrea Madotto, and Pascale Fung. 2021. Retrieval-Free Knowledge-Grounded Dialogue Response Generation with Adapters. arXiv e-prints, page arXiv:2105.06232.
- Zhang et al. (2018) Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2204–2213, Melbourne, Australia. Association for Computational Linguistics.
- Zhang et al. (2019) Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. 2019. DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation. arXiv e-prints, page arXiv:1911.00536.
- Zhao et al. (2019) Xueliang Zhao, Chongyang Tao, Wei Wu, Can Xu, Dongyan Zhao, and Rui Yan. 2019. A Document-grounded Matching Network for Response Selection in Retrieval-based Chatbots. arXiv e-prints, page arXiv:1906.04362.
- Zou et al. (2021) Yicheng Zou, Zhihua Liu, Xingwu Hu, and Qi Zhang. 2021. Thinking Clearly, Talking Fast: Concept-Guided Non-Autoregressive Generation for Open-Domain Dialogue Systems. arXiv e-prints, page arXiv:2109.04084.
Appendix A Validation Performance
[Table: validation performance, including E2E w/ Full]
Appendix B More Case Studies
Table 9 presents extensive case studies for partner personas generation. These examples indicate that our framework can generate informative and coherent partner personas. We highlight in pink for informativeness and in yellow for coherence.
[Table rows: E2E w/o Partner Personas; E2E w/ Training Partner Personas; Multi-task Learning (Lee et al., 2021); Our Framework w/ RL PPG; Our Framework w/ RL DRG; Our Framework w/o RL; Our Framework w/ RL PPG&DRG]