Since dialogue is regarded as a fundamental and complex element of human cognition (Jurafsky and Martin, 2000), the development of systems capable of understanding human language and communicating with humans can have a significant impact. However, human communication requires acknowledging and exchanging emotions with the conversational partner, as emotions play an important role in developing a trusting relationship between the speaker and the listener.
Open-domain conversational agents have been widely studied in recent years, and both retrieval-based and generation-based approaches (Wu et al., 2019; Cai et al., 2019; Weston et al., 2018) have been developed. However, prior research has shown that most of those conversational agents are unable to imitate dialogues between humans, as the produced responses are generic and short (Vinyals and Le, 2015; Li et al., 2016b). Several efforts have been made to make the conversation more engaging by keeping track of the conversational context (Sordoni et al., 2015b, a; Serban et al., 2016, 2017) or by producing more diverse responses (Li et al., 2016a, c). Subsequently, a recent trend followed by various researchers (Li et al., 2016b; Zhang et al., 2018; Kulikov et al., 2019; Joshi et al., 2017; Zemlyanskiy and Sha, 2018; Mazaré et al., 2018; Dinan et al., 2020; Madotto et al., 2019; Hancock et al., 2019; Yavuz et al., 2019; Wolf et al., 2019) has been to produce personalized responses by conditioning the generation on a persona profile, in order to make the responses more coherent and consistent throughout the dialogue.
Apart from understanding what is being discussed, a conversational agent should also acknowledge the emotional state of the conversational partner, as it is a significant part of human communication. Many researchers have focused on detecting emotion (Fan et al., 2018b; Xu et al., 2018; Winata et al., 2017, 2019) and empathy in dialogue systems (Bertero et al., 2016; Chatterjee et al., 2019). Zhou et al. (2018) introduced a seq2seq (Sutskever et al., 2014) Emotional Chatting Machine to generate responses with high emotional content, using emotional embeddings and an internal and external memory mechanism. Wang and Wan (2018) proposed a GAN-based (Goodfellow et al., 2014) framework that controls the sentiment of the generated response. Wu and Wu (2019) similarly used a dual decoder to generate emotional responses given the sentiment. Zhou and Wang (2018) introduced a Twitter dataset that uses the emojis of Twitter posts as emotion labels, and also proposed a seq2seq model to generate emotional responses. Lubis et al. (2018) introduced a new dataset and proposed a hierarchical seq2seq response generator for affect-sensitive dialogue generation. Rashkin et al. (2019) introduced the EmpatheticDialogues dataset and trained baselines to generate empathetic responses while simultaneously predicting the corresponding emotion of the dialogue context. Later, Lin et al. (2019) introduced the "Mixture of Empathetic Listeners" (MoEL) framework, improving on the initial baselines. Santhanam and Shaikh (2019) finetuned the GPT2 (Radford et al., 2019) model to improve the results further, while Shin et al. (2019) used reinforcement learning to predict the user’s sentiment look-ahead alongside response generation. Lin et al. (2019) improved the performance on EmpatheticDialogues with CAiRE, which finetunes the GPT2 model using multitask learning, while Majumder et al. (2020) followed a different approach, introducing stochasticity into the emotion mixture and arguing that empathetic responses do not always mirror the emotion of the user. Significant improvements were also made by Roller et al. (2021) and Shuster et al. (2020), who used multi-task training on multiple dialogue tasks, achieving state-of-the-art results.
In this work, in order to enforce empathetic response generation, we propose a method based on a pretrained transformer language model (T5). Specifically, during finetuning we use three objectives: response language modeling, sentiment understanding and empathy forcing. The sentiment understanding objective is crucial for tracking and acknowledging the emotional state of the conversational partner, while the empathy forcing objective favors empathetic response generation by penalizing responses whose sentiment is opposite to that of the conversational partner. Our key contribution is the inclusion of the sentiment understanding and empathy forcing auxiliary losses to promote empathetic behavior. The proposed approach, EmpBot (the implementation will be made publicly available after the anonymity period is over), is on par with the state of the art in terms of BLEU score. However, our model produces significantly more fluent and empathetic responses, as indicated by human evaluation results.
2 Proposed Method
Our approach is based on the assumption that an empathetic conversational agent should mirror the emotion of the speaker (Carr et al., 2003). Following this perspective, we introduce EmpBot, a model that favors sentiment understanding and empathetic response generation using the sentiment of each dialogue context. EmpBot is based on the Unified Text-to-Text Transformer (T5) (Raffel et al., 2020), a transformer-based (Vaswani et al., 2017) pretrained seq2seq network, which we extend with a 2-layer sentiment classifier and auxiliary losses during training in order to apply sentiment understanding and enforce empathetic response generation. The model is illustrated in Figure 1.
Figure 1: The EmpBot model. EmpBot uses the encoded contextualized representations of the dialogue context and the response to produce, through the sentiment classifier, the corresponding sentiment representations, denoted by $s_c$ and $s_r$ respectively. It is finetuned on the EmpatheticDialogues dataset using three objectives: response language modeling, sentiment understanding and empathy forcing.
Response language modeling: to optimize the response language modeling objective, we use the contextualized representation of the gold response and apply language modeling by predicting the reply tokens using the cross-entropy loss. We denote that loss as $\mathcal{L}_{lm}$.
Sentiment understanding: to optimize the sentiment understanding objective, we pass the contextualized representation of the dialogue context through the 2-layer sentiment classifier and apply sentiment classification using the cross-entropy loss. We denote that loss as $\mathcal{L}_{sent}$. In this way, the model learns to predict the sentiment of the dialogue, and specifically that of the speaker, using the sentiment labels we created.
Empathy forcing: to enforce empathetic behavior, we enhance the model with a cosine similarity embedding loss. Specifically, we use the contextualized sentiment representations obtained from the first layer of the sentiment classifier, both for the dialogue context and the response. The model is penalized when the sentiment representation of the generated response differs from that of the dialogue context, which not only favors sentiment understanding but also promotes empathetic response generation. The aforementioned loss is:
$$\mathcal{L}_{emp} = 1 - \cos(s_c, s_r)$$
where $s_c$ and $s_r$ are the contextualized sentiment representations of the dialogue context and the response respectively, and $\cos(\cdot, \cdot)$ is the cosine similarity function. Our final fine-tuning loss function is the weighted sum of the aforementioned losses:
$$\mathcal{L} = \mathcal{L}_{lm} + \alpha \mathcal{L}_{sent} + \beta \mathcal{L}_{emp}$$
where $\mathcal{L}_{lm}$ and $\mathcal{L}_{sent}$ are the response language modeling and sentiment understanding losses, and $\alpha$ and $\beta$ are constants (for details on hyperparameter tuning, see Appendix B).
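The two auxiliary objectives described above can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the actual implementation: the tanh activation, the `1 - cos` form of the empathy forcing term, the placement of the two weights on the auxiliary terms only, the toy dimensions and every function name (`sentiment_forward`, `empathy_forcing_loss`, `total_loss`) are hypothetical choices made for the example.

```python
import math

def linear(x, weights, bias):
    """Dense layer; weights holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sentiment_forward(pooled, w1, b1, w2, b2):
    """Two-layer classifier: returns (first-layer representation, class probabilities)."""
    hidden = [math.tanh(h) for h in linear(pooled, w1, b1)]  # tanh is an assumption
    return hidden, softmax(linear(hidden, w2, b2))

def cross_entropy(probs, gold_index):
    return -math.log(probs[gold_index])

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def empathy_forcing_loss(s_context, s_response):
    """Assumed form 1 - cos: zero when both representations point the same way."""
    return 1.0 - cosine_similarity(s_context, s_response)

def total_loss(lm_loss, sentiment_loss, empathy_loss, alpha, beta):
    """Weighted sum; weighting only the auxiliary terms is an assumption."""
    return lm_loss + alpha * sentiment_loss + beta * empathy_loss

# Toy weights and pooled encoder vectors (all values hypothetical).
w1, b1 = [[0.5, 0.0, 0.0], [0.0, 0.0, 0.5]], [0.0, 0.0]
w2, b2 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
context_vec, response_vec = [2.0, 0.0, 1.0], [-2.0, 0.0, -1.0]

s_context, context_probs = sentiment_forward(context_vec, w1, b1, w2, b2)
s_response, _ = sentiment_forward(response_vec, w1, b1, w2, b2)

sentiment_loss = cross_entropy(context_probs, gold_index=0)
empathy_loss = empathy_forcing_loss(s_context, s_response)  # opposite vectors
loss = total_loss(lm_loss=1.0, sentiment_loss=sentiment_loss,
                  empathy_loss=empathy_loss, alpha=1.0, beta=0.5)
```

In this toy case the response representation points in the opposite direction to the context representation, so the empathy forcing term takes its largest value; a response whose sentiment representation is aligned with the context would contribute zero.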
3 Experimental Setup
We conduct our experiments on the EmpatheticDialogues dataset (Rashkin et al., 2019), which consists of approximately 25k one-on-one open-domain conversations, each grounded in a situation and an associated emotion. For all experiments, we use the official 8:1:1 train/validation/test split defined by the authors. We group the provided emotions of each dialogue into two groups according to their sentiment polarity: 15 emotions are grouped as positive and 17 as negative (for details about the split, see Appendix A).
We compare the following models:
DD MT: Multitask DodecaDialogue model, proposed in Shuster et al. (2020).
DD MT+FT: Multitask DodecaDialogue model finetuned on the EmpatheticDialogues dataset, proposed in Shuster et al. (2020).
Baseline: T5 model finetuned on EmpatheticDialogues for response generation.
EmpBot: Proposed T5-based model finetuned on EmpatheticDialogues using the proposed loss. Further details on the implementation and on the training and testing procedures are provided in Appendix B.
3.3 Evaluation Protocol
We evaluate our models using both automatic and human evaluation. Although automated metrics can measure both the ability of the model to reproduce the listener’s response and the diversity of the responses, they do not always correlate with human judgements of dialogue quality (Liu et al., 2016). Nevertheless, we report both automatic metrics and human evaluation scores.
Table 1: Automatic evaluation results (perplexity and BLEU) for the baseline, EmpBot and prior models: BST Generative (Roller et al., 2021), DD MT (SOTA) and DD MT+FT (SOTA) (Shuster et al., 2020).
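Table 2: A/B test results of the human evaluation (Win | Loss for each of the three sub-tasks).
| Comparison | Win | Loss | Win | Loss | Win | Loss |
| EmpBot vs DD MT+FT | 57.14% | 42.86% | 57.73% | 42.27% | 56.56% | 43.44% |
| EmpBot vs Baseline | 63.81% | 36.19% | 65.71% | 34.29% | 59.05% | 40.95% |
| DD MT+FT vs Baseline | 59.05% | 40.95% | 60.95% | 39.05% | 60.00% | 40.00% |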
Automated metrics: We report the perplexity (PPL) of the actual (gold) response, as in Wen et al. (2015) and Li et al. (2016a, b). Moreover, we report BLEU scores (Papineni et al., 2002) between the generated response and the gold response.
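To make the two automated metrics concrete, the following sketch computes perplexity from the probabilities a model assigns to the gold tokens, together with the clipped n-gram precision that BLEU is built on. It is a simplified illustration rather than the evaluation script used here: the full BLEU score additionally combines several n-gram orders and applies a brevity penalty, and the token probabilities and sentences below are made-up toy inputs.

```python
import math
from collections import Counter

def perplexity(gold_token_probs):
    """exp of the mean negative log-probability assigned to the gold tokens."""
    nll = [-math.log(p) for p in gold_token_probs]
    return math.exp(sum(nll) / len(nll))

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision, the core quantity behind BLEU."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    total = sum(cand_counts.values())
    return overlap / total if total else 0.0

# Toy inputs (hypothetical): uniform probability 0.25 on each gold token.
ppl = perplexity([0.25, 0.25, 0.25, 0.25])

generated = "i am sorry to hear that".split()
gold = "i am so sorry to hear that".split()
p1 = modified_precision(generated, gold, 1)  # unigram precision
p2 = modified_precision(generated, gold, 2)  # bigram precision
```

Lower perplexity means the model assigns higher probability to the gold response, whereas higher n-gram precision means more overlap between the generated and the gold response.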
Human evaluation: To measure the quality of the generated responses, we conduct a human evaluation through an online survey. The human evaluation process is split into two phases. In the first phase, we compare the EmpBot model with the current state-of-the-art DD MT+FT (Shuster et al., 2020). Participants were asked to do a pairwise comparison between the responses generated by these two models according to:
Relevance and Fluency given the dialogue context (reported in Table 2),
Empathy given the dialogue context and the speaker’s sentiment (reported in Table 2) and
Empathy given the dialogue context and the speaker’s emotion (reported in Table 2).
Moreover, participants were also asked to rate each generated response on the three following aspects: Relevance, Fluency and Empathy, given the dialogue context for each model using a 1-5 Likert scale, where 5 is the best score. So, participants had to complete 3 A/B testing sub-tasks in order to directly compare the models and 2 rating sub-tasks for an indirect comparison.
In the second phase, we compare our EmpBot model against our baseline and the state-of-the-art DD MT+FT model against our baseline. In that phase, participants were asked to compare the generated responses using the same format as the three A/B testing sub-tasks of the first phase (for details about the human evaluation, see Appendix C).
Evaluation results and a comparison with other models are presented in Table 1. The human evaluation results are shown in Tables 2 and 3.
The DD MT+FT model still maintains the state-of-the-art performance in perplexity, with EmpBot achieving a somewhat worse perplexity (8.5% difference). However, both our baseline and the EmpBot model outperform the current state-of-the-art model in terms of the average BLEU score, achieving scores of 8.89 and 8.84 respectively. Consequently, our empathetic approach (EmpBot) improves the state-of-the-art BLEU score, which was achieved by DD MT, by 5.2%. We also notice that our baseline performs slightly better than EmpBot on the BLEU metric, but the difference is not significant.
However, as the usefulness of the BLEU score has been questioned, we turn to human evaluation for a more precise measure of quality. In the human evaluation, the EmpBot model outperforms both the DD MT+FT model and our baseline, achieving significantly better results on both the A/B and the rating tests, as shown in Tables 2 and 3 respectively. More specifically, in Table 3 we notice a significant difference in the Fluency and Empathy scores between EmpBot and DD MT+FT, which shows that our approach is not only more empathetic but also generates more fluent responses. For the absolute Relevance score, there is no significant difference. In addition, we should note that according to the A/B test shown in Table 2, the DD MT+FT model seems to perform better than our baseline in all sub-tasks. We provide examples of the generated responses in Appendix E.
In this work we propose EmpBot, a T5-based chatbot augmented with a novel finetuning procedure for generating empathetic dialogue responses. The proposed loss consists of three parts: an LM loss that produces valid textual responses, a sentiment classification loss that introduces emotional awareness to the model and an empathy forcing loss that ensures that the responses are emotionally relevant. We evaluate EmpBot using standard evaluation metrics, i.e. perplexity and BLEU score, achieving a state-of-the-art BLEU score. Our human evaluation results indicate that EmpBot produces more fluent and empathetic responses when compared with both the baseline and the state-of-the-art models. In the future we want to extend the proposed method to other architectures, and explore more empathy forcing losses using raw emotion values instead of sentiment polarities.
References
- Bertero et al. (2016) Dario Bertero, Farhad Bin Siddique, Chien-Sheng Wu, Yan Wan, Ricky Ho Yin Chan, and Pascale Fung. 2016. Real-time speech emotion and sentiment recognition for interactive dialogue systems. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
- Cai et al. (2019) Deng Cai, Yan Wang, Wei Bi, Zhaopeng Tu, Xiaojiang Liu, Wai Lam, and Shuming Shi. 2019. Skeleton-to-response: Dialogue generation guided by retrieval memory. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics.
- Carr et al. (2003) Laurie Carr, Marco Iacoboni, Marie-Charlotte Dubeau, John C. Mazziotta, and Gian Luigi Lenzi. 2003. Neural mechanisms of empathy in humans: A relay from neural systems for imitation to limbic areas. Proceedings of the National Academy of Sciences.
- Chatterjee et al. (2019) Ankush Chatterjee, Umang Gupta, Manoj Kumar Chinnakotla, Radhakrishnan Srikanth, Michel Galley, and Puneet Agrawal. 2019. Understanding emotions in text using deep learning and big data. Computers in Human Behavior.
- Dinan et al. (2020) Emily Dinan, Varvara Logacheva, Valentin Malykh, Alexander Miller, Kurt Shuster, Jack Urbanek, Douwe Kiela, Arthur Szlam, Iulian Serban, Ryan Lowe, Shrimai Prabhumoye, Alan W. Black, Alexander Rudnicky, Jason Williams, Joelle Pineau, Mikhail Burtsev, and Jason Weston. 2020. The second conversational intelligence challenge (convai2). In The NeurIPS ’18 Competition. Springer International Publishing.
- Fan et al. (2018a) Angela Fan, Mike Lewis, and Yann Dauphin. 2018a. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics.
- Fan et al. (2018b) Yingruo Fan, Jacqueline C. K. Lam, and Victor O. K. Li. 2018b. Multi-region ensemble convolutional neural network for facial expression recognition.
- Goodfellow et al. (2014) Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2. MIT Press.
- Hancock et al. (2019) Braden Hancock, Antoine Bordes, Pierre-Emmanuel Mazare, and Jason Weston. 2019. Learning from dialogue after deployment: Feed yourself, chatbot! In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.
- Holtzman et al. (2020) Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The curious case of neural text degeneration. In International Conference on Learning Representations.
- Joshi et al. (2017) Chaitanya K. Joshi, Fei Mi, and Boi Faltings. 2017. Personalization in Goal-Oriented Dialog. arXiv e-prints.
- Jurafsky and Martin (2000) Daniel Jurafsky and James H. Martin. 2000. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall PTR.
- Kulikov et al. (2019) Ilia Kulikov, Alexander Miller, Kyunghyun Cho, and Jason Weston. 2019. Importance of search and evaluation strategies in neural dialogue modeling. In Proceedings of the 12th International Conference on Natural Language Generation. Association for Computational Linguistics.
- Li et al. (2016a) Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016a. A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics.
- Li et al. (2016b) Jiwei Li, Michel Galley, Chris Brockett, Georgios Spithourakis, Jianfeng Gao, and Bill Dolan. 2016b. A persona-based neural conversation model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics.
- Li et al. (2016c) Jiwei Li, Will Monroe, Alan Ritter, Dan Jurafsky, Michel Galley, and Jianfeng Gao. 2016c. Deep reinforcement learning for dialogue generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
- Lin et al. (2019) Zhaojiang Lin, Andrea Madotto, Jamin Shin, Peng Xu, and Pascale Fung. 2019. MoEL: Mixture of empathetic listeners. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics.
- Lin et al. (2019) Zhaojiang Lin, Peng Xu, Genta Indra Winata, Farhad Bin Siddique, Zihan Liu, Jamin Shin, and Pascale Fung. 2019. CAiRE: An Empathetic Neural Chatbot. arXiv e-prints.
- Liu et al. (2016) Chia-Wei Liu, Ryan Lowe, Iulian Serban, Mike Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
- Lubis et al. (2018) Nurul Lubis, Sakriani Sakti, Koichiro Yoshino, and S. Nakamura. 2018. Eliciting positive emotion through affect-sensitive dialogue response generation: A neural network approach. In AAAI.
- Madotto et al. (2019) Andrea Madotto, Zhaojiang Lin, Chien-Sheng Wu, and Pascale Fung. 2019. Personalizing dialogue agents via meta-learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.
- Majumder et al. (2020) Navonil Majumder, Pengfei Hong, Shanshan Peng, Jiankun Lu, Deepanway Ghosal, Alexander Gelbukh, Rada Mihalcea, and Soujanya Poria. 2020. MIME: MIMicking emotions for empathetic response generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.
- Mazaré et al. (2018) Pierre-Emmanuel Mazaré, Samuel Humeau, Martin Raison, and Antoine Bordes. 2018. Training millions of personalized dialogue agents. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
- Nachar (2008) Nadim Nachar. 2008. The Mann-Whitney U: A test for assessing whether two independent samples come from the same distribution. Tutorials in Quantitative Methods for Psychology.
- Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.
- Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog.
- Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research.
- Rashkin et al. (2019) Hannah Rashkin, Eric Michael Smith, Margaret Li, and Y-Lan Boureau. 2019. Towards empathetic open-domain conversation models: A new benchmark and dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.
- Roller et al. (2021) Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Eric Michael Smith, Y-Lan Boureau, and Jason Weston. 2021. Recipes for building an open-domain chatbot. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. Association for Computational Linguistics.
- Rosenberg and Ramabhadran (2017) Andrew Rosenberg and Bhuvana Ramabhadran. 2017. Bias and statistical significance in evaluating speech synthesis with mean opinion scores. In Proc. Interspeech 2017.
- Santhanam and Shaikh (2019) Sashank Santhanam and Samira Shaikh. 2019. Emotional neural language generation grounded in situational contexts. In Proceedings of the 4th Workshop on Computational Creativity in Language Generation. Association for Computational Linguistics.
- Serban et al. (2016) Iulian V. Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence. AAAI Press.
- Serban et al. (2017) Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2017. A hierarchical latent variable encoder-decoder model for generating dialogues. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence. AAAI Press.
- Shin et al. (2019) Jamin Shin, Peng Xu, Andrea Madotto, and Pascale Fung. 2019. HappyBot: Generating Empathetic Dialogue Responses by Improving User Experience Look-ahead. arXiv e-prints.
- Shuster et al. (2020) Kurt Shuster, Da Ju, Stephen Roller, Emily Dinan, Y-Lan Boureau, and Jason Weston. 2020. The dialogue dodecathlon: Open-domain knowledge and image grounded conversational agents. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.
- Sordoni et al. (2015a) Alessandro Sordoni, Yoshua Bengio, Hossein Vahabi, Christina Lioma, Jakob Grue Simonsen, and Jian-Yun Nie. 2015a. A hierarchical recurrent encoder-decoder for generative context-aware query suggestion. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. Association for Computing Machinery.
- Sordoni et al. (2015b) Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. 2015b. A neural network approach to context-sensitive generation of conversational responses. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics.
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2. MIT Press.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. Curran Associates, Inc.
- Vinyals and Le (2015) Oriol Vinyals and Quoc Le. 2015. A Neural Conversational Model. arXiv e-prints.
- Wang and Wan (2018) Ke Wang and Xiaojun Wan. 2018. Sentigan: Generating sentimental texts via mixture adversarial networks. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18. International Joint Conferences on Artificial Intelligence Organization.
- Wen et al. (2015) Tsung-Hsien Wen, Milica Gašić, Nikola Mrkšić, Pei-Hao Su, David Vandyke, and Steve Young. 2015. Semantically conditioned LSTM-based natural language generation for spoken dialogue systems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
- Weston et al. (2018) Jason Weston, Emily Dinan, and Alexander Miller. 2018. Retrieve and refine: Improved sequence generation models for dialogue. In Proceedings of the 2018 EMNLP Workshop SCAI: The 2nd International Workshop on Search-Oriented Conversational AI. Association for Computational Linguistics.
- Winata et al. (2017) Genta Indra Winata, Onno Kampman, Yang Yang, Anik Dey, and Pascale Fung. 2017. Nora the empathetic psychologist. In Proc. Interspeech 2017.
- Winata et al. (2019) Genta Indra Winata, Andrea Madotto, Zhaojiang Lin, Jamin Shin, Yan Xu, Peng Xu, and Pascale Fung. 2019. CAiRE_HKUST at SemEval-2019 task 3: Hierarchical attention for dialogue emotion classification. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 142–147. Association for Computational Linguistics.
- Wolf et al. (2019) Thomas Wolf, Victor Sanh, Julien Chaumond, and Clement Delangue. 2019. TransferTransfo: A Transfer Learning Approach for Neural Network Based Conversational Agents. arXiv e-prints.
- Wu and Wu (2019) Xiuyu Wu and Yunfang Wu. 2019. A Simple Dual-decoder Model for Generating Response with Sentiment. arXiv e-prints.
- Wu et al. (2019) Yu Wu, Furu Wei, Shaohan Huang, Yunli Wang, Zhoujun Li, and Ming Zhou. 2019. Response generation by context-aware prototype editing. Proceedings of the AAAI Conference on Artificial Intelligence, 33.
- Xu et al. (2018) Peng Xu, Andrea Madotto, Chien-Sheng Wu, Ji Ho Park, and Pascale Fung. 2018. Emo2Vec: Learning generalized emotion representation by multi-task training. In Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis. Association for Computational Linguistics.
- Yavuz et al. (2019) Semih Yavuz, Abhinav Rastogi, Guan-Lin Chao, and Dilek Hakkani-Tur. 2019. DeepCopy: Grounded response generation with hierarchical pointer networks. In Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue. Association for Computational Linguistics.
- Zemlyanskiy and Sha (2018) Yury Zemlyanskiy and Fei Sha. 2018. Aiming to know you better perhaps makes me a more engaging dialogue partner. In Proceedings of the 22nd Conference on Computational Natural Language Learning. Association for Computational Linguistics.
- Zhang et al. (2018) Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics.
- Zhou et al. (2018) Hao Zhou, Minlie Huang, T. Zhang, Xiaoyan Zhu, and Bing Liu. 2018. Emotional chatting machine: Emotional conversation generation with internal and external memory. In AAAI.
- Zhou and Wang (2018) Xianda Zhou and William Yang Wang. 2018. MojiTalk: Generating emotional responses at scale. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia. Association for Computational Linguistics.
Appendix A Dataset Split Details
We split the 32 provided emotion annotations according to their sentiment polarity, as illustrated in Table 4.
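The grouping amounts to a lookup from emotion label to sentiment polarity. The sketch below shows the idea on a small, deliberately unambiguous subset of the emotion labels; the subset, the set names and the function name are illustrative assumptions, and the full 15/17 assignment is the one given in Table 4.

```python
# Illustrative subset only: the full 15/17 assignment is the one given in Table 4.
POSITIVE = {"joyful", "excited", "proud", "grateful"}
NEGATIVE = {"sad", "angry", "afraid", "lonely"}

def emotion_to_sentiment(emotion):
    """Map an emotion label to a binary sentiment label."""
    if emotion in POSITIVE:
        return "positive"
    if emotion in NEGATIVE:
        return "negative"
    raise KeyError(f"no polarity assigned to emotion: {emotion}")

labels = [emotion_to_sentiment(e) for e in ["proud", "lonely", "grateful"]]
```

Raising an error for unmapped labels, rather than defaulting to one polarity, is a design choice of this sketch that keeps an incomplete mapping from silently mislabeling dialogues.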
Appendix B Implementation & Training Details
We use the T5-base model from the HuggingFace library, which has 12 layers, 768 hidden-states, 3072 feed-forward hidden-states and 12 heads. We also use a 300-dimensional space for the sentiment representations obtained from the 2-layer classifier. Finally, the baseline and the EmpBot model have M and M parameters respectively.
During training, we set to and to . After experimenting with various empirically selected value pairs for these two parameters, we found that the selected values yield the best PPL on the validation set by a small margin. We use the Adam optimizer, setting the learning rate equal to and the weight decay to . We also use a batch size of 4. All hyperparameters were manually tuned and the set with the best validation perplexity was chosen. All models were trained on a single Tesla K80 GPU provided by Google Colab.
During inference, we use the top-p (nucleus) sampling method (Holtzman et al., 2020) with top-k filtering (Fan et al., 2018a), setting the threshold probability p to 0.9 and k to 10. We also apply a length penalty of 0.6 and set the maximum length of the generated response to 40.
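As an illustration of the decoding setup, the sketch below applies top-k filtering followed by the nucleus (top-p) cut to a single next-token distribution and then samples from what remains. It is a minimal sketch under stated assumptions: applying top-k before top-p is an assumed ordering, the function names and the toy distribution are hypothetical, and the length penalty and maximum length are not modeled.

```python
import random

def top_k_top_p_filter(probs, k=10, p=0.9):
    """Keep the k most likely tokens, then the smallest prefix whose mass reaches p."""
    ranked = sorted(enumerate(probs), key=lambda item: item[1], reverse=True)[:k]
    kept, mass = [], 0.0
    for token_id, prob in ranked:
        kept.append((token_id, prob))
        mass += prob
        if mass >= p:
            break
    # Renormalize the surviving probabilities so they sum to one.
    return {token_id: prob / mass for token_id, prob in kept}

def sample(filtered, rng):
    ids = list(filtered)
    return rng.choices(ids, weights=[filtered[i] for i in ids], k=1)[0]

# Toy next-token distribution over a 5-token vocabulary (hypothetical values).
dist = [0.5, 0.3, 0.15, 0.04, 0.01]
filtered = top_k_top_p_filter(dist, k=10, p=0.9)
token = sample(filtered, random.Random(0))
```

With this toy distribution the three most likely tokens already cover more than 0.9 of the probability mass, so the two unlikely tokens are never sampled.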
Appendix C Human Evaluation Details
The human evaluation study was completed by proficient English speakers, who volunteered in response to a request we posted on our university’s and research institute’s communication channels. All tests were blind: participants could not tell which model had produced each of the responses they evaluated. For the A/B testing, participants were asked to select the best generated response in 3 sub-tasks (according to Relevance and Fluency, Empathy given the emotion of the context, and Empathy given the sentiment of the context). For the rating tests, participants were asked to rate each model independently (2 sub-tasks) on a 1-5 Likert scale in terms of Empathy, Relevance and Fluency. The following clarifications were given for each metric: “Relevance evaluates whether the generated response is on-topic with the dialogue context”, “Fluency measures the grammatical correctness and readability of the generated response” and “Empathy measures whether the generated response shows the understanding of the speaker’s feelings”.
During the evaluation study, each participant was presented with 7 conversations, randomly sampled from the whole test set (2547 conversations). Therefore, participants were presented with 343 unique conversations in total during the first phase (49 participants) and 105 during the second phase (15 participants).
To test the statistical significance of the results, we used the binomial test for the A/B testing sub-tasks, as in Shuster et al. (2020), and the non-parametric Mann-Whitney U test (Nachar, 2008) for the rating sub-tasks, as it is more robust than the t-test for Mean Opinion Score (MOS) tests (Rosenberg and Ramabhadran, 2017).
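For the A/B sub-tasks, an exact binomial test against chance can be written directly with integer arithmetic. The sketch below is illustrative only: the two-sided form, the null win probability of 0.5 and the function name are assumptions, and the count of 196 wins is derived by assuming that the 57.14% win rate in Table 2 is a fraction of the 343 first-phase conversations.

```python
from math import comb

def binomial_test_two_sided(wins, trials):
    """Exact two-sided binomial test against chance (p = 0.5)."""
    k = max(wins, trials - wins)  # the null distribution is symmetric
    upper_tail = sum(comb(trials, i) for i in range(k, trials + 1))
    return min(1.0, 2 * upper_tail / 2 ** trials)

# Assumed counts: 196 wins out of 343 comparisons is about 57.14%.
p_value = binomial_test_two_sided(196, 343)
```

Because the null distribution is symmetric at 0.5, doubling the upper tail gives the two-sided p-value; for a different null probability the two tails would have to be summed separately.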
Appendix D Additional Results
A full comparison, based on automatic evaluation, between the baseline, the EmpBot model and other existing approaches is presented in Table 5.
|Vaswani Full Transformer||(Rashkin et al., 2019)|
|EmoPrepend-1||(Rashkin et al., 2019)|
|TopicPrepend-1||(Rashkin et al., 2019)|
|CAiRE||(Lin et al., 2019)|
|GPT2-prepend||(Santhanam and Shaikh, 2019)|
|MoEL||-||(Lin et al., 2019)|
|MIME||-||(Majumder et al., 2020)|
|HappyBot||-||(Shin et al., 2019)|
|BST Generative||-||(Roller et al., 2021)|
|DD MT (SOTA)||(Shuster et al., 2020)|
|DD MT+FT (SOTA)||(Shuster et al., 2020)|
Appendix E Example Outputs
We provide some examples of the responses generated by the EmpBot, the baseline and the DD MT+FT models in Table 6.
|Speaker: I felt so let down by a person who promised me a project – but then pulled it away at the last minute in a way that was very personal and reactionary.|
|Target: That’s terrible. People need to learn to commit to their promises.|
|EmpBot: That’s not good. I’m sorry you felt that way.|
|baseline: That’s terrible. What was the project?|
|DD MT+FT: what was the project about ?|
|Speaker: I had to give a presentation in front of my entire University class last semester. I hate presenting things.|
|Target: oh, that is nerve wracking. What did you try to do to cope?|
|EmpBot: That’s a real shame. I’m sure you did a great job.|
|baseline: I’m sorry to hear that.|
|DD MT+FT: how did it go ?|
|Speaker: I am going to my local amusement park tomorrow and feeling a certain way about|
|Target: Which way are you feeling? Are you a thrill seeker?|
|EmpBot: That sounds like fun! I love going to the amusement park!|
|baseline: I’m sorry to hear that.|
|DD MT+FT: what amusement park are you going to ?|
|Speaker: I felt very isolated when we first moved to Hong Kong – I had a young baby and my husband pretty much needed to start traveling for business right away!|
|Target: Wow, I can’t imagine how you felt living in Hong Kong. That would take some|
|EmpBot: I’m sorry to hear that. Is he still in Hong Kong?|
|baseline: That’s a bummer. I’m sorry to hear that.|
|DD MT+FT: i ’ m sorry to hear that . did you have a lot of fun ?|