
CAiRE: An End-to-End Empathetic Chatbot

In this paper, we present CAiRE, an end-to-end empathetic conversational agent. Our system adapts the TransferTransfo (Wolf et al., 2019) learning approach, which fine-tunes a large-scale pre-trained language model with multi-task objectives: response language modeling, response prediction and dialogue emotion detection. We evaluate our model on the recently proposed empathetic-dialogues dataset (Rashkin et al., 2019). The experimental results show that CAiRE achieves state-of-the-art performance on dialogue emotion detection and empathetic response generation.



1 Introduction

Empathetic chatbots are conversational agents that can understand users’ emotions and reply appropriately. Being empathetic is an essential step toward human-like conversation. In the early development of conversational systems such as ELIZA Weizenbaum et al. (1966), PARRY Colby et al. (1971) and ALICE AbuShawar and Atwell (2015), most effort went into hand-crafted features and rules. Recently, a modularized system, XiaoIce Zhou et al. (2018), achieved an impressive number of conversational turns, higher than the average for normal human conversations. Despite its promising success, this system is designed with a complex architecture of hundreds of independent modules, including Natural Language Understanding and Response Generation modules, trained on a tremendous amount of labeled data.

In contrast to modularized dialogue systems, an end-to-end dialogue system learns all components in a single model in an entirely data-driven way, and it mitigates the lack of labeled data by sharing representations among different modules. Incorporating empathy into the dialogue system is essential to achieve human-like conversations because humans naturally express and perceive emotion in natural language to increase their sense of social bonding. In practice, a multi-task training strategy with an additional objective function that optimizes emotion label prediction for the conversation can produce more emotion-evoking responses Rashkin et al. (2019).

Emotion: Joyful
Situation: Speaker felt this when … “I have had a great week!”
Speaker: I have had a great start to my week!
Listener: That’s great. Do you think the rest of the week will be as great?
Speaker: I hope so! It looks promising!!
Listener: Lucky you. Are you always a positive person or it’s just been an amazing week really?
Speaker: haha. Kind of both. And also probably too much coffee to start my shift tonight.
Table 1: An example of the empathetic dialogue dataset. Two people are discussing a situation that happened to one of them, and that led to the experience of a given feeling.
Figure 1: Fine-tuning schema for empathetic dialogues.

However, data-driven end-to-end empathetic chatbots currently suffer from two limitations: (1) model capacity and (2) the paucity of data for both emotion recognition and empathetic response generation Rashkin et al. (2019). Thanks to the recent success of large pre-trained language models Peters et al. (2018); Devlin et al. (2019), both problems can be mitigated.

In this paper, we extend the TransferTransfo Wolf et al. (2019) learning approach to the empathetic dialogue learning scenario Rashkin et al. (2019), fine-tuning a large-scale pre-trained language model Radford et al. (2018) with an additional dialogue emotion classification objective. The goal is to generate responses that are not only grammatical and coherent but also empathetic given the context of the conversation. Our experimental results show that the model trained with this strategy outperforms existing models on the Empathetic Dialogues dataset in terms of response perplexity and BLEU score.

2 Related Work

Detecting sentiment and emotion Felbo et al. (2017); Xu et al. (2018) has been shown to be indispensable for creating empathetic chatbots Fung et al. (2016); Bertero et al. (2016); Winata et al. (2017). Recently, Zhou et al. (2017); Hu et al. (2017); Wang and Wan (2018) introduced frameworks to control the sentiment and emotion of the generated response, while Zhou and Wang (2018) introduced a new Twitter conversation dataset and proposed leveraging the emoji labels of Twitter data to generate emotional responses. In addition, Rashkin et al. (2019) proposed a new benchmark for empathetic dialogue generation, in which each dialogue is grounded in a situation prompted by a specific emotion label. Meanwhile, personalized dialogue agents Li et al. (2016); Zhang et al. (2018a); Madotto et al. (2019) have been considered a way to make conversations more consistent and engaging.

Previous work Peters et al. (2018); Radford et al. (2018); Devlin et al. (2019) showed that leveraging a large amount of data to learn context-sensitive features from a language model can create state-of-the-art models for a wide range of tasks. Taking this further, Radford et al. (2019); Yang et al. (2019) deployed higher capacity models and improved the state-of-the-art results. In this paper, we build the empathetic chatbot based on the pre-trained language model and achieve state-of-the-art results on dialogue emotion detection and empathetic response generation.

Models                                PPL     AVG BLEU   EMO ACC
Pretrained Rashkin et al. (2019)      27.96   5.01       -
Fine-Tuned Rashkin et al. (2019)      21.24   6.27       -
MULTITASK Rashkin et al. (2019)       24.07   5.42       -
EmoPrepend-1 Rashkin et al. (2019)    24.30   4.36       -
ENSEM-DM Rashkin et al. (2019)        19.05   6.83       -
CAiRE                                 13.32   7.03       0.516
Table 2: Comparison of different automatic metrics between models. CAiRE outperforms state-of-the-art models.

3 Methodology

3.1 Language Model Pre-training

We use the Generative Pre-trained Transformer (GPT) Radford et al. (2018) as our pre-trained language model. GPT is a multi-layer Transformer decoder with causal self-attention, pre-trained without supervision on the BooksCorpus dataset Zhu et al. (2015). BooksCorpus contains over 7,000 unique unpublished books from a variety of genres. Pre-training on such a large contiguous text corpus enables the model to capture long-range dialogue context information.
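To make the phrase "causal self-attention" concrete, the following is a minimal single-head sketch in plain Python. It is an illustration only, not GPT's implementation: the function name `causal_attention` is hypothetical, there are no learned projections or multiple heads, and the inputs are toy vectors. The point is that position i only attends to positions 0..i.

```python
import math


def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    Each argument is a list of equal-length float vectors, one per position.
    Position i may only attend to positions 0..i.
    """
    dim = len(queries[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        # Score the query against the current and earlier positions only;
        # skipping later positions is what makes the attention causal.
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(dim)
            for j in range(i + 1)
        ]
        # Numerically stable softmax over the visible positions.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        # Future positions receive exactly zero weight.
        weights = [e / total for e in exps] + [0.0] * (len(queries) - i - 1)
        out = [
            sum(w * values[j][c] for j, w in enumerate(weights))
            for c in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights
```

Because each position sees only its left context, the same network can be trained as a next-token language model and later used to generate a reply token by token.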

3.2 Persona Dialogue Pre-training

As the existing empathetic dialogue dataset Rashkin et al. (2019) is relatively small, fine-tuning only on this dataset would limit the chitchat topics of the model. To enhance the chitchat capability of CAiRE, we first pre-train the model on PersonaChat Zhang et al. (2018b), following the transfer learning strategy of Wolf et al. (2019). This pre-training procedure endows CAiRE with a persona, thus improving the engagement and consistency of the model. We refer interested readers to the repository recently published by HuggingFace.

3.3 Empathetic Dialogue Fine-tuning

To optimize the empathy of CAiRE, we fine-tune the pre-trained model on the empathetic dialogue dataset Rashkin et al. (2019) with a custom persona and three objectives: response language modeling, response prediction and dialogue emotion detection.

Empathetic Dialogue Dataset

Rashkin et al. (2019) introduced a new empathetic dialogue dataset of 25k open-domain one-on-one conversations based on emotional scenarios triggered by specific emotion labels. The dataset provides 32 emotion labels, the distribution of which is close to even. Table 1 shows an example from the training set. The speakers talk about their situation, and the listeners try to understand their feelings and reply accordingly. At training time, the emotion labels of the speakers are given, while at test time we hide the labels to evaluate the empathy of our model.

Fine-tuning Detail

The whole fine-tuning schema for empathetic dialogues is illustrated in Figure 1. To fully leverage the pre-training on PersonaChat, we customize the persona of CAiRE with two sentences: “my name is caire” and “i like to help people”.

Following the fine-tuning schema of Wolf et al. (2019), we first concatenate the custom persona, dialogue history and response (distractor) with special separator tokens, and represent the whole input as the sum of trainable positional embeddings, word embeddings and dialogue state embeddings. Positional embeddings and word embeddings are required for Transformer input, while dialogue state embeddings are added to help CAiRE model the hierarchical dialogue structure and distinguish persona sentences, dialogue context and response. The input representation is fed into the causal attention Transformer decoder to obtain the contextualized representation. Here we denote the contextualized representation of the last special token as $SEN$ and that of the special token before the reply (distractor) as $EMO$.
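The input construction described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: the special token strings, the three dialogue-state labels, whitespace tokenization, and the helper names `build_input` and `embed` are all hypothetical, and the dialogue history is assumed to end with a speaker turn.

```python
# Hypothetical special tokens and dialogue-state labels (assumptions).
BOS, EOS = "<bos>", "<eos>"
PERSONA, SPEAKER, LISTENER = "<persona>", "<speaker>", "<listener>"


def build_input(persona, history, reply):
    """Concatenate persona, dialogue history and a candidate reply.

    Returns tokens, per-token dialogue states, positions, and the indices
    of the special token before the reply and of the last special token.
    """
    tokens, states = [BOS], [PERSONA]
    for sentence in persona:
        words = sentence.split()
        tokens += words
        states += [PERSONA] * len(words)
    # Turns alternate; the last history turn is assumed to be the speaker's.
    for turn, utterance in enumerate(history):
        turns_from_end = len(history) - 1 - turn
        role = SPEAKER if turns_from_end % 2 == 0 else LISTENER
        words = [role] + utterance.split()
        tokens += words
        states += [role] * len(words)
    # The reply (gold or distractor) starts with its own special token.
    before_reply_index = len(tokens)
    words = [LISTENER] + reply.split() + [EOS]
    tokens += words
    states += [LISTENER] * len(words)
    last_index = len(tokens) - 1
    positions = list(range(len(tokens)))
    return tokens, states, positions, before_reply_index, last_index


def embed(tokens, states, positions, word_emb, state_emb, pos_emb):
    """Sum word, dialogue-state and positional embeddings element-wise."""
    return [
        [w + s + p for w, s, p in zip(word_emb[tok], state_emb[st], pos_emb[pos])]
        for tok, st, pos in zip(tokens, states, positions)
    ]


persona = ["my name is caire", "i like to help people"]
history = ["i have had a great start to my week !"]
tokens, states, positions, before_reply_index, last_index = build_input(
    persona, history, "that is great ."
)
```

The two returned indices mark where a decoder's hidden states would be read out for the emotion classifier (token before the reply) and the response classifier (last token).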

To optimize the response prediction objective, at each training step we sample one distractor from another conversation to contrast with the gold response. The representation of the last special token, $SEN$, is then passed to a linear classifier that identifies the correct response, giving the cross-entropy loss $\mathcal{L}_S$.

To optimize the response language modeling objective, we use each contextualized representation of the gold reply to predict the next reply token, and compute the language modeling loss $\mathcal{L}_L$ using cross-entropy.

To enable CAiRE to detect the conversational partner’s emotion, we add the dialogue emotion detection objective during training. We take $EMO$, the representation of the special token before the reply, as a summary of the current dialogue state and pass it to a linear projection layer to predict scores for the 32 emotions. Cross-entropy is applied to obtain the emotion classification loss $\mathcal{L}_E$.

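Our final fine-tuning loss function is the weighted sum of the aforementioned losses, namely the language modeling loss $\mathcal{L}_L$, the response prediction loss $\mathcal{L}_S$ and the emotion classification loss $\mathcal{L}_E$, with scalar weights $\alpha$, $\beta$ and $\gamma$:

$$\mathcal{L} = \alpha \mathcal{L}_L + \beta \mathcal{L}_S + \gamma \mathcal{L}_E$$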
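As a rough illustration of how the three objectives combine, the sketch below computes each cross-entropy term from raw logits and sums them with weights. It is not the authors' training code: the function names, the default weights of 1.0, and the toy logits are assumptions made for the example, and a real implementation would compute these terms on batched tensors from the model.

```python
import math
import random


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target]


def sample_distractor(all_responses, gold, rng=random):
    """Pick one response other than the gold one to act as the distractor."""
    candidates = [r for r in all_responses if r != gold]
    return rng.choice(candidates)


def multitask_loss(lm_logits, lm_targets, choice_logits, gold_choice,
                   emo_logits, gold_emotion, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of language modeling, response prediction and emotion losses.

    lm_logits: one vocabulary-sized logit list per gold reply token.
    choice_logits: one score per candidate response (gold and distractor).
    emo_logits: one score per emotion label.
    weights: hypothetical loss weights; the values used here are assumptions.
    """
    lm = sum(cross_entropy(l, t) for l, t in zip(lm_logits, lm_targets))
    lm /= len(lm_targets)  # average over reply tokens
    sel = cross_entropy(choice_logits, gold_choice)
    emo = cross_entropy(emo_logits, gold_emotion)
    w_lm, w_sel, w_emo = weights
    return w_lm * lm + w_sel * sel + w_emo * emo, (lm, sel, emo)
```

With uniform logits each term reduces to the log of the number of classes, which gives a quick sanity check on the chance-level loss before any training.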

Figure 2: Dialogue examples with CAiRE under happy (right half) and sad (left half) situations.

4 Experiment and Result

We evaluate our model on the empathetic dialogue dataset against the following baselines:

  • Pretrained: This model is trained with the full Transformer network architecture Vaswani et al. (2017) on 1.7 billion REDDIT conversations.

  • Fine-Tuned: This model fine-tunes the Pretrained model on the empathetic dialogue dataset.

  • MULTITASK: This model is trained by adding another linear layer on top of the encoder of the Transformer to classify the emotion of the dialogue based on the context.

  • EmoPrepend-1: This model prepends the top-1 predicted emotion label to the beginning of the token sequence as encoder input.

  • ENSEM-DM: This model concatenates the encoded representations from the encoder of the Transformer and the representations from the pre-trained emotion classifier and feeds it to the decoder of the Transformer.

We use perplexity (PPL), the average of BLEU-1, BLEU-2, BLEU-3 and BLEU-4 (AVG BLEU), and emotion classification accuracy (EMO ACC) as our evaluation metrics. As shown in Table 2, CAiRE outperforms all the baseline models on all metrics, which demonstrates its strong capacity for modeling empathetic responses and classifying dialogue emotion.
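For readers unfamiliar with these metrics, the sketch below shows how they are commonly computed. The function names are hypothetical, perplexity is assumed to be normalized per token with natural-log probabilities, and the individual BLEU-n scores are taken as given rather than computed here; the paper does not spell out these details.

```python
import math


def perplexity(token_log_probs):
    """Exponential of the mean negative log-likelihood per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def average_bleu(bleu_scores):
    """Arithmetic mean of precomputed BLEU-1 to BLEU-4 scores."""
    return sum(bleu_scores) / len(bleu_scores)


def accuracy(predicted, gold):
    """Fraction of dialogues whose predicted emotion matches the gold label."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)
```

Lower perplexity is better, while higher AVG BLEU and EMO ACC are better, which is the direction in which the comparison in Table 2 should be read.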

4.1 Live Chat with CAiRE

To evaluate the robustness of the model, we deploy a live interface that allows people to chat with CAiRE. Figure 2 shows two chat histories between a user and CAiRE. CAiRE can detect the user's emotion (sad in the left example and happy in the right example) from the dialogue context and respond appropriately by conditioning on the multi-turn information.

5 Conclusion

In this paper, we introduce CAiRE, an end-to-end empathetic chatbot. Our system fine-tunes a large-scale pre-trained language model with three multi-task objectives: response language modeling, response prediction and dialogue emotion detection. The evaluation on the empathetic dialogue dataset shows that it achieves state-of-the-art performance on detecting dialogue emotion and generating empathetic responses. An online demo is built for live chat.


References

  • AbuShawar and Atwell (2015) Bayan AbuShawar and Eric Atwell. 2015. Alice chatbot: Trials and outputs. Computación y Sistemas, 19(4):625–632.
  • Bertero et al. (2016) Dario Bertero, Farhad Bin Siddique, Chien-Sheng Wu, Yan Wan, Ricky Ho Yin Chan, and Pascale Fung. 2016. Real-time speech emotion and sentiment recognition for interactive dialogue systems. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1042–1047.
  • Colby et al. (1971) Kenneth Mark Colby, Sylvia Weber, and Franklin Dennis Hilf. 1971. Artificial paranoia. Artificial Intelligence, 2(1):1–25.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.
  • Felbo et al. (2017) Bjarke Felbo, Alan Mislove, Anders Søgaard, Iyad Rahwan, and Sune Lehmann. 2017. Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm. In Conference on Empirical Methods in Natural Language Processing (EMNLP).
  • Fung et al. (2016) Pascale Fung, Anik Dey, Farhad Bin Siddique, Ruixi Lin, Yang Yang, Yan Wan, and Ho Yin Ricky Chan. 2016. Zara the Supergirl: An empathetic personality recognition system. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pages 87–91, San Diego, California. Association for Computational Linguistics.
  • Hu et al. (2017) Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P Xing. 2017. Toward controlled generation of text. In International Conference on Machine Learning, pages 1587–1596.
  • Li et al. (2016) Jiwei Li, Michel Galley, Chris Brockett, Georgios Spithourakis, Jianfeng Gao, and Bill Dolan. 2016. A persona-based neural conversation model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 994–1003, Berlin, Germany. Association for Computational Linguistics.
  • Madotto et al. (2019) Andrea Madotto, Zhaojiang Lin, Chien-Sheng Wu, and Pascale Fung. 2019. Personalizing dialogue agents via meta-learning. In Proceedings of the 57th Conference of the Association for Computational Linguistics, pages 5454–5459, Florence, Italy. Association for Computational Linguistics.
  • Peters et al. (2018) Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237.
  • Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf.
  • Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8).
  • Rashkin et al. (2019) Hannah Rashkin, Eric Michael Smith, Margaret Li, and Y-Lan Boureau. 2019. Towards empathetic open-domain conversation models: A new benchmark and dataset. In Proceedings of the 57th Conference of the Association for Computational Linguistics, pages 5370–5381, Florence, Italy. Association for Computational Linguistics.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.
  • Wang and Wan (2018) Ke Wang and Xiaojun Wan. 2018. Sentigan: Generating sentimental texts via mixture adversarial networks. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pages 4446–4452. International Joint Conferences on Artificial Intelligence Organization.
  • Weizenbaum et al. (1966) Joseph Weizenbaum et al. 1966. Eliza—a computer program for the study of natural language communication between man and machine. Communications of the ACM, 9(1):36–45.
  • Winata et al. (2017) Genta Indra Winata, Onno Kampman, Yang Yang, Anik Dey, and Pascale Fung. 2017. Nora the empathetic psychologist. Proc. Interspeech 2017, pages 3437–3438.
  • Wolf et al. (2019) Thomas Wolf, Victor Sanh, Julien Chaumond, and Clement Delangue. 2019. TransferTransfo: A transfer learning approach for neural network based conversational agents. arXiv preprint arXiv:1901.08149.
  • Xu et al. (2018) Peng Xu, Andrea Madotto, Chien-Sheng Wu, Ji Ho Park, and Pascale Fung. 2018. Emo2vec: Learning generalized emotion representation by multi-task training. Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis.
  • Yang et al. (2019) Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding.
  • Zhang et al. (2018a) Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018a. Personalizing dialogue agents: I have a dog, do you have pets too? Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
  • Zhang et al. (2018b) Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018b. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2204–2213. Association for Computational Linguistics.
  • Zhou et al. (2017) Hao Zhou, Minlie Huang, Tianyang Zhang, Xiaoyan Zhu, and Bing Liu. 2017. Emotional chatting machine: Emotional conversation generation with internal and external memory.
  • Zhou et al. (2018) Li Zhou, Jianfeng Gao, Di Li, and Heung-Yeung Shum. 2018. The design and implementation of XiaoIce, an empathetic social chatbot. arXiv preprint arXiv:1812.08989.
  • Zhou and Wang (2018) Xianda Zhou and William Yang Wang. 2018. MojiTalk: Generating emotional responses at scale. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1128–1137, Melbourne, Australia. Association for Computational Linguistics.
  • Zhu et al. (2015) Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pages 19–27.