EmpBot: A T5-based Empathetic Chatbot focusing on Sentiments

In this paper, we introduce EmpBot: an end-to-end empathetic chatbot. Empathetic conversational agents should not only understand what is being discussed, but also acknowledge the implied feelings of the conversation partner and respond appropriately. To this end, we propose a method based on a transformer pretrained language model (T5). Specifically, during finetuning we propose to use three objectives: response language modeling, sentiment understanding, and empathy forcing. The first objective is crucial for generating relevant and coherent responses, while the latter two are important for acknowledging the sentiment state of the conversational partner and for favoring empathetic responses. We evaluate our model on the EmpatheticDialogues dataset using both automated metrics and human evaluation. As human evaluation against the current state-of-the-art indicates, the inclusion of the sentiment understanding and empathy forcing auxiliary losses favors empathetic responses.




1 Introduction

Since dialogue is regarded as a fundamental and complex element of human cognition (Jurafsky and Martin, 2000), the development of systems capable of understanding human language and communicating with humans can have a significant impact. However, human communication also requires acknowledging and exchanging the conversational partner's emotions, as emotions play an important role in developing a confidential relationship between the speaker and the listener.
Open domain conversational agents have been widely studied in the past years, and both retrieval-based and generation-based approaches (Wu et al., 2019; Cai et al., 2019; Weston et al., 2018) have been developed. However, prior research has shown that most of those conversational agents are unable to imitate dialogues between humans, as the produced responses are generic and short (Vinyals and Le, 2015; Li et al., 2016b). Several efforts have been made to make the conversation more engaging by keeping track of the conversational context (Sordoni et al., 2015b, a; Serban et al., 2016, 2017) or by producing more diverse responses (Li et al., 2016a, c). Subsequently, a recent trend followed by various researchers (Li et al., 2016b; Zhang et al., 2018; Kulikov et al., 2019; Joshi et al., 2017; Zemlyanskiy and Sha, 2018; Mazaré et al., 2018; Dinan et al., 2020; Madotto et al., 2019; Hancock et al., 2019; Yavuz et al., 2019; Wolf et al., 2019) was to produce personalized responses by conditioning generation on a persona profile, in order to make the responses more coherent and consistent throughout the dialogue.
Apart from understanding what is being discussed, a conversational agent should also acknowledge the emotional state of the conversational partner, as it is a significant part of human communication. Many researchers have focused on detecting emotion (Fan et al., 2018b; Xu et al., 2018; Winata et al., 2017, 2019) and empathy in dialogue systems (Bertero et al., 2016; Chatterjee et al., 2019). Zhou et al., 2018 introduced a seq2seq (Sutskever et al., 2014) Emotional Chatting Machine to generate responses with high emotional content, using emotion embeddings and an internal and external memory mechanism. A GAN-based (Goodfellow et al., 2014) framework that controls the sentiment of the generated response was also proposed by Wang and Wan, 2018. Wu and Wu, 2019 similarly used a dual decoder to generate emotional responses, given the sentiment. Zhou and Wang, 2018 introduced a Twitter dataset that uses the emojis of Twitter posts as emotion labels, and also proposed a seq2seq model to generate emotional responses. Lubis et al., 2018 introduced a new dataset and proposed a hierarchical seq2seq response generator for affect-sensitive dialogue generation. Rashkin et al., 2019 introduced the EmpatheticDialogues dataset and trained baselines to generate empathetic responses while simultaneously predicting the corresponding emotion of the dialogue context. Later, Lin et al., 2019 introduced the "Mixture of Empathetic Listeners" framework, improving on the initial baselines. Santhanam and Shaikh, 2019 finetuned the GPT2 (Radford et al., 2019) model to further improve the results, while Shin et al., 2019 used reinforcement learning to predict the user's sentiment look-ahead alongside response generation.

Lin et al., 2019 improved the performance on EmpatheticDialogues by finetuning the GPT2 model with the use of multitask learning, while Majumder et al., 2020 followed a different approach introducing stochasticity into the emotion mixture and arguing that empathetic responses do not always mirror the emotion of the user. Significant improvements were also made by Roller et al., 2021 and Shuster et al., 2020 who used multi-task training on multiple dialog tasks, achieving state-of-the-art results.
In this work, in order to enforce empathetic response generation, we propose a method based on a transformer pretrained language model (T5). Specifically, during finetuning we use three objectives: response language modeling, sentiment understanding, and empathy forcing. The sentiment understanding objective is crucial for tracking and acknowledging the emotional state of the conversational partner, while the empathy forcing objective favors empathetic response generation by penalizing responses whose sentiment is opposite to that of the conversational partner. Our key contribution is the inclusion of the sentiment understanding and empathy forcing auxiliary losses to promote empathetic behavior. The proposed approach, EmpBot,[1] is on par with the state-of-the-art in terms of BLEU score. However, our model produces significantly more fluent and empathetic responses, as indicated by human evaluation results.

[1] The implementation will be publicly available after the anonymity period is over.

2 Proposed Method

Our approach is based on the assumption that an empathetic conversational agent should mirror the emotion of the speaker (Carr et al., 2003). Following this perspective, we introduce EmpBot, a model that favors sentiment understanding and empathetic response generation using the sentiment of each dialogue context. EmpBot is based on the Unified Text-to-Text Transformer (T5) (Raffel et al., 2020), a transformer-based (Vaswani et al., 2017) pretrained seq2seq network, which we extend with a 2-layer sentiment classifier and auxiliary losses during training, in order to apply sentiment understanding and enforce empathetic response generation. The model is illustrated in Figure 1.


Figure 1: Illustration of the EmpBot model. The contextualized sentiment representations s_c and s_r are used for calculating the sentiment understanding and the empathy forcing auxiliary losses.

EmpBot model: the EmpBot model uses the encoded contextualized representations of the dialogue context and the response to produce the corresponding sentiment representations via the sentiment classifier, denoted by s_c and s_r respectively. It is finetuned on the EmpatheticDialogues dataset using three objectives: response language modeling, sentiment understanding, and empathy forcing.
Response language modeling: to optimize the response language modeling objective, we use the contextualized representation of the gold response and apply language modeling by predicting the reply tokens with the cross-entropy loss. We denote that loss as L_LM.
Sentiment understanding: to optimize the sentiment understanding objective, we pass the contextualized representation of the dialogue context through the 2-layer sentiment classifier and apply sentiment classification using the cross-entropy loss. We denote that loss as L_SU. In this way, the model learns to predict the sentiment state of the dialogue, specifically that of the speaker, using the sentiment labels we created.
Empathy forcing: to enforce empathetic behavior, we enhance the model with a cosine similarity embedding loss. Specifically, we use the contextualized sentiment representations obtained from the first layer of the sentiment classifier, for both the dialogue context and the response. The model is penalized when the sentiment representation of the generated response differs from that of the dialogue context, not only favoring sentiment understanding, but also promoting empathetic response generation. The aforementioned loss is:

L_EF = 1 - cos(s_c, s_r)

where s_c and s_r are the contextualized sentiment representations of the dialogue context and the response, respectively, and cos(·, ·) is the cosine similarity function. Our final fine-tuning loss function is the weighted sum of the aforementioned losses:

L = L_LM + α L_SU + β L_EF

where α and β are constants.[2]

[2] For more details about the hyperparameter tuning, see Appendix B.
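To make the objectives concrete, the sketch below shows how the empathy forcing term and the weighted total loss could be computed. This is a minimal NumPy illustration, assuming the common 1 - cos(s_c, s_r) form of the cosine embedding loss; it is not the authors' implementation, and the 300-dimensional toy vectors are illustrative.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def empathy_forcing_loss(s_ctx, s_resp):
    """Cosine-embedding-style loss (assumed 1 - cos form): penalizes
    response sentiment representations that point away from the context's."""
    return 1.0 - cosine_similarity(s_ctx, s_resp)

def total_loss(l_lm, l_su, l_ef, alpha, beta):
    """Weighted sum of the three fine-tuning objectives."""
    return l_lm + alpha * l_su + beta * l_ef

# Toy 300-d sentiment representations for context and response.
rng = np.random.default_rng(0)
s_ctx = rng.normal(size=300)
s_resp = s_ctx + 0.1 * rng.normal(size=300)  # similar sentiment -> small loss
l_ef = empathy_forcing_loss(s_ctx, s_resp)
```

In a real training loop, l_lm and l_su would be the cross-entropy losses from the decoder and the sentiment classifier, while s_ctx and s_resp would come from the first layer of the classifier.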

3 Experimental Setup

3.1 Dataset

We conduct our experiments on the EmpatheticDialogues dataset (Rashkin et al., 2019), a dataset consisting of approximately 25k one-on-one open-domain conversations, each grounded in a situation and a relevant emotion. For all experiments, we use the official 8:1:1 train/validation/test split defined by the authors. We group the provided emotions of each dialogue into two groups according to their sentiment polarity: 15 emotions are grouped as positive and 17 as negative.[3]

[3] For more details about the split, see Appendix A.
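The polarity grouping can be expressed as a simple lookup. The lists below contain only the emotion labels recoverable from Table 4 (Appendix A), so this is an illustrative, incomplete sketch rather than the full 32-label split:

```python
# Partial polarity split of the EmpatheticDialogues emotion labels,
# as recoverable from Table 4 (the full split has 15 positive / 17 negative).
POSITIVE = {"surprised", "excited", "proud", "grateful", "impressed",
            "hopeful", "confident", "joyful", "content", "caring",
            "trusting", "faithful", "prepared", "sentimental"}
NEGATIVE = {"angry", "sad", "afraid", "terrified", "furious", "anxious",
            "nostalgic", "disappointed", "jealous", "devastated",
            "embarrassed", "ashamed"}

def sentiment_polarity(emotion: str) -> str:
    """Map an EmpatheticDialogues emotion label to a binary sentiment label."""
    if emotion in POSITIVE:
        return "positive"
    if emotion in NEGATIVE:
        return "negative"
    raise KeyError(f"emotion label not in the (partial) mapping: {emotion}")
```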

3.2 Models

DD MT: Multitask DodecaDialogue model, proposed in Shuster et al. (2020).

DD MT+FT: Multitask DodecaDialogue model finetuned on EmpatheticDialogues dataset, proposed in Shuster et al. (2020).

Baseline: T5 model finetuned on EmpatheticDialogues for response generation.

EmpBot: Our proposed T5-based model, finetuned on EmpatheticDialogues using the proposed loss. Further details on the implementation and the training and testing procedures are provided in Appendix B.

3.3 Evaluation Protocol

We evaluate our models using both automatic and human evaluation. Although automated metrics can measure both the ability of the model to reproduce the listener’s response and the diversity of the responses, they do not always correlate with human judgements of dialogue quality (Liu et al., 2016). Nevertheless, we report both automatic metrics and human evaluation scores.

BST Generative - (Roller et al., 2021)
DD MT (SOTA) (Shuster et al., 2020)
DD MT+FT (SOTA) (Shuster et al., 2020)
Baseline -
EmpBot -
Table 1: Test performance on automated metrics for the current state-of-the-art approaches, our baseline, and the EmpBot model; some results were reported only on the validation set.
                        R&F              Emp. (sentiment)  Emp. (emotion)
                        Win | Loss       Win | Loss        Win | Loss
EmpBot vs DD MT+FT      57.14% | 42.86%  57.73% | 42.27%   56.56% | 43.44%
EmpBot vs Baseline      63.81% | 36.19%  65.71% | 34.29%   59.05% | 40.95%
DD MT+FT vs Baseline    59.05% | 40.95%  60.95% | 39.05%   60.00% | 40.00%
Table 2: Results of human A/B testing for each sub-task: Relevance and Fluency (R&F), Empathy given the speaker's sentiment, and Empathy given the speaker's emotion. All results are statistically significant with p<0.05 using the binomial test.

Model Name   Relevance  Fluency  Empathy
DD MT+FT     3.42       3.96*    3.33*
EmpBot       3.48       4.21*    3.56*

Table 3: Absolute scores of the human rating test for Relevance, Fluency, and Empathy. Results noted with * are statistically significant with p<0.05 using the Mann-Whitney U test.

Automated metrics: We report the perplexity (PPL) of the actual (gold) response, as in Wen et al., 2015; Li et al., 2016a, b. Moreover, we report BLEU scores (Papineni et al., 2002) between the model responses and the gold responses.
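As a reminder of how the reported PPL relates to the language modeling loss, perplexity is the exponential of the average per-token negative log-likelihood of the gold response under the model; a minimal sketch:

```python
import math

def perplexity(token_nlls):
    """Corpus perplexity from per-token negative log-likelihoods (in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# Sanity check: a model that is uniform over a vocabulary of size V assigns
# each token probability 1/V, so its perplexity is exactly V.
V = 1000
uniform_nlls = [math.log(V)] * 50
```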

Human evaluation: In order to measure the quality of the generated responses, we conduct human evaluation through an online survey. The human evaluation process is split into two phases. In the first phase, we compare the EmpBot model with the current state-of-the-art DD MT+FT (Shuster et al., 2020). Participants were asked to do a pairwise comparison between the generated responses of the aforementioned models according to: (i) Relevance and Fluency given the dialogue context, (ii) Empathy given the dialogue context and the speaker's sentiment, and (iii) Empathy given the dialogue context and the speaker's emotion (the three column pairs of Table 2). Moreover, participants were also asked to rate each generated response of each model on the three following aspects, given the dialogue context: Relevance, Fluency, and Empathy, using a 1-5 Likert scale, where 5 is the best score. So, participants had to complete 3 A/B testing sub-tasks in order to directly compare the models and 2 rating sub-tasks for an indirect comparison.
In the second phase, we compare our EmpBot model against our baseline, and the state-of-the-art DD MT+FT model against our baseline. In that phase, participants were asked to compare the generated responses using the same format as the three A/B testing sub-tasks of the first phase.[4]

[4] For details about the human evaluation, see Appendix C.

4 Results

Evaluation results and a comparison with other models are presented in Table 1. The human evaluation results are shown in Tables 2 and 3. The DD MT+FT model still maintains the state-of-the-art performance in perplexity, with EmpBot achieving a somewhat lower performance (8.5% difference). However, both our baseline and the EmpBot model outperform the current state-of-the-art model in terms of average BLEU score, achieving scores of 8.89 and 8.84, respectively. Consequently, our empathetic approach (the EmpBot model) improves on the state-of-the-art BLEU score achieved by DD MT by 5.2%. We also notice that our baseline performs slightly better than EmpBot on the BLEU metric, but the difference is not significant.
However, as the usefulness of the BLEU score has been questioned, we turn to human evaluation for a more reliable measure of quality. Regarding the human evaluation results, the EmpBot model outperforms both the DD MT+FT and our baseline, achieving significantly better results in both the A/B and rating tests, as shown in Tables 2 and 3 respectively. More specifically, in Table 3 we notice a significant difference in the Fluency and Empathy scores between EmpBot and DD MT+FT, which shows that our approach is not only more empathetic, but also generates more fluent responses. Regarding the absolute Relevance score, there is no significant difference. In addition, we note that according to the A/B test shown in Table 2, the DD MT+FT model performs better than our baseline in all sub-tasks. We provide examples of the generated responses in Appendix E.

5 Conclusions

In this work we propose EmpBot, a T5-based chatbot augmented with a novel finetuning procedure for generating empathetic dialogue responses. The proposed loss consists of three parts: an LM loss that produces valid textual responses, a sentiment classification loss that introduces emotional awareness to the model, and an empathy forcing loss that ensures the responses are emotionally relevant. We evaluate EmpBot using standard evaluation metrics, i.e., perplexity and BLEU score, achieving state-of-the-art results. Our human evaluation results indicate that EmpBot produces more fluent and empathetic responses when compared with both the baseline and the state-of-the-art models. In the future, we want to extend the proposed method to other architectures and explore additional empathy forcing losses using raw emotion values instead of sentiment polarities.


Appendix A Dataset Separation Split Details

We split the 32 provided emotion annotations according to their sentiment polarity, as illustrated in Table 4.

Positive: surprised, excited, proud, grateful, impressed, hopeful, confident, joyful, content, caring, trusting, faithful, prepared, sentimental, …

Negative: angry, sad, afraid, terrified, furious, anxious, nostalgic, disappointed, jealous, devastated, embarrassed, ashamed, …

Table 4: Separation split of the 32 emotions based on their valence.

Appendix B Implementation & Training Details

We use the T5-base model from the HuggingFace library, which has 12 layers, 768-dimensional hidden states, 3072-dimensional feed-forward hidden states, and 12 attention heads. We also use a 300-dimensional space for the sentiment representations obtained from the 2-layer classifier. Finally, the baseline and the EmpBot model have … M and … M parameters, respectively.
During training, we set α to … and β to …. After experimenting with various empirically selected value pairs for the parameters α and β, we found that the selected values yield the best PPL on the validation set. We use the Adam optimizer, setting the learning rate to … and the weight decay to …, with a batch size of 4. All hyperparameters were manually tuned and the set with the best validation perplexity was chosen. All models were trained on a single Tesla K80 GPU provided by Google Colab.
During inference, we use the top-p (nucleus) sampling method (Holtzman et al., 2020) with top-k filtering (Fan et al., 2018a), setting the threshold probability p to 0.9 and k to 10. We also apply a length penalty of 0.6 and set the maximum length of the generated response to 40 tokens.
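A minimal sketch of the combined filtering step (illustrative, not the HuggingFace implementation): keep at most the k most probable tokens, truncate further once their cumulative mass reaches p, and renormalize before sampling:

```python
import numpy as np

def filter_top_p_top_k(probs, p=0.9, k=10):
    """Zero out all but the top-k tokens, then keep the smallest prefix of
    the descending-sorted distribution whose cumulative mass reaches p;
    renormalize the surviving probabilities."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]       # token indices, most probable first
    keep = np.zeros_like(probs, dtype=bool)
    cum = 0.0
    for rank, idx in enumerate(order):
        if rank >= k:                     # top-k cutoff
            break
        keep[idx] = True
        cum += probs[idx]
        if cum >= p:                      # nucleus (top-p) cutoff
            break
    filtered = np.where(keep, probs, 0.0)
    return filtered / filtered.sum()
```

Sampling from the returned distribution (e.g. with `np.random.choice`) then realizes nucleus sampling with top-k filtering.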

Appendix C Human Evaluation Details

The human evaluation study was completed by proficient English speakers, volunteers who responded to a corresponding request we posted on our university's and research institute's communication channels. All tests were blind, and the participants could not tell which model each dialogue response to be evaluated came from. For the A/B testing, participants were asked to select the best generated response according to Relevance and Fluency, Empathy given the emotion of the context, and Empathy given the sentiment of the context (3 sub-tasks). For the rating tests, participants were asked to rate each model independently (2 sub-tasks) in terms of Empathy, Relevance, and Fluency, on a 1-5 Likert scale. The following clarifications were given for each metric: "Relevance evaluates whether the generated response is on-topic with the dialogue context", "Fluency measures the grammatical correctness and readability of the generated response", and "Empathy measures whether the generated response shows understanding of the speaker's feelings".
During the evaluation study, each user was presented with 7 conversations, randomly sampled from the whole test set (2547 conversations). Therefore, participants were presented with 343 unique conversations in the first phase (49 participants) and 105 in the second phase (15 participants).
To test the statistical significance of the evaluation results, we used the binomial test for the A/B testing sub-tasks, as in Shuster et al., 2020, and the Mann-Whitney U non-parametric test (Nachar, 2008) for the rating sub-tasks, as it is more robust (Rosenberg and Ramabhadran, 2017) than the t-test for Mean Opinion Score (MOS) tests.
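For reference, the two-sided exact binomial test used for the A/B sub-tasks can be computed with the standard library alone; the sketch below sums the probabilities of all outcomes no more likely than the observed win count under the null hypothesis of a fair coin (the sample sizes in the test are illustrative, not those of the study):

```python
from math import comb

def binomial_two_sided_p(k, n, p=0.5):
    """Two-sided exact binomial test p-value for k successes out of n trials:
    sum the null probabilities of all outcomes whose probability is no
    greater than that of the observed count."""
    pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    observed = pmf[k]
    # Small tolerance guards against float ties on symmetric distributions.
    return sum(q for q in pmf if q <= observed + 1e-12)
```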

Appendix D Additional Results

A full comparison, based on automatic evaluation, between the baseline, the EmpBot model and other existing approaches is presented in Table 5.

Vaswani Full Transformer (Rashkin et al., 2019)
EmoPrepend-1 (Rashkin et al., 2019)
TopicPrepend-1 (Rashkin et al., 2019)
CAiRE (Lin et al., 2019)
GPT2-prepend (Santhanam and Shaikh, 2019)
MoEL - (Lin et al., 2019)
MIME - (Majumder et al., 2020)
HappyBot - (Shin et al., 2019)
BST Generative - (Roller et al., 2021)
DD MT (SOTA) (Shuster et al., 2020)
DD MT+FT (SOTA) (Shuster et al., 2020)
Baseline -
EmpBot -
Table 5: Test performance on automated metrics for the existing approaches, our baseline, and the EmpBot model; some metrics were reported only on the validation set.

Appendix E Example Outputs

We provide examples of the generated responses from the EmpBot, baseline, and DD MT+FT models in Table 6.

Example 1
Speaker: I felt so let down by a person who promised me a project – but then pulled it
away at the last minute in a way that was very personal and reactionary.
Target: That’s terrible. People need to learn to commit to their promises.
Emotion: disappointed
EmpBot: That’s not good. I’m sorry you felt that way.
baseline: That’s terrible. What was the project?
DD MT+FT: what was the project about ?
Example 2
Speaker: I had to give a presentation in front of my entire University class last semester. I
hate presenting things.
Target: oh, that is nerve wracking. What did you try to do to cope?
Emotion: anxious
EmpBot: That’s a real shame. I’m sure you did a great job.
baseline: I’m sorry to hear that.
DD MT+FT: how did it go ?
Example 3
Speaker: I am going to my local amusement park tomorrow and feeling a certain way about
Target: Which way are you feeling? Are you a thrill seeker?
Emotion: excited
EmpBot: That sounds like fun! I love going to the amusement park!
baseline: I’m sorry to hear that.
DD MT+FT: what amusement park are you going to ?
Example 4
Speaker: I felt very isolated when we first moved to Hong Kong – I had a young baby and
my husband pretty much needed to start traveling for business right away!
Target: Wow, I can’t imagine how you felt living in Hong Kong. That would take some
serious adapting.
Emotion: lonely
EmpBot: I’m sorry to hear that. Is he still in Hong Kong?
baseline: That’s a bummer. I’m sorry to hear that.
DD MT+FT: i ’ m sorry to hear that . did you have a lot of fun ?
Table 6: Examples of the generated responses for each model, along with the target response and the corresponding emotion of the speaker.