MoEL: Mixture of Empathetic Listeners

08/21/2019 ∙ by Zhaojiang Lin, et al. ∙ The Hong Kong University of Science and Technology

Previous research on empathetic dialogue systems has mostly focused on generating responses given certain emotions. However, being empathetic requires not only the ability to generate emotional responses but, more importantly, the ability to understand user emotions and reply appropriately. In this paper, we propose a novel end-to-end approach for modeling empathy in dialogue systems: Mixture of Empathetic Listeners (MoEL). Our model first captures the user emotions and outputs an emotion distribution. Based on this, MoEL softly combines the output states of the appropriate Listener(s), each of which is optimized to react to certain emotions, and generates an empathetic response. Human evaluations on the empathetic-dialogues dataset (Rashkin et al., 2018) confirm that MoEL outperforms a multitask training baseline in terms of empathy, relevance, and fluency. Furthermore, a case study on the responses generated by different Listeners shows the high interpretability of our model.




1 Introduction

Neural network approaches to conversation modeling have been shown to scale well in training and to generate fluent and relevant responses Vinyals and Le (2015). However, Li et al. (2016a, b, c); Wu et al. (2018b) have pointed out that using only Maximum Likelihood Estimation as the objective function tends to lead to generic and repetitive responses like “I am sorry”. Furthermore, many others have shown that incorporating additional inductive bias leads to a more engaging chatbot, for example by understanding commonsense Dinan et al. (2018) or modeling a consistent persona Li et al. (2016b); Zhang et al. (2018b); Mazare et al. (2018a).

Emotion: Angry
Situation: I was furious when I got in my first car wreck.
Speaker: I was driving on the interstate and another car ran into the back of me.
Listener: Wow. Did you get hurt? Sounds scary.
Speaker: No just the airbags went off and I hit my head and got a few bruises.
Listener: I am always scared about those airbags! I am so glad you are ok!
Table 1: One conversation from the empathetic-dialogues dataset. A speaker describes the situation he or she is facing, and a listener tries to understand the speaker’s feelings and respond accordingly.

Meanwhile, another important aspect of an engaging human conversation that has received relatively little attention is emotional understanding and empathy Rashkin et al. (2018); Dinan et al. (2019); Wolf et al. (2019). Intuitively, ordinary social conversations between two humans are often about their daily lives and revolve around happy or sad experiences. In such scenarios, people generally tend to respond in a way that acknowledges the feelings of their conversational partners.

Table 1 shows a conversation from the empathetic-dialogues dataset Rashkin et al. (2018) that illustrates how an empathetic person would respond to the stressful situation the Speaker has been through. However, despite the importance of empathy and emotional understanding in human conversations, it is still very challenging to train a dialogue agent that can recognize the user emotion and respond with the correct one.

Figure 1: The proposed model Mixture of Empathetic Listeners, which has an emotion tracker, empathetic listeners along with a shared listener, and a meta listener to fuse the information from listeners and produce the empathetic response.

So far, to solve the problem of empathetic dialogue response generation, which is to understand the user emotion and respond appropriately Bertero et al. (2016), there have been mainly two lines of work. The first is a multi-task approach that jointly trains a model to predict the current emotional state of the user and generate an appropriate response based on the state Lubis et al. (2018); Rashkin et al. (2018). Instead, the second line of work focuses on conditioning the response generation to a certain fixed emotion Hu et al. (2017); Wang and Wan (2018); Zhou and Wang (2018); Zhou et al. (2018).

Both cases have succeeded in generating empathetic and emotional responses, but have neglected some crucial points in empathetic dialogue response generation. 1) The first assumes that by understanding the emotion, the model implicitly learns how to respond appropriately. However, without any additional inductive bias, a single decoder learning to respond for all emotions will not only lose interpretability in the generation process, but will also promote more generic responses. 2) The second assumes that the emotion to condition the generation on is given as input, but we often do not know which emotion is appropriate in order to generate an empathetic response.

Therefore, to address the above issues, we propose a novel end-to-end empathetic dialogue agent called Mixture of Empathetic Listeners (MoEL), inspired by Shazeer et al. (2017). (The code will be released.) Similar to Rashkin et al. (2018), we first encode the dialogue context and use it to recognize the emotional state ($n$ possible emotions). The main difference is that our model consists of $n$ decoders, further denoted as listeners, which are optimized to react to each context emotion accordingly. The listeners are trained along with a Meta-listener that softly combines the output decoder states of each listener according to the emotion classification distribution. This design allows our model to explicitly learn how to choose an appropriate reaction based on its understanding of the context emotion. A detailed illustration of MoEL is shown in Figure 1.

The proposed model is tested against several competitive baseline settings Vaswani et al. (2017); Rashkin et al. (2018) and evaluated by human judges. The experimental results show that our approach outperforms the baselines in both empathy and relevance. Finally, our analysis demonstrates that MoEL not only attends effectively to the right listener, but that each listener also learns how to react properly to its corresponding emotion, which allows a more interpretable generative process.

2 Related Work

Conversational Models:

Open-domain conversational models have been widely studied Serban et al. (2016); Vinyals and Le (2015); Wolf et al. (2019). A recent trend is to produce personalized responses by conditioning the generation on a persona profile to make the responses more consistent throughout the dialogue Li et al. (2016b). In particular, the PersonaChat dataset Zhang et al. (2018a); Kulikov et al. (2018) was created, and then extended in the ConvAI 2 challenge Dinan et al. (2019), to show that adding persona information as input to the model yields responses with more consistent personas. Building on this, several follow-up works have been presented Mazare et al. (2018b); Hancock et al. (2019); Joshi et al. (2017); Kulikov et al. (2018); Yavuz et al. (2018); Zemlyanskiy and Sha (2018); Madotto et al. (2019). However, such personalized dialogue agents focus only on modeling a consistent persona and often neglect the feelings of their conversation partners.

Another line of work combines retrieval and generation to promote response diversity Cai et al. (2018); Weston et al. (2018); Wu et al. (2018b). However, fewer works focus on emotion Winata et al. (2017, 2019); Xu et al. (2018); Fan et al. (2018a, c, b); Lee et al. (2019) and empathy in the context of dialogue systems Bertero et al. (2016); Chatterjee et al. (2019a, b); Shin et al. (2019). For generating emotional dialogues, Hu et al. (2017); Wang and Wan (2018); Zhou and Wang (2018) successfully introduce a framework for controlling the sentiment and emotion of the generated response, while Zhou and Wang (2018) also introduce a new Twitter conversation dataset and propose to distantly supervise the generative model with emojis. Meanwhile, Lubis et al. (2018); Rashkin et al. (2018) also introduce new datasets for empathetic dialogues and train multi-task models on them.

Mixture of Experts:

The idea of having specialized parameters, or so-called experts, has been a widely studied topic over the last two decades Jacobs et al. (1991); Jordan and Jacobs (1994). Different architectures and methodologies have been used, such as SVM Collobert et al. (2002), Gaussian Processes Tresp (2001); Theis and Bethge (2015); Deisenroth and Ng (2015), Dirichlet Processes Shahbaba and Neal (2009), Hierarchical Experts Yao et al. (2009), Infinite Number of Experts Rasmussen and Ghahramani (2002) and sequential expert addition Aljundi et al. (2017). More recently, the Mixture of Experts model Shazeer et al. (2017); Kaiser et al. (2017) was proposed, which adds a large number of experts between two LSTM Schmidhuber (1987) layers to enhance the capacity of the model. This idea of having independent specialized experts inspires our approach of modeling the reaction to each emotion with a separate expert.

3 Mixture of Empathetic Listeners

The dialogue context is an alternating sequence of utterances from the speaker and the listener. We denote the dialogue context as $\mathcal{C} = \{U_1, S_1, U_2, S_2, \ldots, U_t\}$ and the speaker emotion state at each utterance as $Emo = \{e_1, e_2, \ldots, e_t\}$, where $e_i \in \{1, \ldots, n\}$. Our model then aims to track the speaker emotional state $e_t$ from the dialogue context $\mathcal{C}$ and to generate an empathetic response $S_t$.

Overall, MoEL is composed of three components: an emotion tracker, emotion-aware listeners, and a meta listener, as shown in Figure 1. The emotion tracker (which is also the context encoder) encodes the dialogue context and computes a distribution over the possible user emotions. Then all the listeners independently attend to the encoded context to compute their own representations. Finally, the meta listener takes the sum of the listener representations, weighted by the emotion distribution, and generates the final response.

Figure 2: Context embedding is computed by summing up the word embedding, dialogue state embedding and positional embedding for each token.

3.1 Embedding

We define the context embedding $E^C \in \mathbb{R}^{|V| \times d_{emb}}$ and the response embedding $E^R \in \mathbb{R}^{|V| \times d_{emb}}$, which are used to convert tokens into embeddings. In multi-turn dialogues, it is essential that the model can distinguish among turns, especially when multiple emotions are present in different turns. Hence, we incorporate a dialogue state embedding in the input, which enables the encoder to distinguish speaker utterances from listener utterances Wolf et al. (2019). As shown in Figure 2, our context embedding $E^C$ is the position-wise sum of the word embedding $E^W$, the positional embedding $E^P$ Vaswani et al. (2017) and the dialogue state embedding $E^D$:

$$E^C(\mathcal{C}) = E^W(\mathcal{C}) + E^P(\mathcal{C}) + E^D(\mathcal{C}) \tag{1}$$
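The summation of the three embeddings can be illustrated with a small pure-Python sketch. It assumes sinusoidal positional embeddings in the style of Vaswani et al. (2017) and randomly initialised lookup tables as stand-ins for learned embeddings; the function names, vocabulary size and toy dimension are hypothetical and chosen only for illustration.

```python
import math
import random


def sinusoidal_position(pos, dim):
    """Sinusoidal positional embedding for a single position."""
    vec = []
    for i in range(dim):
        angle = pos / (10000 ** (2 * (i // 2) / dim))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec


def make_table(num_rows, dim, rng):
    """Randomly initialised lookup table (stand-in for a learned embedding)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_rows)]


def context_embedding(token_ids, state_ids, word_table, state_table):
    """Sum word, positional and dialogue-state embeddings for every token."""
    dim = len(word_table[0])
    out = []
    for pos, (tok, state) in enumerate(zip(token_ids, state_ids)):
        w = word_table[tok]
        p = sinusoidal_position(pos, dim)
        d = state_table[state]
        out.append([w[k] + p[k] + d[k] for k in range(dim)])
    return out


rng = random.Random(0)
DIM = 8                                  # toy size; the paper uses 300
word_table = make_table(20, DIM, rng)    # hypothetical 20-word vocabulary
state_table = make_table(2, DIM, rng)    # 0 = speaker turn, 1 = listener turn

token_ids = [3, 7, 1, 12, 5]             # flattened dialogue tokens
state_ids = [0, 0, 1, 1, 0]              # which side uttered each token
embedded = context_embedding(token_ids, state_ids, word_table, state_table)
```

Each row of `embedded` corresponds to one token, so two identical words uttered by different sides or at different positions receive different input vectors.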


3.2 Emotion Tracker

MoEL uses a standard transformer encoder Vaswani et al. (2017) for the emotion tracker. We first flatten all dialogue turns in $\mathcal{C}$ and map each token into its vectorial representation using the context embedding $E^C$. The encoder then encodes the context sequence into a context representation. We add a query token $QRY$ at the beginning of each input sequence, as in BERT Devlin et al. (2018), to compute the weighted sum of the output tensor. Denoting a transformer encoder as $\mathrm{TRS}_{Enc}$, the corresponding context representation becomes:

$$H = \mathrm{TRS}_{Enc}(E^C([QRY; \mathcal{C}])) \tag{2}$$

where $[\,;\,]$ denotes concatenation and $H \in \mathbb{R}^{L \times d_{model}}$, where $L$ is the sequence length. Then, we define the final representation of the token $QRY$ as

$$q = H_0 \tag{3}$$

where $q \in \mathbb{R}^{d_{model}}$, which is then used as the query for generating the emotion distribution.
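To make the role of the query token concrete, the following sketch prepends a query vector to the context and reads the first output position back as the query. The encoder here is only a stand-in: a single unparameterised self-attention pass with identity projections, not the full Transformer encoder, and `qry_vector` is a hypothetical embedding.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(X):
    """Single-head scaled dot-product self-attention with identity projections.

    Stand-in for the encoder; a real Transformer encoder adds learned
    projections, several heads and layers, residuals and feed-forward blocks.
    """
    d = len(X[0])
    H = []
    for x_i in X:
        scores = [sum(a * b for a, b in zip(x_i, x_j)) / math.sqrt(d) for x_j in X]
        weights = softmax(scores)
        H.append([sum(w * x_j[k] for w, x_j in zip(weights, X)) for k in range(d)])
    return H


def encode_with_query(context_vectors, qry_vector):
    """Prepend the query embedding, encode, and return (H, q) with q = H[0]."""
    X = [qry_vector] + context_vectors
    H = self_attention(X)
    return H, H[0]


context_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
qry_vector = [0.5, 0.5, 0.5]   # hypothetical embedding of the query token
H, q = encode_with_query(context_vectors, qry_vector)
```

Because the query position attends over the whole sequence, `q` is a weighted sum of all token vectors, which is what makes it usable as a summary of the context.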

3.3 Emotion Aware Listeners

The emotion aware listeners mainly consist of 1) a shared listener that learns shared information for all emotions and 2) $n$ independently parameterized Transformer decoders Vaswani et al. (2017) that learn how to appropriately react given a particular emotional state. All the listeners are modeled by a standard transformer decoder layer block, denoted as $\mathrm{TRS}_{Dec}$, which is made of three sub-components: a multi-head self-attention over the response input embedding, a multi-head attention over the output of the emotion tracker, and a position-wise fully connected feed-forward network.

Thus, we define the set of listeners as $[\mathrm{TRS}_{Dec}^{0}, \ldots, \mathrm{TRS}_{Dec}^{n}]$. Given the target sequence shifted by one, $r_{0:t-1}$, each listener computes its own emotional response representation $V_i$:

$$V_i = \mathrm{TRS}_{Dec}^{i}(H, E^R(r_{0:t-1})) \tag{4}$$

where $\mathrm{TRS}_{Dec}^{i}$ refers to the $i$-th listener, including the shared one ($i = 0$), $H$ is the output of the emotion tracker, and $E^R$ is the response embedding. Conceptually, we expect the output of the shared listener, $V_0$, to be a general representation that helps the model capture the dialogue context. On the other hand, we expect each empathetic listener to learn how to respond to a particular emotion. To model this behavior, we assign different weights to each empathetic listener according to the user emotion distribution, while assigning a fixed weight of 1 to the shared listener.

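To elaborate, we construct a Key-Value Memory Network Miller et al. (2016) and represent each memory slot as a vector pair $(k_i, V_i)$, where $k_i \in \mathbb{R}^{d_{model}}$ denotes the key vector and $V_i$ is from Equation 4. Then, the encoder-informed query $q$ (the query-token representation produced by the emotion tracker) is used to address the key vectors $k_i$ by performing a dot product followed by a Softmax function. Thus, we have:

$$p_i = \frac{e^{q^\top k_i}}{\sum_{j=1}^{n} e^{q^\top k_j}} \tag{5}$$

where each $p_i$ is the score assigned to $V_i$ and is thus used as the weight of the $i$-th listener. During training, given the speaker emotion state $e_t$, we supervise the weights $p_i$ by maximizing the probability of the emotion state $e_t$ with a cross-entropy loss function:

$$\mathcal{L}_1 = -\log p_{e_t} \tag{6}$$

Finally, the combined output representation is computed as the weighted sum of the memory values $V_i$ and the shared listener output $V_0$:

$$V_M = V_0 + \sum_{i=1}^{n} p_i V_i \tag{7}$$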
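The three steps of this subsection (softmax over query-key dot products, cross-entropy supervision of the listener weights, and the weighted combination with the shared listener) can be sketched on toy values. All function names and numbers below are hypothetical; the listener outputs are tiny hand-written matrices rather than real decoder states.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def listener_weights(q, keys):
    """Softmax over the dot products between the query and each emotion key."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    return softmax(scores)


def emotion_loss(p, gold_emotion):
    """Cross-entropy with a one-hot target: -log of the gold emotion's weight."""
    return -math.log(p[gold_emotion])


def combine(shared_value, listener_values, p):
    """Shared output (fixed weight 1) plus the weighted sum of listener outputs.

    Every value is a [sequence_length][dim] matrix.
    """
    out = [row[:] for row in shared_value]
    for p_i, V_i in zip(p, listener_values):
        for t, row in enumerate(V_i):
            for k, v in enumerate(row):
                out[t][k] += p_i * v
    return out


q = [1.0, 0.0]
keys = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]]          # one key per emotion
p = listener_weights(q, keys)
loss = emotion_loss(p, gold_emotion=0)

V_shared = [[1.0, 1.0]]                               # one decoding step
V_listeners = [[[1.0, 0.0]], [[0.0, 1.0]], [[-1.0, 0.0]]]
V_M = combine(V_shared, V_listeners, p)
```

With these toy keys the first listener receives most of the weight, so `V_M` is dominated by the shared output and the first listener's output while the others contribute only slightly.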

Figure 3: Top-1 and Top-5 emotion detection accuracy over 32 emotions at each turn

3.4 Meta Listener

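Finally, the Meta Listener is implemented using another transformer decoder layer, which further transforms the representation of the listeners and generates the final response. The intuition is that each listener specializes in a certain emotion and the Meta Listener gathers the opinions generated by multiple listeners to produce the final response. Hence, we define another decoder layer $\mathrm{TRS}_{Dec}^{Meta}$ and an affine transformation $W \in \mathbb{R}^{d_{model} \times |V|}$ to compute:

$$O = \mathrm{TRS}_{Dec}^{Meta}(H, V_M) \tag{8}$$

$$p(r_{1:t} \mid \mathcal{C}, r_{0:t-1}) = \mathrm{softmax}(O^\top W) \tag{9}$$

where $H$ is the output of the emotion tracker, $V_M$ is the combined listener representation, $O \in \mathbb{R}^{d_{model} \times t}$ is the output of the meta listener and $p(r_{1:t} \mid \mathcal{C}, r_{0:t-1})$ is a distribution over the vocabulary for the next tokens. We then use a standard maximum likelihood estimator (MLE) to optimize the response prediction:

$$\mathcal{L}_2 = -\log p(S_t \mid \mathcal{C}) \tag{10}$$

Lastly, all the parameters are jointly trained end-to-end to optimize the listener selection and response generation by minimizing the weighted sum of the two losses:

$$\mathcal{L} = \alpha \mathcal{L}_1 + \beta \mathcal{L}_2 \tag{11}$$

where $\alpha$ and $\beta$ are hyperparameters that balance the two losses.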
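As a worked illustration of the joint objective, the sketch below adds an emotion-classification term and a response negative log-likelihood term on hand-written probabilities. It assumes the response term is summed over target tokens (implementations often average instead), and the names `joint_loss`, `alpha` and `beta` are illustrative.

```python
import math


def response_nll(step_distributions, target_ids):
    """Negative log-likelihood of the gold response tokens, summed over steps."""
    return -sum(math.log(dist[tok]) for dist, tok in zip(step_distributions, target_ids))


def joint_loss(emotion_probs, gold_emotion, step_distributions, target_ids,
               alpha=1.0, beta=1.0):
    """Weighted sum of the emotion cross-entropy and the response NLL."""
    l_emotion = -math.log(emotion_probs[gold_emotion])
    l_response = response_nll(step_distributions, target_ids)
    return alpha * l_emotion + beta * l_response


emotion_probs = [0.5, 0.25, 0.25]                 # toy distribution over 3 emotions
step_distributions = [[0.5, 0.5], [0.25, 0.75]]   # toy 2-word vocabulary, 2 steps
target_ids = [0, 1]

total = joint_loss(emotion_probs, 0, step_distributions, target_ids)
```

Here the emotion term is log 2 and the response term is log 2 + log(4/3), so with both weights equal to 1 the total is simply their sum.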

Model      Params.  BLEU  Empathy  Relevance  Fluency
Gold       -        -     3.93     3.93       3.35
TRS        16.94M   3.02  3.32     3.47       3.52
Multi-TRS  16.95M   2.92  3.36     3.57       3.31
MoEL       23.1M    2.90  3.44     3.70       3.47
Table 2: Comparison between our proposed method and the baselines. All models receive similar BLEU scores. MoEL achieves the highest Empathy and Relevance scores, while TRS achieves a better Fluency score. The number of parameters of each model is reported.

Model              Win    Loss   Tie
MoEL vs TRS        37.3%  18.7%  44.0%
MoEL vs Multi-TRS  36.7%  32.6%  30.7%
Table 3: Results of the human A/B test. Tests are conducted pairwise between MoEL and the baseline models.

4 Experiment

4.1 Dataset

We conduct our experiments on the empathetic-dialogues dataset Rashkin et al. (2018), which consists of 25k one-to-one open-domain conversations grounded in emotional situations. The dataset provides 32 evenly distributed emotion labels. Table 1 shows an example from the training set. The speakers talk about their situation, and the listeners try to understand their feelings and reply accordingly. At training time the emotion labels of the speakers are given, while at test time we hide the labels to evaluate the empathy of our model.

4.2 Training

We train our model using the Adam optimizer Kingma and Ba (2014) and vary the learning rate during training following Vaswani et al. (2017). The weights of the two losses (emotion classification and response generation) are both set to 1 for simplicity. We use pre-trained GloVe vectors Pennington et al. (2014) to initialize the word embedding, which is shared across the encoder and the decoder. The rest of the parameters are randomly initialized.
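For reference, the learning-rate schedule of Vaswani et al. (2017) increases the rate linearly during warm-up and then decays it with the inverse square root of the step. The sketch below uses the hidden size of 300 reported in this paper; the warm-up length of 4000 steps is the value from Vaswani et al. and is only an assumption here, since this paper does not state it.

```python
def noam_lr(step, d_model=300, warmup_steps=4000):
    """Learning rate of Vaswani et al. (2017).

    lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    warmup_steps=4000 is an assumed value, not one reported for MoEL.
    """
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


rates = {step: noam_lr(step) for step in (1, 2000, 4000, 8000)}
```

The rate peaks exactly at `warmup_steps`, so `rates[4000]` is the largest of the four sampled values.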

In the early training stage, the emotion tracker assigns weights to the listeners essentially at random and may send noisy gradients back to the wrong listeners, which makes convergence harder. To stabilize the learning process, we replace the distribution over the listeners with the oracle emotion information with a certain probability , and we gradually anneal this probability during training. We set an annealing rate , and a threshold equal to , so that the probability is recomputed at each iteration.
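The idea of annealed oracle supervision can be sketched as follows. The exponential decay and the constants `floor` and `decay_steps` are hypothetical choices made only for this illustration; the schedule and values actually used for MoEL may differ.

```python
import math
import random


def oracle_probability(step, floor=1e-3, decay_steps=1e4):
    """Hypothetical schedule: starts at 1 and decays exponentially to `floor`."""
    return floor + (1.0 - floor) * math.exp(-step / decay_steps)


def listener_distribution(predicted, gold_emotion, step, rng):
    """Return the one-hot oracle distribution with the annealed probability,
    otherwise the distribution predicted by the emotion tracker."""
    if rng.random() < oracle_probability(step):
        return [1.0 if i == gold_emotion else 0.0 for i in range(len(predicted))]
    return predicted


predicted = [0.4, 0.35, 0.25]   # toy prediction over 3 emotions
early = listener_distribution(predicted, 2, step=0, rng=random.Random(0))
late = listener_distribution(predicted, 2, step=10**9, rng=random.Random(0))
```

Early in training the listeners are almost always driven by the gold emotion, so gradients reach the intended listener; late in training the model relies on its own predicted distribution.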

Emotion: Disgusted
Speaker: i had to dog-sit for my dad about a week ago . i had to clean up dog poop at least 10 times , it was awful .
TRS: that is horrible . i hope you did not have to clean it up .
Multi-TRS: oh my goodness , that must have been so scary !
MoEL: oh wow , that is pretty gross .
Goal: ohh , cleaning dog poop is always been annoying , yet petting the dog is always amusing

Emotion: Disappointed
Speaker: i was turned down for a job i really thought i would be a good fit at .
Listener: did they give you any reason for not giving you the job ?
Speaker: someone else interviewed who had much more experience than i did .
TRS: that is good . i hope you have a good time .
Multi-TRS: that is a good idea . i hope it was a good idea .
MoEL: i am sorry to hear that . i am sure you will do great again
Goal: well , at least a candidate that deserved the job got it .

Speaker: so my friend ’s baby fell down and scraped his knee yesterday and i was trying so hard to cheer him up.
: that was really nice of you. I am glad you were able to help him.
Terrified listener: oh wow ! that must have been so scary .
Sad listener: oh no , i am so sorry to hear that .
Excited listener: that sounds like a good idea .
Proud listener: that was awesome ! did he get any good at you ?

Speaker: my husband lost a job but I am hoping he can find a full-time job soon.
: what kind of job is it?
Terrified listener: oh that sucks . you should have been together .
Sad listener: i am so sorry . i am sure he will be fine .
Excited listener: that sounds awesome . what kind of job did he get you for ?
Proud listener: oh wow ! congratulations to him . you must be proud of him .
Table 4: Generated responses from TRS, Multi-TRS and MoEL in 2 different user emotion states (top) and comparing generation from different listeners (bottom). We use hard attention on Terrified, Sad, Excited and Proud listeners.

4.3 Baseline

We compare our model with two baselines:

Transformer (TRS)

The standard Transformer model Vaswani et al. (2017) that is trained to minimize MLE loss as in Equation 10.

Multitask Transformer (Multi-TRS)

A Multitask Transformer trained as in Rashkin et al. (2018) to incorporate additional supervised information about the emotion. The encoder of the multitask transformer is the same as our emotion tracker, and the context representation of the query token, from Equation 3, is used as input to an emotion classifier. The whole model is jointly trained by optimizing both the classification and the generation loss.

Figure 4: Visualization of the attention on the listeners. The left side shows the context followed by the responses generated by MoEL. The heat map illustrates the attention weights on the 32 listeners.

4.4 Hyperparameter

In all of our experiments we use 300-dimensional word embeddings and a hidden size of 300 everywhere. We use 2 self-attention layers made up of 2 attention heads each, with embedding dimension 40. We replace the position-wise feed-forward sub-layer with a 1D convolution with 50 filters of width 3. We train all models with batch size 16 and use batch size 1 at test time.
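The reported settings can be collected in a small configuration object, which is convenient when reproducing the setup. The class and field names below are hypothetical; only the values come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoELConfig:
    """Hyperparameters reported in the paper; field names are illustrative."""
    emb_dim: int = 300           # word embedding size
    hidden_dim: int = 300        # hidden size used everywhere
    num_layers: int = 2          # self-attention layers
    num_heads: int = 2           # attention heads per layer
    head_dim: int = 40           # embedding dimension of each head
    conv_filters: int = 50       # 1D convolution replacing the feed-forward sub-layer
    conv_width: int = 3          # convolution kernel width
    train_batch_size: int = 16
    test_batch_size: int = 1
    num_emotions: int = 32       # emotion labels in empathetic-dialogues


config = MoELConfig()
attention_width = config.num_heads * config.head_dim   # total width across heads
```

Freezing the dataclass guards against accidental changes to the settings between training and evaluation runs.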

4.5 Evaluation Metrics


BLEU

We compute BLEU scores Papineni et al. (2002) to compare the generated responses against human responses. However, in open-domain dialogue response generation, BLEU is not a good measurement of generation quality Liu et al. (2016), so we use BLEU only as a reference.

Human Ratings

In order to measure the quality of the generated responses, we conduct human evaluations with Amazon Mechanical Turk. Following Rashkin et al. (2018), we first randomly sample 100 dialogues and their corresponding generations from MoEL and the baselines. For each response, we assign three human annotators to score the following aspects of the models: Empathy, Relevance, and Fluency. Note that we evaluate each metric independently and the scores range between 1 and 5, in which 1 is “not at all” and 5 is “very much”.

We ask the human judges to evaluate each of the following categories on a 1 to 5 scale, where 5 is the best score.

  • Empathy / Sympathy: Did the responses from the LISTENER show understanding of the feelings of the SPEAKER talking about their experience?

  • Relevance: Did the responses of the LISTENER seem appropriate to the conversation? Were they on-topic?

  • Fluency: Could you understand the responses from the LISTENER? Did the language seem accurate?

Human A/B Test

In this human evaluation task, we aim to directly compare the generated responses with each other. We randomly sample 100 dialogues each for MoEL vs {TRS, Multi-TRS}. Three workers are given randomly ordered responses from either MoEL or {TRS, Multi-TRS}, and are prompted to choose the better response. They can either choose one of the responses or select tie when the provided options are either both good or both bad.

5 Results

Emotion detection

To verify whether our model can attend to the appropriate listeners, we compute the emotion detection accuracy for each turn. Our model achieves , , in terms of top-1, top-3, top-5 detection accuracy over 32 emotions. We notice that some emotions frequently appear in similar contexts (e.g., Annoyed, Angry, Furious), which might degrade the detection accuracy. Figure 3 shows the per-class accuracy on the test set. We can see that with top-5 the majority of the emotions achieve around 80% accuracy.
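Top-k detection accuracy counts a prediction as correct when the gold emotion is among the k highest-scoring emotions. A minimal sketch on toy distributions (three emotions instead of 32, with made-up probabilities) is:

```python
def top_k_accuracy(distributions, gold_labels, k):
    """Fraction of examples whose gold label is among the k most probable classes."""
    hits = 0
    for probs, gold in zip(distributions, gold_labels):
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        if gold in ranked[:k]:
            hits += 1
    return hits / len(gold_labels)


distributions = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.2, 0.7],
]
gold_labels = [0, 2, 0]

top1 = top_k_accuracy(distributions, gold_labels, k=1)
top2 = top_k_accuracy(distributions, gold_labels, k=2)
top3 = top_k_accuracy(distributions, gold_labels, k=3)
```

In this toy example only the first prediction is correct at top-1, the second becomes correct at top-2, and all three are correct at top-3, which mirrors how closely related emotions can be recovered by widening k.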

Response evaluation

Both automatic and human evaluation results are shown in Table 2. TRS achieves the highest BLEU and Fluency scores but the lowest Empathy and Relevance scores. This shows that the responses generated by TRS are more generic and cannot accurately capture the user emotions. With the additional supervision on user emotions, multi-task training improves both the Empathy and Relevance scores, but it degrades Fluency. In contrast, MoEL achieves the highest Empathy and Relevance scores. This suggests that the multi-expert strategy helps to capture the user emotional states and the context simultaneously, and elicits a more appropriate response. The human A/B tests also confirm that the responses from our model are preferred by human judges.

6 Analysis

In order to understand whether and how MoEL improves over the baselines, learns each emotion, and properly reacts to them, we conduct three different analyses: model response comparison, listener analysis, and visualization of the emotion distribution.

Model response comparison

The top part of Table 4 compares the generated responses from MoEL and the two baselines on two different speaker emotional states. In the first example, MoEL captures the exact emotion of the speaker by replying with “cleaning up dog poop is pretty gross”, instead of “horrible” and “scary”. In the second example, both TRS and Multi-TRS fail to understand that the speaker is disappointed about the failure of his interview, and they generate inappropriate responses. On the other hand, MoEL shows an empathetic response by comforting the speaker with “I am sure you will do great again”. More examples can be found in the Appendix.

Listener analysis

To better understand how each listener learns to react to different contexts, we compare the responses produced by different listeners. To do so, we fix the input dialogue context and manually modify the attention distribution used to produce the response. We experiment with the correct listener and four other listeners: Terrified, Sad, Excited, and Proud. Given the same context, we expect different listeners to react differently, as this is our inductive bias. For example, the Sad listener is optimized to comfort sad people, while a listener such as Excited shares the user's positive emotions. From the generation results in the bottom part of Table 4, we can see that the corresponding listeners produce empathetic and relevant responses when they reasonably match the speaker emotions. However, when the expected emotion label is opposite to the selected listener, such as caring and sad, the response becomes emotionally inappropriate.

Interestingly, in the last example, the sad listener actually produces a more meaningful response by encouraging the speaker. This is due to the first part of the context, which conveys a sad emotion. On the other hand, for the same example, the excited listener responds with a very relevant yet unsympathetic response. In addition, as many dialogue contexts contain multiple emotions, being able to capture them would lead to a better understanding of the speaker emotional state.
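Forcing a particular listener amounts to replacing the predicted emotion distribution with a one-hot vector before the listener outputs are combined. The sketch below shows this on toy vectors; the emotion subset, the listener outputs and the helper names are hypothetical.

```python
def hard_attention(emotions, chosen):
    """One-hot attention vector that selects exactly one listener."""
    return [1.0 if e == chosen else 0.0 for e in emotions]


def combine(shared, listener_outputs, weights):
    """Shared listener output plus the weighted sum of the listener outputs."""
    return [
        s + sum(w * out[k] for w, out in zip(weights, listener_outputs))
        for k, s in enumerate(shared)
    ]


EMOTIONS = ["terrified", "sad", "excited", "proud"]   # subset used in Table 4
listener_outputs = [
    [1.0, 0.0],   # terrified
    [0.0, 1.0],   # sad
    [2.0, 2.0],   # excited
    [3.0, 1.0],   # proud
]
shared = [0.5, 0.5]

weights = hard_attention(EMOTIONS, "sad")
forced = combine(shared, listener_outputs, weights)
```

With hard attention on the sad listener, the combined representation is just the shared output plus that single listener's output, so the generated response reflects how that listener alone would react to the context.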

Visualization of Emotion Distribution

Finally, to understand how MoEL chooses the listener according to the context, we visualize the emotion distribution in Figure 4. In most cases, the model attends to the proper listeners (emotions) and generates a proper response. This is also confirmed by the accuracy results shown in Figure 3. However, our model sometimes focuses on only part of the dialogue context. For example, in the fifth example in Figure 4, the model fails to detect the real emotion of the speaker because the context contains “I was pretty surprised” in its last turn.

On the other hand, the last three rows of the heatmap indicate that the model learns to leverage multiple listeners to produce an empathetic response. For example, when the speaker talks about some criminals that shot one of his neighbors, MoEL successfully detects both annoyed and afraid emotions from the context, and replies with an appropriate response, “that is horrible! i am glad you are okay!”, that addresses both emotions. However, in the third row, the model produces “you” instead of “he” by mistake. Although the model is able to capture the relevant emotions in this case, other emotions also have non-negligible weights, which results in a smooth emotion distribution that prevents the meta listener from generating an accurate response.

7 Conclusion & Future Work

In this paper, we propose a novel way to generate empathetic dialogue responses by using a Mixture of Empathetic Listeners (MoEL). Unlike previous work, our model understands the user's feelings and responds accordingly by learning specific listeners for each emotion. We benchmark our model on the empathetic-dialogues dataset Rashkin et al. (2018), which is a multi-turn open-domain conversation corpus grounded in emotional situations. Our experimental results show that MoEL achieves competitive performance on the task with the advantage of being more interpretable than other conventional models. Finally, we show that our model is able to automatically select the correct emotional decoder and effectively generate an empathetic response.

One of the possible extensions of this work would be incorporating it with Persona Zhang et al. (2018b) and task-oriented dialogue systems Gao et al. (2018); Madotto et al. (2018); Wu et al. (2019, 2017, 2018a); Reddy et al. (2018); Raghu et al. (2019). Having a persona would allow the system to have more consistent and personalized responses, and combining open-domain conversations with task-oriented dialogue systems would equip the system with more engaging conversational capabilities, hence resulting in a more versatile dialogue system.


References

  • R. Aljundi, P. Chakravarty, and T. Tuytelaars (2017) Expert gate: lifelong learning with a network of experts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3366–3375. Cited by: §2.
  • D. Bertero, F. B. Siddique, C. Wu, Y. Wan, R. H. Y. Chan, and P. Fung (2016) Real-time speech emotion and sentiment recognition for interactive dialogue systems. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1042–1047. Cited by: §1, §2.
  • D. Cai, Y. Wang, V. Bi, Z. Tu, X. Liu, W. Lam, and S. Shi (2018) Skeleton-to-response: dialogue generation guided by retrieval memory. arXiv preprint arXiv:1809.05296. Cited by: §2.
  • A. Chatterjee, U. Gupta, M. K. Chinnakotla, R. Srikanth, M. Galley, and P. Agrawal (2019a) Understanding emotions in text using deep learning and big data. Computers in Human Behavior 93, pp. 309–317. Cited by: §2.
  • A. Chatterjee, K. N. Narahari, M. Joshi, and P. Agrawal (2019b) SemEval-2019 task 3: emocontext: contextual emotion detection in text. In Proceedings of The 13th International Workshop on Semantic Evaluation (SemEval-2019), Minneapolis, Minnesota. Cited by: §2.
  • R. Collobert, S. Bengio, and Y. Bengio (2002) A parallel mixture of svms for very large scale problems. In Advances in Neural Information Processing Systems, pp. 633–640. Cited by: §2.
  • M. Deisenroth and J. W. Ng (2015) Distributed gaussian processes. In International Conference on Machine Learning, pp. 1481–1490. Cited by: §2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §3.2.
  • E. Dinan, V. Logacheva, V. Malykh, A. Miller, K. Shuster, J. Urbanek, D. Kiela, A. Szlam, I. Serban, R. Lowe, et al. (2019) The second conversational intelligence challenge (convai2). arXiv preprint arXiv:1902.00098. Cited by: §1, §2.
  • E. Dinan, S. Roller, K. Shuster, A. Fan, M. Auli, and J. Weston (2018) Wizard of wikipedia: knowledge-powered conversational agents. ICLR. Cited by: §1.
  • Y. Fan, J. C. Lam, and V. O. Li (2018a) Multi-region ensemble convolutional neural network for facial expression recognition. In International Conference on Artificial Neural Networks, pp. 84–94. Cited by: §2.
  • Y. Fan, J. C. Lam, and V. O. Li (2018b) Unsupervised domain adaptation with generative adversarial networks for facial emotion recognition. In 2018 IEEE International Conference on Big Data (Big Data), pp. 4460–4464. Cited by: §2.
  • Y. Fan, J. C. Lam, and V. O. Li (2018c) Video-based emotion recognition using deeply-supervised neural networks. In Proceedings of the 2018 on International Conference on Multimodal Interaction, pp. 584–588. Cited by: §2.
  • J. Gao, M. Galley, and L. Li (2018) Neural approaches to conversational ai. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 1371–1374. Cited by: §7.
  • B. Hancock, A. Bordes, P. Mazare, and J. Weston (2019) Learning from dialogue after deployment: feed yourself, chatbot!. arXiv preprint arXiv:1901.05415. Cited by: §2.
  • Z. Hu, Z. Yang, X. Liang, R. Salakhutdinov, and E. P. Xing (2017) Toward controlled generation of text. In International Conference on Machine Learning, pp. 1587–1596. Cited by: §1, §2.
  • R. A. Jacobs, M. I. Jordan, S. J. Nowlan, G. E. Hinton, et al. (1991) Adaptive mixtures of local experts. Neural computation 3 (1), pp. 79–87. Cited by: §2.
  • M. I. Jordan and R. A. Jacobs (1994) Hierarchical mixtures of experts and the em algorithm. Neural computation 6 (2), pp. 181–214. Cited by: §2.
  • C. K. Joshi, F. Mi, and B. Faltings (2017) Personalization in goal-oriented dialog. arXiv preprint arXiv:1706.07503. Cited by: §2.
  • L. Kaiser, A. N. Gomez, N. Shazeer, A. Vaswani, N. Parmar, L. Jones, and J. Uszkoreit (2017) One model to learn them all. arXiv preprint arXiv:1706.05137. Cited by: §2.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.2.
  • I. Kulikov, A. H. Miller, K. Cho, and J. Weston (2018) Importance of a search strategy in neural dialogue modelling. arXiv preprint arXiv:1811.00907. Cited by: §2.
  • N. Lee, Z. Liu, and P. Fung (2019) Team yeon-zi at semeval-2019 task 4: hyperpartisan news detection by de-noising weakly-labeled data. In Proceedings of the 13th International Workshop on Semantic Evaluation, pp. 1052–1056. Cited by: §2.
  • J. Li, M. Galley, C. Brockett, J. Gao, and B. Dolan (2016a) A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 110–119. Cited by: §1.
  • J. Li, M. Galley, C. Brockett, G. Spithourakis, J. Gao, and B. Dolan (2016b) A persona-based neural conversation model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 994–1003. Cited by: §1, §2.
  • J. Li, W. Monroe, A. Ritter, D. Jurafsky, M. Galley, and J. Gao (2016c) Deep reinforcement learning for dialogue generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1192–1202. Cited by: §1.
  • C. Liu, R. Lowe, I. Serban, M. Noseworthy, L. Charlin, and J. Pineau (2016) How not to evaluate your dialogue system: an empirical study of unsupervised evaluation metrics for dialogue response generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2122–2132. Cited by: §4.5.
  • N. Lubis, S. Sakti, K. Yoshino, and S. Nakamura (2018) Eliciting positive emotion through affect-sensitive dialogue response generation: a neural network approach. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §1, §2.
  • A. Madotto, Z. Lin, C. Wu, and P. Fung (2019) Personalizing dialogue agents via meta-learning. In Proceedings of the 57th Conference of the Association for Computational Linguistics, pp. 5454–5459. Cited by: §2.
  • A. Madotto, C. Wu, and P. Fung (2018) Mem2Seq: effectively incorporating knowledge bases into end-to-end task-oriented dialog systems. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1468–1478. Cited by: §7.
  • P. Mazare, S. Humeau, M. Raison, and A. Bordes (2018a) Training millions of personalized dialogue agents. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2775–2779. Cited by: §1.
  • P. Mazare, S. Humeau, M. Raison, and A. Bordes (2018b) Training millions of personalized dialogue agents. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2775–2779. Cited by: §2.
  • A. Miller, A. Fisch, J. Dodge, A. Karimi, A. Bordes, and J. Weston (2016) Key-value memory networks for directly reading documents. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1400–1409. Cited by: §3.3.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311–318. Cited by: §4.5.
  • J. Pennington, R. Socher, and C. Manning (2014) Glove: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543. Cited by: §4.2.
  • D. Raghu, N. Gupta, and Mausam (2019) Disentangling Language and Knowledge in Task-Oriented Dialogs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 1239–1255. Cited by: §7.
  • H. Rashkin, E. M. Smith, M. Li, and Y. Boureau (2018) I know the feeling: learning to converse with empathy. arXiv preprint arXiv:1811.00207. Cited by: MoEL: Mixture of Empathetic Listeners, §1, §1, §1, §1, §1, §2, §4.1, §4.3, §4.5, §7.
  • C. E. Rasmussen and Z. Ghahramani (2002) Infinite mixtures of gaussian process experts. In Advances in neural information processing systems, pp. 881–888. Cited by: §2.
  • R. Reddy, D. Contractor, D. Raghu, and S. Joshi (2018) Multi-level memory for task oriented dialogs. arXiv preprint arXiv:1810.10647. Cited by: §7.
  • J. Schmidhuber (1987) Evolutionary principles in self-referential learning. on learning how to learn: the meta-meta-meta…-hook. Diploma Thesis, Technische Universität München, Germany. Cited by: §2.
  • I. V. Serban, R. Lowe, L. Charlin, and J. Pineau (2016) Generative deep neural networks for dialogue: a short review. arXiv preprint arXiv:1611.06216. Cited by: §2.
  • B. Shahbaba and R. Neal (2009) Nonlinear models using dirichlet process mixtures. Journal of Machine Learning Research 10 (Aug), pp. 1829–1850. Cited by: §2.
  • N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean (2017) Outrageously large neural networks: the sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538. Cited by: §1, §2.
  • J. Shin, P. Xu, A. Madotto, and P. Fung (2019) HappyBot: generating empathetic dialogue responses by improving user experience look-ahead. arXiv preprint arXiv:1906.08487. Cited by: §2.
  • L. Theis and M. Bethge (2015) Generative image modeling using spatial lstms. In Advances in Neural Information Processing Systems, pp. 1927–1935. Cited by: §2.
  • V. Tresp (2001) Mixtures of gaussian processes. In Advances in neural information processing systems, pp. 654–660. Cited by: §2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §1, §3.1, §3.2, §3.3, §4.2, §4.3.
  • O. Vinyals and Q. Le (2015) A neural conversational model. In International Conference on Machine Learning, Cited by: §1, §2.
  • K. Wang and X. Wan (2018) SentiGAN: generating sentimental texts via mixture adversarial networks. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pp. 4446–4452. Cited by: §1, §2.
  • J. Weston, E. Dinan, and A. Miller (2018) Retrieve and refine: improved sequence generation models for dialogue. In Proceedings of the 2018 EMNLP Workshop SCAI: The 2nd International Workshop on Search-Oriented Conversational AI, pp. 87–92. Cited by: §2.
  • G. I. Winata, O. Kampman, Y. Yang, A. Dey, and P. Fung (2017) Nora the empathetic psychologist. Proc. Interspeech 2017, pp. 3437–3438. Cited by: §2.
  • G. I. Winata, A. Madotto, Z. Lin, J. Shin, Y. Xu, P. Xu, and P. Fung (2019) CAiRE_HKUST at semeval-2019 task 3: hierarchical attention for dialogue emotion classification. In Proceedings of the 13th International Workshop on Semantic Evaluation, pp. 142–147. Cited by: §2.
  • T. Wolf, V. Sanh, J. Chaumond, and C. Delangue (2019) TransferTransfo: a transfer learning approach for neural network based conversational agents. arXiv preprint arXiv:1901.08149. Cited by: §1, §2, §3.1.
  • C. Wu, A. Madotto, G. Winata, and P. Fung (2017) End-to-end recurrent entity network for entity-value independent goal-oriented dialog learning. In Dialog System Technology Challenges Workshop, DSTC6, Cited by: §7.
  • C. Wu, A. Madotto, G. I. Winata, and P. Fung (2018a) End-to-end dynamic query memory network for entity-value independent task-oriented dialog. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6154–6158. Cited by: §7.
  • C. Wu, R. Socher, and C. Xiong (2019) Global-to-local memory pointer networks for task-oriented dialogue. arXiv preprint arXiv:1901.04713. Cited by: §7.
  • Y. Wu, F. Wei, S. Huang, Z. Li, and M. Zhou (2018b) Response generation by context-aware prototype editing. arXiv preprint arXiv:1806.07042. Cited by: §1, §2.
  • P. Xu, A. Madotto, C. Wu, J. H. Park, and P. Fung (2018) Emo2Vec: learning generalized emotion representation by multi-task training. In Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pp. 292–298. Cited by: §2.
  • B. Yao, D. Walther, D. Beck, and L. Fei-Fei (2009) Hierarchical mixture of classification experts uncovers interactions between brain regions. In Advances in Neural Information Processing Systems, pp. 2178–2186. Cited by: §2.
  • S. Yavuz, A. Rastogi, G. Chao, and D. Hakkani-Tür (2018) Deepcopy: grounded response generation with hierarchical pointer networks. ConvAI Workshop@NIPS. Cited by: §2.
  • Y. Zemlyanskiy and F. Sha (2018) Aiming to know you better perhaps makes me a more engaging dialogue partner. CoNLL 2018, pp. 551. Cited by: §2.
  • S. Zhang, E. Dinan, J. Urbanek, A. Szlam, D. Kiela, and J. Weston (2018a) Personalizing dialogue agents: i have a dog, do you have pets too?. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2204–2213. Cited by: §2.
  • S. Zhang, E. Dinan, J. Urbanek, A. Szlam, D. Kiela, and J. Weston (2018b) Personalizing dialogue agents: i have a dog, do you have pets too?. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2204–2213. Cited by: §1, §7.
  • H. Zhou, M. Huang, T. Zhang, X. Zhu, and B. Liu (2018) Emotional chatting machine: emotional conversation generation with internal and external memory. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §1.
  • X. Zhou and W. Y. Wang (2018) MojiTalk: generating emotional responses at scale. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1128–1137. Cited by: §1, §2.

Appendix A Supplemental Material

Label: Caring
Context:
  Speaker: my boyfriend is sick . he can not eat much so i made some soup for him.
  Listener: you are sweet , what kind of soup did you make him ?
  Speaker: i made chicken noodle soup .
TRS: i am sorry to hear that . i hope he gets better !
Multi-TRS: that is good , i am sure he will do great !
MoEL: well that is very nice of you . i hope he gets better !
Goal: i love chicken noodle , he must be so grateful .

Label: Nostalgic
Context:
  Speaker: i do not watch a lot of kids movies these days , but they always bring back fond memories.
  Listener: same . what are your favorites ?
  Speaker: i have seen the movie trolls probably 100 times , that and harry potter !
TRS: i do not think i have ever had one of those .
Multi-TRS: i have never heard of them , but i am sure you will enjoy it .
MoEL: i do too . i do that all the time .
Goal: that is funny . i have also seen trolls about 100 times . my daughter is now sick of it !

Label: Grateful
Context:
  Speaker: i just got a promotion at my job !
TRS: that is great ! i hope you get it !
Multi-TRS: that is great ! congratulations !,congratulations !,congratulations !
MoEL: that is great ! what kind of job is it ?
Goal: i am happy for you . what kind of work

Label: Angry
Context:
  Speaker: one of my coworkers really ticked me off one day . she decided she would delegate some work to me by slamming on my desk and demanding i do it . she had some nerve !
TRS: oh no ! i hate when that happens . i am sorry to hear that .
Multi-TRS: oh no ! that is so scary ! i would be so worried about that .
MoEL: that is awful . did you confront her ?
Goal: did you get mad at her ?

Table 5: Generated responses from TRS, Multi-TRS and MoEL in different user emotion states. Goal responses are given for reference.