DialogueRNN: An Attentive RNN for Emotion Detection in Conversations

11/01/2018 ∙ by Navonil Majumder, et al.

Emotion detection in conversations is a necessary step for a number of applications, including opinion mining over chat history, social media threads, debates, argumentation mining, understanding consumer feedback in live conversations, etc. Currently, systems do not treat the parties in the conversation individually by adapting to the speaker of each utterance. In this paper, we describe a new method based on recurrent neural networks that keeps track of the individual party states throughout the conversation and uses this information for emotion classification. Our model outperforms the state of the art by a significant margin on two different datasets.




1 Introduction

Emotion detection in conversations is attracting increasing attention from the community due to its applications in many important tasks, such as opinion mining over chat history and social media threads on YouTube, Facebook, Twitter, etc. In this paper, we present a method based on recurrent neural networks (RNN) that can cater to these needs by processing the huge amount of available conversational data.

Current systems, including the state of the art [Hazarika et al.2018], do not distinguish different parties in a conversation in a meaningful way. They are not aware of the speaker of a given utterance. In contrast, we model each individual party with a party state that is updated as the conversation flows, based on the utterance, the context, and the current party state. Our model is based on the assumption that there are three major aspects relevant to the emotion in a conversation: the speaker, the context from the preceding utterances, and the emotion of the preceding utterances. These three aspects are not necessarily independent, but their separate modeling significantly outperforms the state of the art (Table 2). In dyadic conversations, the parties have distinct roles. Hence, to extract the context, it is crucial to consider the preceding turns of both speaker and listener at a given moment (Fig. 1).

Our DialogueRNN employs three gated recurrent units (GRU) [Chung et al.2014] to model these aspects. The incoming utterance is fed into two GRUs, called the global GRU and the party GRU, to update the context and the party state, respectively. The global GRU encodes the corresponding party information while encoding an utterance.

Attending over this GRU gives a contextual representation that has information on all preceding utterances by different parties in the conversation. The speaker state depends on this context through attention and on the speaker's previous state. This ensures that at time $t$, the speaker state directly gets information from the speaker's previous state and from the global GRU, which has information on the preceding parties. Finally, the updated speaker state is fed into the emotion GRU to decode the emotion representation of the given utterance, which is used for emotion classification. At time $t$, the emotion GRU cell gets the emotion representation of $t-1$ and the speaker state of $t$.

The emotion GRU, along with the global GRU, plays a pivotal role in inter-party relation modeling. On the other hand, the party GRU models the relation between two sequential states of the same party. In DialogueRNN, all three types of GRUs are connected in a recurrent manner. We believe that DialogueRNN outperforms state-of-the-art contextual emotion classifiers such as [Hazarika et al.2018, Poria et al.2017] because of better context representation.

The rest of the paper is organized as follows: Section 2 discusses related work; Section 3 provides a detailed description of our model; Sections 4 and 5 present the experimental setting and results; finally, Section 6 concludes the paper.

2 Related Work

Emotion recognition has attracted attention in various fields such as natural language processing, psychology, cognitive science, and so on [Picard2010]. Ekman (1993) found correlation between emotion and facial cues. Datcu and Rothkrantz (2008) fused acoustic information with visual cues for emotion recognition. Alm, Roth, and Sproat (2005) introduced text-based emotion recognition, developed in the work of Strapparava and Mihalcea (2010). Wöllmer et al. (2010) used contextual information for emotion recognition in a multimodal setting. Recently, Poria et al. (2017) successfully used RNN-based deep networks for multimodal emotion recognition, which was followed by other works [Chen et al.2017, Zadeh et al.2018a, Zadeh et al.2018b].

Figure 1: In this dialogue, one party's emotion changes are influenced by the behavior of the other party.

Reproducing human interaction requires deep understanding of conversation. Ruusuvuori (2013) states that emotion plays a pivotal role in conversations. It has been argued that emotional dynamics in a conversation is an inter-personal phenomenon [Richards, Butler, and Gross2003]. Hence, our model incorporates inter-personal interactions in an effective way. Further, since conversations have a natural temporal nature, we capture this temporal nature through a recurrent network [Poria et al.2017].

Memory networks [Sukhbaatar et al.2015] have been successful in several NLP areas, including question answering [Sukhbaatar et al.2015, Kumar et al.2016], machine translation [Bahdanau, Cho, and Bengio2014], speech recognition [Graves, Wayne, and Danihelka2014], and so on. Thus, Hazarika et al. (2018) used memory networks for emotion recognition in dyadic conversations, where two distinct memory networks enabled inter-speaker interaction, yielding state-of-the-art performance.

3 Methodology

3.1 Problem Definition

Let there be $M$ parties/participants $p_1, p_2, \dots, p_M$ ($M = 2$ for the datasets we used) in a conversation. The task is to predict the emotion labels (happy, sad, neutral, angry, excited, and frustrated) of the constituent utterances $u_1, u_2, \dots, u_N$, where utterance $u_t$ is uttered by party $p_{s(u_t)}$, with $s$ being the mapping between an utterance and the index of its corresponding party. Also, $u_t \in \mathbb{R}^{D_m}$ is the utterance representation, obtained using the feature extractors described below.

3.2 Unimodal Feature Extraction

For a fair comparison with the state-of-the-art method, conversational memory networks (CMN) [Hazarika et al.2018], we follow identical feature extraction procedures.

Textual Feature Extraction

We employ convolutional neural networks (CNN) for textual feature extraction. Following Kim (2014), we obtain n-gram features from each utterance using three distinct convolution filters of sizes 3, 4, and 5, respectively, each having 50 feature maps. Outputs are then subjected to max-pooling followed by rectified linear unit (ReLU) activation. These activations are concatenated and fed to a dense layer, whose output is regarded as the textual utterance representation. This network is trained at utterance level with the emotion labels.
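The n-gram feature extraction described above can be sketched as follows. This is only an illustration in plain Python: the function names, the random weights, and the toy embedding size are assumptions, the final dense layer is omitted, and the utterance is assumed to have at least five tokens (shorter ones would need padding).

```python
import random

def make_filters(emb_dim, widths=(3, 4, 5), maps_per_width=50, seed=0):
    """Randomly initialised n-gram filters: (width, flat weights, bias) triples."""
    rng = random.Random(seed)
    return [(w, [rng.uniform(-0.1, 0.1) for _ in range(w * emb_dim)], 0.0)
            for w in widths for _ in range(maps_per_width)]

def ngram_features(embeddings, filters):
    """Convolve each filter over the token sequence, max-pool over time, apply ReLU."""
    feats = []
    for width, weights, bias in filters:
        responses = []
        for start in range(len(embeddings) - width + 1):
            # Flatten the window of `width` consecutive word vectors.
            window = [v for vec in embeddings[start:start + width] for v in vec]
            responses.append(sum(w * v for w, v in zip(weights, window)) + bias)
        pooled = max(responses)         # max-pooling over all window positions
        feats.append(max(0.0, pooled))  # ReLU after pooling, following the text
    return feats

# Toy utterance: 6 tokens with 8-dimensional embeddings (both sizes are arbitrary).
rng = random.Random(1)
utterance = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(6)]
features = ngram_features(utterance, make_filters(emb_dim=8))
print(len(features))  # 3 filter widths x 50 feature maps each
```

The concatenated vector has one entry per feature map, so three filter sizes with 50 maps each give 150 values before the dense layer.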

Audio and Visual Feature Extraction

Identical to Hazarika et al. (2018), we use 3D-CNN and openSMILE [Eyben, Wöllmer, and Schuller2010] for visual and acoustic feature extraction, respectively.

3.3 Our Model

We assume that the emotion of an utterance in a conversation depends on three major factors:

  1. the speaker.

  2. the context given by the preceding utterances.

  3. the emotion behind the preceding utterances.

Our model, DialogueRNN (implementation available at https://github.com/senticnet/conv-emotion), shown in Fig. 2(a), models these three factors as follows: each party is modeled using a party state, which changes as and when that party utters an utterance. This enables the model to track the parties' emotion dynamics through the conversations, which is related to the emotion behind the utterances. Furthermore, the context of an utterance is modeled using a global state (called global because it is shared among the parties), where the preceding utterances and the party states are jointly encoded for context representation, necessary for accurate party state representation. Finally, the model infers the emotion representation from the party state of the speaker along with the preceding speakers' states as context. This emotion representation is used for the final emotion classification.

We use GRU cells [Chung et al.2014] to update the states and representations. Each GRU cell computes a hidden state defined as $h_t = GRU_*(h_{t-1}, x_t)$, where $x_t$ is the current input and $h_{t-1}$ is the previous GRU state. $h_t$ also serves as the current GRU output. We provide the GRU computation details in the supplementary. GRUs are efficient networks with trainable parameters $W^{\{r,z,c\}}_{*,\{h,x\}}$ and $b^{\{r,z,c\}}_*$.
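For readers who want the recurrence spelled out, the following is a minimal sketch of a single GRU cell in plain Python. It follows one common gate convention (new state = (1 - z) * previous state + z * candidate); the supplementary material may use a different but equivalent arrangement. The helper names and the random initialisation are assumptions made for illustration, and a real implementation would use a tensor library.

```python
import math
import random

def matvec(W, x):
    return [sum(w * a for w, a in zip(row, x)) for row in W]

def affine(gate, x, h):
    """W_x x + W_h h + b for one gate's parameters."""
    return [a + b + c for a, b, c in
            zip(matvec(gate["Wx"], x), matvec(gate["Wh"], h), gate["b"])]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def init_gru(in_size, out_size, seed=0):
    """Random parameters for the reset (r), update (z) and candidate (c) gates."""
    rng = random.Random(seed)
    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return {g: {"Wx": mat(out_size, in_size), "Wh": mat(out_size, out_size),
                "b": [0.0] * out_size} for g in "rzc"}

def gru_cell(params, h_prev, x):
    r = sigmoid(affine(params["r"], x, h_prev))                 # reset gate
    z = sigmoid(affine(params["z"], x, h_prev))                 # update gate
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    c = [math.tanh(a) for a in affine(params["c"], x, gated)]   # candidate state
    # Interpolate between the previous state and the candidate.
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, c)]

params = init_gru(in_size=4, out_size=3)
h = [0.0, 0.0, 0.0]
for x in ([1.0, 0.0, -1.0, 0.5], [0.0, 2.0, 0.0, -0.5]):
    h = gru_cell(params, h, x)   # the new state is also the cell output
```

Each gate owns an input matrix, a recurrent matrix, and a bias, which corresponds to the three superscripts r, z, and c on the trainable parameters.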

We model the emotion representation of the current utterance as a function of the emotion representation of the previous utterance and the state of the current speaker. Finally, this emotion representation is sent to a softmax layer for emotion classification.

Figure 2: (a) DialogueRNN architecture. (b) Update schemes for global, speaker, listener, and emotion states for the $t$-th utterance in a dialogue. Here, Person $i$ is the speaker and Persons $j \neq i$ are the listeners.

Global State (Global GRU)

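Global state aims to capture the context of a given utterance by jointly encoding utterance and speaker state. Each state also serves as speaker-specific utterance representation. Attending on these states facilitates the inter-speaker and inter-utterance dependencies to produce improved context representation. The current utterance $u_t$ changes the speaker's state from $q_{s(u_t),t-1}$ to $q_{s(u_t),t}$. We capture this change with the GRU cell $GRU_\mathcal{G}$ with output size $D_\mathcal{G}$, using $u_t$ and $q_{s(u_t),t-1}$:

$$g_t = GRU_\mathcal{G}(g_{t-1}, (u_t \oplus q_{s(u_t),t-1})), \tag{1}$$

where $D_\mathcal{G}$ is the size of the global state vector, $D_\mathcal{P}$ is the size of the party state vector, $W^{\{r,z,c\}}_{\mathcal{G},h} \in \mathbb{R}^{D_\mathcal{G} \times D_\mathcal{G}}$, $W^{\{r,z,c\}}_{\mathcal{G},x} \in \mathbb{R}^{D_\mathcal{G} \times (D_m + D_\mathcal{P})}$, $b^{\{r,z,c\}}_\mathcal{G} \in \mathbb{R}^{D_\mathcal{G}}$, $q_{s(u_t),t-1} \in \mathbb{R}^{D_\mathcal{P}}$, $g_t, g_{t-1} \in \mathbb{R}^{D_\mathcal{G}}$, and $\oplus$ represents concatenation.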

Party State (Party GRU)

DialogueRNN keeps track of the state of individual speakers using fixed-size vectors throughout the conversation. These states are representative of the speakers' state in the conversation, relevant to emotion classification. We update these states based on the current (at time $t$) role of a participant in the conversation, which is either speaker or listener, and the incoming utterance $u_t$. These state vectors are initialized with null vectors for all the participants. The main purpose of this module is to ensure that the model is aware of the speaker of each utterance and handles it accordingly.

Speaker Update (Speaker GRU):

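A speaker usually frames the response based on the context, which is the preceding utterances in the conversation. Hence, we capture the context $c_t$ relevant to the utterance $u_t$ as follows:

$$\alpha = \mathrm{softmax}(u_t^T W_\alpha [g_1, g_2, \dots, g_{t-1}]), \tag{2}$$

$$\mathrm{softmax}(x) = [e^{x_1} / \Sigma_i e^{x_i}, e^{x_2} / \Sigma_i e^{x_i}, \dots], \tag{3}$$

$$c_t = \alpha [g_1, g_2, \dots, g_{t-1}]^T, \tag{4}$$

where $g_1, g_2, \dots, g_{t-1}$ are the preceding $t-1$ global states ($g_i \in \mathbb{R}^{D_\mathcal{G}}$), $W_\alpha \in \mathbb{R}^{D_m \times D_\mathcal{G}}$, $\alpha^T \in \mathbb{R}^{(t-1)}$, and $c_t \in \mathbb{R}^{D_\mathcal{G}}$. In Eq. 2, we calculate attention scores $\alpha$ over the previous global states representative of the previous utterances. This assigns higher attention scores to the utterances emotionally relevant to $u_t$. Finally, in Eq. 4 the context vector $c_t$ is calculated by pooling the previous global states with $\alpha$.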
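The attention pooling of Eqs. 2 and 4 can be illustrated with a small sketch in plain Python. The function names and the toy matrices are assumptions chosen so that the result is easy to check by hand; the weight matrix plays the role of the bilinear attention parameter that maps the utterance into the global-state space.

```python
import math

def softmax(scores):
    m = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(u_t, W_alpha, global_states):
    """Score each preceding global state against u_t, then pool them by the scores."""
    d_g = len(global_states[0])
    # u_t^T W_alpha: project the utterance into the global-state space.
    proj = [sum(u_t[k] * W_alpha[k][j] for k in range(len(u_t))) for j in range(d_g)]
    scores = [sum(p * g for p, g in zip(proj, g_i)) for g_i in global_states]
    alpha = softmax(scores)
    c_t = [sum(a * g_i[j] for a, g_i in zip(alpha, global_states)) for j in range(d_g)]
    return alpha, c_t

u_t = [1.0, 0.0]                                         # toy utterance, D_m = 2
W_alpha = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]             # D_m x D_G = 2 x 3
history = [[2.0, 0.0, 1.0], [0.0, 0.0, 1.0], [-2.0, 0.0, 1.0]]   # three preceding global states
alpha, c_t = context_vector(u_t, W_alpha, history)
```

In this toy case the first history state aligns best with the utterance and therefore receives the largest weight, so the context vector is pulled toward it.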

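Now, we employ a GRU cell $GRU_\mathcal{P}$ of output size $D_\mathcal{P}$ to update the current speaker state $q_{s(u_t),t-1}$ to the new state $q_{s(u_t),t}$ based on the incoming utterance $u_t$ and the context $c_t$:

$$q_{s(u_t),t} = GRU_\mathcal{P}(q_{s(u_t),t-1}, (u_t \oplus c_t)), \tag{5}$$

where $W^{\{r,z,c\}}_{\mathcal{P},h} \in \mathbb{R}^{D_\mathcal{P} \times D_\mathcal{P}}$, $W^{\{r,z,c\}}_{\mathcal{P},x} \in \mathbb{R}^{D_\mathcal{P} \times (D_m + D_\mathcal{G})}$, $b^{\{r,z,c\}}_\mathcal{P} \in \mathbb{R}^{D_\mathcal{P}}$, and $q_{s(u_t),t}, q_{s(u_t),t-1} \in \mathbb{R}^{D_\mathcal{P}}$. This encodes the information on the current utterance along with its context from the global GRU into the speaker's state $q_{s(u_t),t}$, which helps in emotion classification down the line.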

Listener Update:

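Listener state models the listeners' change of state due to the speaker's utterance. We tried two listener state update mechanisms:

  • Simply keep the state of the listener unchanged, that is

    $$\forall i \neq s(u_t),\; q_{i,t} = q_{i,t-1}. \tag{6}$$

  • Employ another GRU cell $GRU_\mathcal{L}$ to update the listener state based on listener visual cues (facial expression) $v_{i,t}$ and its context $c_t$, as

    $$\forall i \neq s(u_t),\; q_{i,t} = GRU_\mathcal{L}(q_{i,t-1}, (v_{i,t} \oplus c_t)), \tag{7}$$

    where $v_{i,t} \in \mathbb{R}^{D_\mathcal{V}}$, $W^{\{r,z,c\}}_{\mathcal{L},h} \in \mathbb{R}^{D_\mathcal{P} \times D_\mathcal{P}}$, $W^{\{r,z,c\}}_{\mathcal{L},x} \in \mathbb{R}^{D_\mathcal{P} \times (D_\mathcal{V} + D_\mathcal{G})}$, and $b^{\{r,z,c\}}_\mathcal{L} \in \mathbb{R}^{D_\mathcal{P}}$. Listener visual features $v_{i,t}$ of party $i$ at time $t$ are extracted using the model introduced by Arriaga, Valdenegro-Toro, and Plöger (2017), pretrained on the FER2013 dataset, with feature size $D_\mathcal{V}$.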

The simpler first approach turns out to be sufficient, since the second approach yields very similar results while increasing the number of parameters. This is due to the fact that a listener becomes relevant to the conversation only when he/she speaks. In other words, a silent party has no influence in a conversation. Now, when a party speaks, we update his/her state with the context $c_t$, which contains relevant information on all the preceding utterances, rendering an explicit listener state update unnecessary. This is shown in Table 2.

Emotion Representation (Emotion GRU)

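We infer the emotionally relevant representation $e_t$ of utterance $u_t$ from the speaker's state $q_{s(u_t),t}$ and the emotion representation of the previous utterance $e_{t-1}$. Since context is important to the emotion of the incoming utterance $u_t$, $e_{t-1}$ feeds fine-tuned emotionally relevant contextual information from the other party states into the emotion representation $e_t$. This establishes a connection between the speaker state and the other party states. Hence, we model $e_t$ with a GRU cell ($GRU_\mathcal{E}$) with output size $D_\mathcal{E}$ as

$$e_t = GRU_\mathcal{E}(e_{t-1}, q_{s(u_t),t}), \tag{8}$$

where $D_\mathcal{E}$ is the size of the emotion representation vector, $e_t, e_{t-1} \in \mathbb{R}^{D_\mathcal{E}}$, $W^{\{r,z,c\}}_{\mathcal{E},h} \in \mathbb{R}^{D_\mathcal{E} \times D_\mathcal{E}}$, $W^{\{r,z,c\}}_{\mathcal{E},x} \in \mathbb{R}^{D_\mathcal{E} \times D_\mathcal{P}}$, and $b^{\{r,z,c\}}_\mathcal{E} \in \mathbb{R}^{D_\mathcal{E}}$.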

Since the speaker state gets information from global states, which serve as speaker-specific utterance representation, one may claim that this way the model already has access to the information on other parties. However, as shown in the ablation study (Section 5.6), the emotion GRU helps to improve the performance by directly linking states of preceding parties. Further, we believe that the speaker and global GRUs ($GRU_\mathcal{P}$, $GRU_\mathcal{G}$) jointly act similar to an encoder, whereas the emotion GRU serves as a decoder.
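To make the order of the updates concrete, here is a hedged sketch of one time step: global state first, then the context from the preceding global states, then the speaker's party state (listeners left unchanged, as in the simpler listener scheme), and finally the emotion representation. The recurrent cells and the attention are passed in as callables; `toy_cell` and `mean_attend` are deliberately trivial placeholders and are not GRUs or the learned attention. Using a zero context vector for the first utterance is also an assumption.

```python
def dialogue_rnn_step(state, u_t, speaker, cell_g, cell_p, cell_e, attend):
    """One update step. `state` holds 'g0', 'g_hist', 'q' (one vector per party), 'e'."""
    q_prev = state["q"][speaker]
    g_prev = state["g_hist"][-1] if state["g_hist"] else state["g0"]
    # Global state: utterance concatenated with the speaker's previous party state.
    g_t = cell_g(g_prev, u_t + q_prev)
    # Context: attention over the *preceding* global states only.
    c_t = attend(u_t, state["g_hist"]) if state["g_hist"] else [0.0] * len(g_t)
    # Party state: only the speaker is updated; listeners keep their old state.
    q_new = list(state["q"])
    q_new[speaker] = cell_p(q_prev, u_t + c_t)
    # Emotion representation: previous representation plus updated speaker state.
    e_t = cell_e(state["e"], q_new[speaker])
    new_state = {"g0": state["g0"], "g_hist": state["g_hist"] + [g_t],
                 "q": q_new, "e": e_t}
    return new_state, e_t

def toy_cell(h_prev, x):
    # Placeholder recurrence (NOT a GRU): blend the old state with the input mean.
    m = sum(x) / len(x)
    return [0.5 * h + 0.5 * m for h in h_prev]

def mean_attend(u_t, g_hist):
    # Placeholder attention: uniform weights over the preceding global states.
    return [sum(col) / len(g_hist) for col in zip(*g_hist)]

state = {"g0": [0.0, 0.0, 0.0], "g_hist": [],
         "q": [[0.0, 0.0], [0.0, 0.0]], "e": [0.0, 0.0]}
dialogue = [([1.0, 0.0], 0), ([0.0, 1.0], 1), ([1.0, 1.0], 0)]  # (utterance, speaker index)
trace = []
for u_t, speaker in dialogue:
    state, e_t = dialogue_rnn_step(state, u_t, speaker,
                                   toy_cell, toy_cell, toy_cell, mean_attend)
    trace.append(state)
```

After the first step only party 0 has a non-zero state, and that state stays untouched while party 1 speaks in the second step, which is the behaviour the unchanged-listener scheme prescribes.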

Emotion Classification

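We use a two-layer perceptron with a final softmax layer to calculate $c = 6$ emotion-class probabilities from the emotion representation $e_t$ of utterance $u_t$, and then we pick the most likely emotion class:

$$l_t = \mathrm{ReLU}(W_l e_t + b_l), \tag{9}$$

$$\mathcal{P}_t = \mathrm{softmax}(W_{smax} l_t + b_{smax}), \tag{10}$$

$$\hat{y}_t = \mathrm{argmax}_i (\mathcal{P}_t[i]), \tag{11}$$

where $W_l \in \mathbb{R}^{D_l \times D_\mathcal{E}}$, $b_l \in \mathbb{R}^{D_l}$, $W_{smax} \in \mathbb{R}^{c \times D_l}$, $b_{smax} \in \mathbb{R}^{c}$, $\mathcal{P}_t \in \mathbb{R}^{c}$, and $\hat{y}_t$ is the predicted label for utterance $u_t$.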
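The classification head can be sketched in a few lines of plain Python: a ReLU layer, a softmax layer, and an argmax over the six classes. The weights below are hand-picked toy values (an assumption, not trained parameters) so that the predicted class is easy to verify.

```python
import math

LABELS = ["happy", "sad", "neutral", "angry", "excited", "frustrated"]

def classify(e_t, W_l, b_l, W_smax, b_smax):
    """Two-layer perceptron with a final softmax; returns (probabilities, class index)."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, e_t)) + b)
              for row, b in zip(W_l, b_l)]                       # ReLU layer
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(W_smax, b_smax)]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [v / total for v in exps]                            # softmax
    return probs, max(range(len(probs)), key=probs.__getitem__)  # argmax

# Toy sizes: emotion representation of size 2, hidden layer of size 2, 6 classes.
W_l, b_l = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
W_smax = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [2.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
b_smax = [0.0] * 6
probs, idx = classify([1.0, 0.0], W_l, b_l, W_smax, b_smax)
label = LABELS[idx]
```

Only the fourth output unit reacts to the first hidden unit in this toy setup, so the input above is assigned to the fourth label in the list.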


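We use categorical cross-entropy along with L2-regularization as the measure of loss ($L$) during training:

$$L = -\frac{1}{\sum_{s=1}^{N} c(s)} \sum_{i=1}^{N} \sum_{j=1}^{c(i)} \log \mathcal{P}_{i,j}[y_{i,j}] + \lambda \lVert\theta\rVert_2, \tag{12}$$

where $N$ is the number of samples/dialogues, $c(i)$ is the number of utterances in sample $i$, $\mathcal{P}_{i,j}$ is the probability distribution of emotion labels for utterance $j$ of dialogue $i$, $y_{i,j}$ is the expected class label of utterance $j$ of dialogue $i$, $\lambda$ is the L2-regularizer weight, and $\theta$ is the set of trainable parameters, where $\theta = \{W_\alpha, W^{\{r,z,c\}}_{\mathcal{P},\{h,x\}}, b^{\{r,z,c\}}_\mathcal{P}, W^{\{r,z,c\}}_{\mathcal{G},\{h,x\}}, b^{\{r,z,c\}}_\mathcal{G}, W^{\{r,z,c\}}_{\mathcal{E},\{h,x\}}, b^{\{r,z,c\}}_\mathcal{E}, W_l, b_l, W_{smax}, b_{smax}\}$.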

We used the stochastic-gradient-descent-based Adam optimizer [Kingma and Ba2014] to train our network. Hyperparameters are optimized using grid search (values are added to the supplementary material).
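As an illustration of the training objective, the sketch below computes the cross-entropy averaged over all utterances of all dialogues and adds an L2 penalty. It assumes the penalty is the plain (non-squared) norm of the flattened parameter vector; the function name and the toy inputs are hypothetical.

```python
import math

def dialogue_loss(probabilities, labels, params, l2_weight):
    """Mean negative log-likelihood over all utterances plus an L2 penalty.

    probabilities[i][j] is the predicted class distribution for utterance j of
    dialogue i, labels[i][j] the index of its gold class, and params a flat
    list of trainable parameter values.
    """
    total_utterances = sum(len(dialogue) for dialogue in labels)
    nll = -sum(math.log(p[y])
               for probs_d, labels_d in zip(probabilities, labels)
               for p, y in zip(probs_d, labels_d))
    l2_norm = math.sqrt(sum(w * w for w in params))
    return nll / total_utterances + l2_weight * l2_norm

# Two dialogues with 2 and 1 utterances, two classes for brevity.
probabilities = [[[0.5, 0.5], [1.0, 0.0]], [[0.25, 0.75]]]
labels = [[0, 0], [1]]
loss = dialogue_loss(probabilities, labels, params=[3.0, 4.0], l2_weight=0.1)
```

Normalising by the total utterance count, rather than per dialogue, means long dialogues contribute proportionally more terms to the average.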

3.4 DialogueRNN Variants

We use DialogueRNN (Section 3.3) as the basis for the following models:

DialogueRNN + Listener State Update (DialogueRNN_l):

This variant updates the listener state based on the resulting speaker state $q_{s(u_t),t}$, as described in Eq. 7.

Bidirectional DialogueRNN (BiDialogueRNN):

Bidirectional DialogueRNN is analogous to bidirectional RNNs, where two different RNNs are used for the forward and backward passes over the input sequence. Outputs from the two RNNs are concatenated at the sequence level. Similarly, in BiDialogueRNN, the final emotion representation contains information from both past and future utterances in the dialogue through the forward and backward DialogueRNNs, respectively, which provides better context for emotion classification.
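The bidirectional wrapping can be sketched independently of the recurrent model itself. Below, `running_sum` is a placeholder sequence model (an assumption standing in for a full DialogueRNN pass); the point is the reversal, realignment, and per-utterance concatenation.

```python
def bidirectional(sequence, forward_rnn, backward_rnn):
    """Concatenate, per position, the outputs of a forward and a backward pass."""
    fwd = forward_rnn(sequence)
    # Run the second model on the reversed dialogue, then restore the original order.
    bwd = backward_rnn(sequence[::-1])[::-1]
    return [f + b for f, b in zip(fwd, bwd)]   # list concatenation doubles the size

def running_sum(sequence):
    # Placeholder recurrent model: the output at step t summarises inputs up to t.
    outputs, acc = [], [0.0] * len(sequence[0])
    for x in sequence:
        acc = [a + b for a, b in zip(acc, x)]
        outputs.append(acc)
    return outputs

dialogue = [[1.0], [2.0], [3.0]]
combined = bidirectional(dialogue, running_sum, running_sum)
# combined[t] = [summary of utterances up to t, summary of utterances from t onward]
```

With this toy model the first half of each output reflects the past and the second half the future, so every position sees the whole dialogue.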

DialogueRNN + attention (DialogueRNN+Att):

For each emotion representation $e_t$, attention is applied over all surrounding emotion representations in the dialogue by matching them with $e_t$ (Eqs. 13 and 14). This provides context from the relevant (based on attention score) future and preceding utterances.

Bidirectional DialogueRNN + Emotional attention (BiDialogueRNN+Att):

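For each emotion representation $e_t$ of BiDialogueRNN, attention is applied over all the emotion representations in the dialogue to capture context from the other utterances in the dialogue:

$$\beta_t = \mathrm{softmax}(e_t^T W_\beta [e_1, e_2, \dots, e_N]), \tag{13}$$

$$\tilde{e}_t = \beta_t [e_1, e_2, \dots, e_N]^T, \tag{14}$$

where $e_t \in \mathbb{R}^{2D_\mathcal{E}}$, $W_\beta \in \mathbb{R}^{2D_\mathcal{E} \times 2D_\mathcal{E}}$, $\tilde{e}_t \in \mathbb{R}^{2D_\mathcal{E}}$, and $\beta_t^T \in \mathbb{R}^{N}$. Further, $\tilde{e}_t$ are fed to a two-layer perceptron for emotion classification, as in Eqs. 9, 10, and 11.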
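A small sketch of this attention over emotion representations follows. Unlike the context attention over global states, every utterance here attends over all utterances of the dialogue, past and future alike, which yields one row of weights per utterance. The function name and the toy vectors are assumptions, and an identity matrix stands in for the learned bilinear weight.

```python
import math

def emotion_attention(E, W_beta):
    """Return (B, E_tilde): B[t] are the attention weights of utterance t over
    all utterances, E_tilde[t] the attention-pooled emotion representation."""
    d = len(E[0])
    B, E_tilde = [], []
    for e_t in E:
        proj = [sum(e_t[k] * W_beta[k][j] for k in range(d)) for j in range(d)]
        scores = [sum(p * v for p, v in zip(proj, e_i)) for e_i in E]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        beta = [x / total for x in exps]
        B.append(beta)
        E_tilde.append([sum(b * e_i[j] for b, e_i in zip(beta, E)) for j in range(d)])
    return B, E_tilde

E = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]      # toy emotion representations
W_beta = [[1.0, 0.0], [0.0, 1.0]]             # identity, for readability
B, E_tilde = emotion_attention(E, W_beta)
```

In this example the first utterance gives more weight to the third (a future utterance with a similar representation) than to the second, which is the kind of future dependency the attention is meant to capture.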

4 Experimental Setting

4.1 Datasets Used

We use two emotion detection datasets, IEMOCAP [Busso et al.2008] and AVEC [Schuller et al.2012], to evaluate DialogueRNN. We partition both datasets into train and test sets with a roughly 80/20 ratio such that the partitions do not share any speaker. Table 1 shows the distribution of train and test samples for both datasets.

Dataset   Partition     Utterance Count   Dialogue Count
IEMOCAP   train + val   5810              120
          test          1623              31
AVEC      train + val   4368              63
          test          1430              32
Table 1: Dataset split ((train + val) / test).


The IEMOCAP dataset [Busso et al.2008] contains videos of two-way conversations of ten unique speakers, where only the first eight speakers, from sessions one to four, belong to the train set. Each video contains a single dyadic dialogue, segmented into utterances. The utterances are annotated with one of six emotion labels, which are happy, sad, neutral, angry, excited, and frustrated.


The AVEC dataset [Schuller et al.2012] is a modification of the SEMAINE database [McKeown et al.2012], containing interactions between humans and artificially intelligent agents. Each utterance of a dialogue is annotated with four real-valued affective attributes: valence, arousal, expectancy, and power. The annotations are available every 0.2 seconds in the original database. However, in order to adapt the annotations to our need of utterance-level annotation, we averaged the attributes over the span of an utterance.
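The averaging of frame-level annotations into utterance-level labels can be sketched as follows. The data layout (a list of per-frame dictionaries starting at time zero) and the function name are assumptions; only the 0.2-second step comes from the text.

```python
def utterance_level_labels(frames, start, end, step=0.2):
    """Average the frame-level attributes annotated inside [start, end) seconds.

    frames[k] holds the annotations at time k * step. Assumes the utterance
    spans at least one annotated frame.
    """
    first = int(round(start / step))
    last = int(round(end / step))
    span = frames[first:last]
    return {name: sum(f[name] for f in span) / len(span) for name in span[0]}

# Five annotated frames (0.0 s to 0.8 s) with two of the four attributes.
frames = [{"valence": 0.2 * k, "power": 10.0} for k in range(5)]
labels = utterance_level_labels(frames, start=0.2, end=0.8)
```

Here the utterance covers the frames at 0.2 s, 0.4 s, and 0.6 s, so its valence label is the mean of those three values.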

IEMOCAP
Method             Happy         Sad           Neutral       Angry         Excited       Frustrated    Average(w)
                   Acc.   F1     Acc.   F1     Acc.   F1     Acc.   F1     Acc.   F1     Acc.   F1     Acc.   F1
CNN                27.77  29.86  57.14  53.83  34.33  40.14  61.17  52.44  46.15  50.09  62.99  55.75  48.92  48.18
memnet             25.72  33.53  55.53  61.77  58.12  52.84  59.32  55.39  51.50  58.30  67.20  59.00  55.72  55.10
c-LSTM             29.17  34.43  57.14  60.87  54.17  51.81  57.06  56.73  51.17  57.95  67.19  58.92  55.21  54.95
c-LSTM+Att         30.56  35.63  56.73  62.90  57.55  53.00  59.41  59.24  52.84  58.85  65.88  59.41  56.32  56.19
CMN (SOTA)         25.00  30.38  55.92  62.41  52.86  52.39  61.76  59.83  55.52  60.25  71.13  60.69  56.56  56.13
DialogueRNN        31.25  33.83  66.12  69.83  63.02  57.76  61.76  62.50  61.54  64.45  59.58  59.46  59.33  59.89
DialogueRNN_l      35.42  35.54  65.71  69.85  55.73  55.30  62.94  61.85  59.20  62.21  63.52  59.38  58.66  58.76
BiDialogueRNN      32.64  36.15  71.02  74.04  60.47  56.16  62.94  63.88  56.52  62.02  65.62  61.73  60.32  60.28
DialogueRNN+Att    28.47  36.61  65.31  72.40  62.50  57.21  67.65  65.71  70.90  68.61  61.68  60.80  61.80  61.51
BiDialogueRNN+Att  25.69  33.18  75.10  78.80  58.59  59.21  64.71  65.28  80.27  71.86  61.15  58.91  63.40  62.75

AVEC
Method             Valence       Arousal       Expectancy    Power
                   MAE    r      MAE    r      MAE    r      MAE    r
CNN                0.545  -0.01  0.542  0.01   0.605  -0.01  8.71   0.19
memnet             0.202  0.16   0.211  0.24   0.216  0.23   8.97   0.05
c-LSTM             0.194  0.14   0.212  0.23   0.201  0.25   8.90   -0.04
c-LSTM+Att         0.189  0.16   0.213  0.25   0.190  0.24   8.67   0.10
CMN (SOTA)         0.192  0.23   0.213  0.29   0.195  0.26   8.74   -0.02
DialogueRNN        0.188  0.28   0.201  0.36   0.188  0.32   8.19   0.31
DialogueRNN_l      0.189  0.27   0.203  0.33   0.188  0.30   8.21   0.30
BiDialogueRNN      0.181  0.30   0.198  0.34   0.187  0.34   8.14   0.32
DialogueRNN+Att    0.173  0.35   0.168  0.55   0.177  0.37   7.91   0.35
BiDialogueRNN+Att  0.168  0.35   0.165  0.59   0.175  0.37   7.90   0.37
Table 2: Comparison with the baseline methods for textual modality; Acc. = Accuracy, MAE = Mean Absolute Error, r = Pearson correlation coefficient; Average(w) = Weighted average; DialogueRNN_l = DialogueRNN with listener state update.
Method                   IEMOCAP   AVEC (r)
                         F1        Valence   Arousal   Expectancy   Power
TFN                      56.8      0.01      0.10      0.12         0.12
MFN                      53.5      0.14      25        0.26         0.15
c-LSTM                   58.3      0.14      0.23      0.25         -0.04
CMN                      58.5      0.23      0.30      0.26         -0.02
BiDialogueRNN+att_text   62.7      0.35      0.59      0.37         0.37
BiDialogueRNN+att_MM     62.9      0.37      0.60      0.37         0.41
Table 3: Comparison with the baselines for trimodal (T+V+A) scenario; r = Pearson correlation coefficient. BiDialogueRNN+att_MM = BiDialogueRNN+att in multimodal setting.
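For reference, the regression metrics reported for AVEC and the weighted average used for IEMOCAP can be computed as in the sketch below. It assumes that the weighted average weights each class score by its number of test samples, which the caption does not state explicitly; the function names and toy numbers are illustrative.

```python
import math

def mean_absolute_error(y_true, y_pred):
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def pearson_r(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def weighted_average(scores, supports):
    """Average of per-class scores weighted by the number of samples per class."""
    return sum(s * n for s, n in zip(scores, supports)) / sum(supports)

y_true = [0.0, 0.5, 1.0]
y_pred = [0.1, 0.4, 1.2]
mae = mean_absolute_error(y_true, y_pred)
r = pearson_r(y_true, y_pred)
avg = weighted_average([50.0, 100.0], [3, 1])
```

A low MAE and a high r are complementary: MAE measures how close the predicted values are, while r measures whether they rise and fall with the gold values.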

4.2 Baselines and State of the Art

For a comprehensive evaluation of DialogueRNN, we compare our model with the following baseline methods:

c-LSTM [Poria et al.2017]:

Bidirectional LSTM [Hochreiter and Schmidhuber1997] is used to capture the context from the surrounding utterances to generate context-aware utterance representation. However, this model does not differentiate among the speakers.

c-LSTM+Att [Poria et al.2017]:

In this variant, attention is applied to the c-LSTM output at each timestamp by following Eqs. 13 and 14. This provides better context to the final utterance representation.

TFN [Zadeh et al.2017]:

This is specific to multimodal scenario. Tensor outer product is used to capture inter-modality and intra-modality interactions. This model does not capture context from surrounding utterances.

MFN [Zadeh et al.2018a]:

Specific to multimodal scenario, this model utilizes multi-view learning by modeling view-specific and cross-view interactions. Similar to TFN, this model does not use contextual information.

CNN [Kim2014]:

This is identical to our textual feature extractor network (Section 3.2) and it does not use contextual information from the surrounding utterances.

Memnet [Sukhbaatar et al.2015]:

As described in Hazarika et al. (2018), the current utterance is fed to a memory network, where the memories correspond to preceding utterances. The output from the memory network is used as the final utterance representation for emotion classification.

CMN [Hazarika et al.2018]:

This state-of-the-art method models utterance context from dialogue history using two distinct GRUs for two speakers. Finally, utterance representation is obtained by feeding the current utterance as query to two distinct memory networks for both speakers.

4.3 Modalities

We evaluated our model primarily on textual modality. However, to substantiate efficacy of our model in multimodal scenario, we also experimented with multimodal features.

5 Results and Discussion

We compare DialogueRNN and its variants with the baselines for textual data in Table 2. As expected, on average DialogueRNN outperforms all the baseline methods, including the state-of-the-art CMN, on both of the datasets.

5.1 Comparison with the State of the Art

We compare the performance of DialogueRNN against the performance of the state-of-the-art CMN on IEMOCAP and AVEC datasets for textual modality.


As evidenced by Table 2, for the IEMOCAP dataset, our model surpasses the state-of-the-art method CMN by 2.77 points of accuracy and 3.76 points of f1-score on average. We think that this enhancement is caused by the fundamental differences between CMN and DialogueRNN, which are

  1. party state modeling with $GRU_\mathcal{P}$ in Eq. 5,

  2. speaker-specific utterance treatment in Eqs. 1 and 5,

  3. and global state capturing with $GRU_\mathcal{G}$ in Eq. 1.

Since we deal with six unbalanced emotion labels, we also explored the model performance for individual labels. DialogueRNN outperforms the state-of-the-art method CMN in five out of six emotion classes by a significant margin. For the frustrated class, DialogueRNN lags behind CMN by 1.23 points of f1-score. We think that DialogueRNN may surpass CMN using a standalone classifier for the frustrated class. However, it can be observed in Table 2 that some of the other variants of DialogueRNN, like BiDialogueRNN, already outperform CMN for the frustrated class.


DialogueRNN outperforms CMN for the valence, arousal, expectancy, and power attributes; see Table 2. It yields lower mean absolute error (MAE) and higher Pearson correlation coefficient (r) for all four attributes. We believe this to be due to the incorporation of party state and emotion GRU, which are missing from CMN.
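The margins between DialogueRNN and CMN on IEMOCAP follow directly from the corresponding rows of Table 2. The short sketch below recomputes them; the dictionary layout is only an illustrative choice, while the numbers are copied from the table.

```python
# (average accuracy, average F1, frustrated F1) for the textual modality, from Table 2.
table2_iemocap = {
    "CMN": (56.56, 56.13, 60.69),
    "DialogueRNN": (59.33, 59.89, 59.46),
}

cmn = table2_iemocap["CMN"]
ours = table2_iemocap["DialogueRNN"]
acc_gain = round(ours[0] - cmn[0], 2)          # average accuracy margin
f1_gain = round(ours[1] - cmn[1], 2)           # average F1 margin
frustrated_gap = round(ours[2] - cmn[2], 2)    # negative: CMN is ahead on this class
print(acc_gain, f1_gain, frustrated_gap)
```

The first two differences are positive and the third is negative, matching the observation that DialogueRNN leads on average but trails CMN on the frustrated class.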

5.2 DialogueRNN vs. DialogueRNN Variants

We discuss the performance of different DialogueRNN variants on IEMOCAP and AVEC datasets for textual modality.


Following Table 2, using explicit listener state update (DialogueRNN_l) yields slightly worse performance than regular DialogueRNN. This is true for both the IEMOCAP and AVEC datasets in general. The only exception to this trend is the happy emotion label for IEMOCAP, where DialogueRNN_l outperforms DialogueRNN by 1.71 points of f1-score. We surmise that this is due to the fact that a listener becomes relevant to the conversation only when he/she speaks. Now, in DialogueRNN, when a party speaks, we update his/her state with the context, which contains relevant information on all the preceding utterances, rendering the explicit listener state update of DialogueRNN_l unnecessary.


Since BiDialogueRNN captures context from the future utterances, we expect improved performance from it over DialogueRNN. This is confirmed in Table 2, where BiDialogueRNN outperforms DialogueRNN on average on both datasets.


DialogueRNN+Att also uses information from the future utterances. However, here we take information from both past and future utterances by matching them with the current utterance and calculating attention scores over them. This gives relevance to emotionally important context utterances, yielding better performance than BiDialogueRNN. The improvement over BiDialogueRNN is 1.23 points of f1-score for IEMOCAP, and consistently lower MAE and higher r in AVEC.


Since BiDialogueRNN+Att generates the final emotion representation by attending over the emotion representations from BiDialogueRNN, we expect better performance than both BiDialogueRNN and DialogueRNN+Att. This is confirmed in Table 2, where this setting performs, in general, better than all the other methods discussed, on both datasets. This setting yields a 6.62 points higher f1-score on average than the state-of-the-art CMN and a 2.86 points higher f1-score than vanilla DialogueRNN for the IEMOCAP dataset. For the AVEC dataset also, this setting gives the best performance across all four attributes.

5.3 Multimodal Setting

As both the IEMOCAP and AVEC datasets contain multimodal information, we have evaluated DialogueRNN on multimodal features as used and provided by Hazarika et al. (2018). We use concatenation of the unimodal features as the fusion method, following Hazarika et al. (2018), since the fusion mechanism is not a focus of this paper. Now, as we can see in Table 3, DialogueRNN significantly outperforms the strong baselines and the state-of-the-art method CMN.

5.4 Case Studies

Dependency on preceding utterances (DialogueRNN)

One of the crucial components of DialogueRNN is its attention module over the outputs of the global GRU ($GRU_\mathcal{G}$). Figure 3(b) shows the attention vector $\alpha$ (Eq. 2) over the history of a given test utterance compared with the attention vector from the CMN model. The attention of our model is more focused compared with CMN: the latter gives diluted attention scores, leading to misclassifications. We observe this trend of focused attention across cases and posit that it can be interpreted as a confidence indicator. Further, in this example, the test utterance (turn 44) comprises a change in emotion from neutral to frustrated. DialogueRNN anticipates this correctly by attending to turns 41 and 42, which are spoken by different parties. These two utterances provide self and inter-party influences that trigger the emotional shift. CMN, however, fails to capture such dependencies and wrongly predicts neutral emotion.

Dependency on future utterances (BiDialogueRNN+Att)

Fig. 3(a) visualizes the attention $\beta$ (Eq. 13) over the emotion representations for a segment of a conversation between a couple. In the discussion, the woman is initially at a neutral state, whereas the man is angry throughout. The figure reveals that the emotional attention of the woman is localized to the duration of her neutral state (turns 1-16 approximately). For example, several turns in the dialogue strongly attend to one particular turn. Interestingly, one turn attends to both past and future utterances. Similar trends across other utterances establish inter-dependence between the emotional states of future and past utterances.

Figure 3: (a) Illustration of the attention over emotion representations $e_t$; (b) Comparison of attention scores over utterance history of CMN and DialogueRNN (attention of Eq. 2). (c) An example of long-term dependency among utterances. (d) Histogram of distance between the target utterance and its context utterance based on attention scores.

The benefit of considering future utterances is also apparent through turns that focus on the distant future, where the man is at an enraged state, thus capturing emotional correlations across time. Although one of the turns is misclassified by our model, the model still manages to infer a related emotional state (anger) in place of the correct state (frustrated). We analyze more of this trend in Section 5.5.

Dependency on distant context

For all correct predictions in the IEMOCAP test set, Fig. 3(d) summarizes the distribution over the relative distance between the test utterance and its highest attended utterance, either in the history or in the future of the conversation. This reveals a decreasing trend, with the highest dependence being within the local context. However, a significant portion of the test utterances attend to utterances that are many turns away from themselves, which highlights the important role of long-term emotional dependencies. Such cases primarily occur in conversations that maintain a specific affective tone and do not incur frequent emotional shifts. Fig. 3(c) demonstrates a case of long-term context dependency. The presented conversation maintains a happy mood throughout the dialogue. Although the turn comprising the sentence "Horrible thing. I hated it." seems to be a negative expression, when seen with the global context, it reveals the excitement present in the speaker. To disambiguate such cases, our model attends to distant utterances in the past, which serve as prototypes of the emotional tonality of the overall conversation.
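A distance histogram of this kind can be derived from per-utterance attention scores as in the sketch below. The input layout and the function name are assumptions, and excluding the utterance itself from the candidates is a choice made here for illustration rather than something the text specifies.

```python
from collections import Counter

def attention_distance_histogram(attention_rows):
    """Count, per distance, how often the top-attended utterance lies that far away.

    attention_rows: list of (t, scores), where scores[i] is the attention that
    utterance t places on utterance i of the same dialogue.
    """
    hist = Counter()
    for t, scores in attention_rows:
        candidates = [i for i in range(len(scores)) if i != t]   # skip the utterance itself
        top = max(candidates, key=lambda i: scores[i])
        hist[abs(top - t)] += 1        # distance in turns, past or future
    return hist

rows = [
    (0, [0.1, 0.7, 0.2]),   # attends mostly to its neighbour
    (2, [0.6, 0.3, 0.1]),   # attends mostly to an utterance two turns back
    (1, [0.2, 0.1, 0.7]),   # attends mostly to the following utterance
]
hist = attention_distance_histogram(rows)
```

Plotting the counts against distance gives the kind of decreasing curve described above, with a tail made of the long-range cases.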

5.5 Error Analysis

A noticeable trend in the predictions is the high level of cross-predictions amongst related emotions. Most of the misclassifications by the model for the happy emotion are for the excited class. Also, angry and frustrated share misclassifications with each other. We suspect this is due to the subtle difference between those emotion pairs, resulting in harder disambiguation. Another class with a high rate of false positives is the neutral class. The primary reason for this could be its majority in the class distribution over the considered emotions.

At the dialogue level, we observe that a significant amount of errors occur at turns having a change of emotion from the previous turn of the same party. Across all the occurrences of these emotional shifts in the test set, the model's rate of correct predictions is lower than the success it achieves in regions of no emotional shift. Changes of emotion in a dialogue are a complex phenomenon governed by latent dynamics. Further improvement on these cases remains an open area of research.
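The split between emotional-shift turns and the remaining turns can be computed as in the following sketch. A shift is taken to mean that the gold label of a turn differs from the gold label of the same party's previous turn; the function name and the toy dialogue are assumptions.

```python
def shift_accuracy(speakers, gold, predicted):
    """Accuracy on turns whose gold emotion differs from the same party's previous
    turn, and on turns where it does not. Each party's first turn is skipped."""
    last_label = {}
    shift = [0, 0]      # [correct, total] on emotional-shift turns
    steady = [0, 0]     # [correct, total] on no-shift turns
    for spk, y, y_hat in zip(speakers, gold, predicted):
        if spk in last_label:
            bucket = shift if y != last_label[spk] else steady
            bucket[0] += int(y == y_hat)
            bucket[1] += 1
        last_label[spk] = y
    ratio = lambda b: b[0] / b[1] if b[1] else None
    return ratio(shift), ratio(steady)

speakers = ["A", "B", "A", "B", "A"]
gold = ["neutral", "angry", "frustrated", "angry", "frustrated"]
predicted = ["neutral", "angry", "neutral", "angry", "frustrated"]
shift_acc, steady_acc = shift_accuracy(speakers, gold, predicted)
```

In the toy dialogue the only shift (party A going from neutral to frustrated) is missed, while both no-shift turns are predicted correctly, mirroring the trend described above.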

5.6 Ablation Study

The main novelty of our method is the introduction of party state and emotion GRU ($GRU_\mathcal{E}$). To comprehensively study the impact of these two components, we remove them one at a time and evaluate their impact on IEMOCAP.

Party State   Emotion GRU   F1
-             +             55.56
+             -             57.38
+             +             59.89
Table 4: Ablated DialogueRNN for IEMOCAP dataset.

As expected, following Table 4, party state is very important, as without it the performance falls by 4.33 points. We suspect that party state helps in extracting useful contextual information relevant to the parties' emotion.

Emotion GRU is also impactful, but less so than party state, as its absence causes performance to fall by only 2.51 points. We believe the reason to be the lack of context flow from the other parties' states through the emotion representation of the preceding utterances.

6 Conclusion

We have presented an RNN-based neural architecture for emotion detection in a conversation. In contrast to the state-of-the-art method, CMN, our method treats each incoming utterance taking into account the characteristics of the speaker, which gives finer context to the utterance. Our model outperforms the current state of the art on two distinct datasets in both textual and multimodal settings. Our method is designed to be scalable to the multi-party setting with more than two speakers, though we could not test it due to the unavailability of a multi-party conversation dataset with emotion labels. This is left to our future work.


References

  • [Alm, Roth, and Sproat2005] Alm, C. O.; Roth, D.; and Sproat, R. 2005. Emotions from text: machine learning for text-based emotion prediction. In Proceedings of the conference on human language technology and empirical methods in natural language processing, 579–586. Association for Computational Linguistics.
  • [Arriaga, Valdenegro-Toro, and Plöger2017] Arriaga, O.; Valdenegro-Toro, M.; and Plöger, P. 2017. Real-time convolutional neural networks for emotion and gender classification. CoRR abs/1710.07557.
  • [Bahdanau, Cho, and Bengio2014] Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
  • [Busso et al.2008] Busso, C.; Bulut, M.; Lee, C.-C.; Kazemzadeh, A.; Mower, E.; Kim, S.; Chang, J. N.; Lee, S.; and Narayanan, S. S. 2008. IEMOCAP: Interactive emotional dyadic motion capture database. Language resources and evaluation 42(4):335–359.
  • [Chen et al.2017] Chen, M.; Wang, S.; Liang, P. P.; Baltrušaitis, T.; Zadeh, A.; and Morency, L.-P. 2017. Multimodal sentiment analysis with word-level fusion and reinforcement learning. In Proceedings of the 19th ACM International Conference on Multimodal Interaction, 163–171. ACM.
  • [Chung et al.2014] Chung, J.; Gülçehre, Ç.; Cho, K.; and Bengio, Y. 2014. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. CoRR abs/1412.3555.
  • [Datcu and Rothkrantz2008] Datcu, D., and Rothkrantz, L. 2008. Semantic audio-visual data fusion for automatic emotion recognition. Euromedia’2008.
  • [Ekman1993] Ekman, P. 1993. Facial expression and emotion. American psychologist 48(4):384.
  • [Eyben, Wöllmer, and Schuller2010] Eyben, F.; Wöllmer, M.; and Schuller, B. 2010. openSMILE: the Munich versatile and fast open-source audio feature extractor. In Proceedings of the 18th ACM international conference on Multimedia, 1459–1462. ACM.
  • [Graves, Wayne, and Danihelka2014] Graves, A.; Wayne, G.; and Danihelka, I. 2014. Neural turing machines. arXiv preprint arXiv:1410.5401.
  • [Hazarika et al.2018] Hazarika, D.; Poria, S.; Zadeh, A.; Cambria, E.; Morency, L.-P.; and Zimmermann, R. 2018. Conversational Memory Network for Emotion Recognition in Dyadic Dialogue Videos. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2122–2132. New Orleans, Louisiana: Association for Computational Linguistics.
  • [Hochreiter and Schmidhuber1997] Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural computation 9(8):1735–1780.
  • [Kim2014] Kim, Y. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.
  • [Kingma and Ba2014] Kingma, D. P., and Ba, J. 2014. Adam: A Method for Stochastic Optimization. CoRR abs/1412.6980.
  • [Kumar et al.2016] Kumar, A.; Irsoy, O.; Ondruska, P.; Iyyer, M.; Bradbury, J.; Gulrajani, I.; Zhong, V.; Paulus, R.; and Socher, R. 2016. Ask me anything: Dynamic memory networks for natural language processing. In International Conference on Machine Learning, 1378–1387.
  • [McKeown et al.2012] McKeown, G.; Valstar, M.; Cowie, R.; Pantic, M.; and Schroder, M. 2012. The SEMAINE Database: Annotated Multimodal Records of Emotionally Colored Conversations between a Person and a Limited Agent. IEEE Transactions on Affective Computing 3(1):5–17.
  • [Picard2010] Picard, R. W. 2010. Affective computing: From laughter to IEEE. IEEE Transactions on Affective Computing 1(1):11–17.
  • [Poria et al.2017] Poria, S.; Cambria, E.; Hazarika, D.; Majumder, N.; Zadeh, A.; and Morency, L.-P. 2017. Context-Dependent Sentiment Analysis in User-Generated Videos. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 873–883. Vancouver, Canada: Association for Computational Linguistics.
  • [Richards, Butler, and Gross2003] Richards, J. M.; Butler, E. A.; and Gross, J. J. 2003. Emotion regulation in romantic relationships: The cognitive consequences of concealing feelings. Journal of Social and Personal Relationships 20(5):599–620.
  • [Ruusuvuori2013] Ruusuvuori, J. 2013. Emotion, affect and conversation. The handbook of conversation analysis 330–349.
  • [Schuller et al.2012] Schuller, B.; Valstar, M.; Eyben, F.; Cowie, R.; and Pantic, M. 2012. AVEC 2012: The Continuous Audio/Visual Emotion Challenge. In Proceedings of the 14th ACM International Conference on Multimodal Interaction, ICMI ’12, 449–456. New York, NY, USA: ACM.
  • [Strapparava and Mihalcea2010] Strapparava, C., and Mihalcea, R. 2010. Annotating and identifying emotions in text. In Intelligent Information Access. Springer. 21–38.
  • [Sukhbaatar et al.2015] Sukhbaatar, S.; Szlam, A.; Weston, J.; and Fergus, R. 2015. End-to-end Memory Networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS’15, 2440–2448. Cambridge, MA, USA: MIT Press.
  • [Wöllmer et al.2010] Wöllmer, M.; Metallinou, A.; Eyben, F.; Schuller, B.; and Narayanan, S. S. 2010. Context-sensitive multimodal emotion recognition from speech and facial expression using bidirectional LSTM modeling. In INTERSPEECH 2010.
  • [Zadeh et al.2017] Zadeh, A.; Chen, M.; Poria, S.; Cambria, E.; and Morency, L.-P. 2017. Tensor Fusion Network for Multimodal Sentiment Analysis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 1103–1114. Copenhagen, Denmark: Association for Computational Linguistics.
  • [Zadeh et al.2018a] Zadeh, A.; Liang, P. P.; Mazumder, N.; Poria, S.; Cambria, E.; and Morency, L.-P. 2018a. Memory Fusion Network for Multi-view Sequential Learning. In AAAI Conference on Artificial Intelligence, 5634–5641.
  • [Zadeh et al.2018b] Zadeh, A.; Liang, P. P.; Poria, S.; Vij, P.; Cambria, E.; and Morency, L.-P. 2018b. Multi-attention recurrent network for human communication comprehension. In AAAI Conference on Artificial Intelligence, 5642–5649.

7 Supplementary Material

Hyperparameter   DialogueRNN   BiDialogueRNN   DialogueRNN+Att   BiDialogueRNN+Att
                 300           150             150               150
                 400           150             150               150
                 400           100             100               100
                 200           100             100               100
                 0.0001        0.0001          0.0001            0.0001
                 0.00001       0.00001         0.00001           0.00001
Table 5: Hyperparameter values for DialogueRNN variants; = learning rate.

7.1 GRU Details

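We use GRU cells [Chung et al.2014], defined as $h_t = GRU(h_{t-1}, x_t)$, where, in the standard formulation:

$$
\begin{aligned}
r_t &= \sigma(W_r [h_{t-1}; x_t] + b_r), \\
z_t &= \sigma(W_z [h_{t-1}; x_t] + b_z), \\
\tilde{h}_t &= \tanh(W_h [r_t \odot h_{t-1}; x_t] + b_h), \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t.
\end{aligned}
$$

$x_t$ is the current input, $h_{t-1}$ is the previous GRU output, and $h_t$ is the current GRU output. $W_{\{r,z,h\}}$ and $b_{\{r,z,h\}}$ are GRU parameters. $r_t$ and $z_t$ are the reset gate and update gate, respectively. $\tilde{h}_t$ is the candidate output. $\sigma$ is the logistic sigmoid, $[\cdot\,;\cdot]$ denotes concatenation, and $\odot$ stands for the Hadamard product.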
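The GRU update described above can be sketched in a few lines of plain Python. This is a minimal illustration of a standard GRU cell, not the authors’ implementation: the class name `GRUCell`, the random initialization, the choice of one weight matrix per gate over the concatenation of previous output and input, and the interpolation convention of Chung et al. (the update gate weights the candidate) are all assumptions of this sketch.

```python
import math
import random

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

class GRUCell:
    """Minimal GRU cell; each gate has a weight matrix over [h_prev; x] and a bias."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = random.Random(seed)
        cols = hidden_size + input_size

        def mat():
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                    for _ in range(hidden_size)]

        self.W_r, self.W_z, self.W_h = mat(), mat(), mat()
        self.b_r = [0.0] * hidden_size
        self.b_z = [0.0] * hidden_size
        self.b_h = [0.0] * hidden_size

    def __call__(self, h_prev, x):
        hx = h_prev + x  # list concatenation: [h_prev; x]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.W_r, hx), self.b_r)]  # reset gate
        z = [sigmoid(a + b) for a, b in zip(matvec(self.W_z, hx), self.b_z)]  # update gate
        # Candidate output uses the reset-gated previous output (elementwise product).
        gated = [ri * hi for ri, hi in zip(r, h_prev)] + x
        cand = [math.tanh(a + b) for a, b in zip(matvec(self.W_h, gated), self.b_h)]
        # Interpolate elementwise between the previous output and the candidate.
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, cand)]

cell = GRUCell(input_size=3, hidden_size=2)
h = cell([0.0, 0.0], [1.0, -1.0, 0.5])
```

Some libraries use the opposite interpolation convention (the update gate weights the previous output); the two are equivalent up to relabeling the gate.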

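procedure DialogueRNN($U$, $S$)  ▷ $U$ = utterances in the conversation, $S$ = speakers
      Initialize the participant states with null vector:
      for $i$ : [1, $M$] do
            $q_{i,0} \leftarrow \mathbf{0}$
      Set the initial global and emotional state as null vector:
      $g_0 \leftarrow \mathbf{0}$
      $e_0 \leftarrow \mathbf{0}$
      Pass the dialogue through RNN:
      for $t$ : [1, $N$] do
            $e_t, g_t, q_t \leftarrow$ DialogueCell($e_{t-1}$, $g_{t-1}$, $q_{t-1}$, $U_t$, $S_t$)
      return $e$

procedure DialogueCell($e_{t-1}$, $g_{t-1}$, $q_{t-1}$, $u_t$, $s_t$)
      Update global state:
      $g_t \leftarrow GRU_{\mathcal{G}}(g_{t-1}, u_t \oplus q_{s_t,t-1})$
      Get context from preceding global states:
      $c_t \leftarrow Attention([g_1, \dots, g_{t-1}], u_t)$
      Update participant states:
      for $i$ : [1, $M$] do
            if $i = s_t$ then
                 Update speaker state:
                 $q_{i,t} \leftarrow GRU_{\mathcal{P}}(q_{i,t-1}, u_t \oplus c_t)$
            else
                 Update listener state:
                 $q_{i,t} \leftarrow q_{i,t-1}$
      Update emotion representation:
      $e_t \leftarrow GRU_{\mathcal{E}}(e_{t-1}, q_{s_t,t})$
      return $e_t$, $g_t$, $q_t$
Algorithm 1 DialogueRNN algorithm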
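The steps of Algorithm 1 can be sketched end to end in plain Python. This is a minimal sketch under stated assumptions, not the authors’ implementation: the helper names (`GRU`, `attend`, `dialogue_rnn`), the state size `d`, the random untrained weights, the dot-product form of the attention, and the choice to leave listener states unchanged are all illustrative. The emotion classifier that would be applied to each emotion representation is omitted.

```python
import math
import random

def _affine(W, b, v):
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def _sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

class GRU:
    """Compact GRU cell over the concatenation [h_prev; x] (weights are random)."""
    def __init__(self, n_in, n_hid, rng):
        def mat():
            return [[rng.uniform(-0.1, 0.1) for _ in range(n_hid + n_in)]
                    for _ in range(n_hid)]
        self.W = {k: mat() for k in "rzh"}
        self.b = {k: [0.0] * n_hid for k in "rzh"}

    def __call__(self, h, x):
        r = [_sigmoid(a) for a in _affine(self.W["r"], self.b["r"], h + x)]
        z = [_sigmoid(a) for a in _affine(self.W["z"], self.b["z"], h + x)]
        gated = [ri * hi for ri, hi in zip(r, h)] + x
        c = [math.tanh(a) for a in _affine(self.W["h"], self.b["h"], gated)]
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, c)]

def attend(history, u, W_att):
    """Softmax attention over preceding global states, scored against utterance u."""
    if not history:
        return [0.0] * len(W_att)  # no context before the first utterance
    key = [sum(w * x for w, x in zip(row, u)) for row in W_att]
    scores = [sum(gi * ki for gi, ki in zip(g, key)) for g in history]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * g[k] for w, g in zip(weights, history)) / total
            for k in range(len(key))]

def dialogue_rnn(utterances, speakers, n_parties, d_u, d=4, seed=0):
    """Return one emotion representation per utterance."""
    rng = random.Random(seed)
    gru_g = GRU(d_u + d, d, rng)  # global GRU: utterance + previous speaker state
    gru_p = GRU(d_u + d, d, rng)  # party GRU: utterance + context
    gru_e = GRU(d, d, rng)        # emotion GRU: updated speaker state
    W_att = [[rng.uniform(-0.1, 0.1) for _ in range(d_u)] for _ in range(d)]

    q = [[0.0] * d for _ in range(n_parties)]  # participant states start at zero
    g, e = [0.0] * d, [0.0] * d                # global and emotion states start at zero
    history, outputs = [], []
    for u, s in zip(utterances, speakers):
        g = gru_g(g, u + q[s])               # update global state
        context = attend(history, u, W_att)  # context from preceding global states
        q[s] = gru_p(q[s], u + context)      # update speaker state; listeners unchanged
        e = gru_e(e, q[s])                   # update emotion representation
        history.append(g)
        outputs.append(e)
    return outputs

# Toy dialogue: three 3-dimensional utterance vectors, two parties (hypothetical data).
utterances = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4], [0.5, 0.5, -0.2]]
speakers = [0, 1, 0]
reps = dialogue_rnn(utterances, speakers, n_parties=2, d_u=3)
```

Because only the current speaker’s state is passed through the party GRU, the loop over all participants in the pseudocode reduces to a single update per utterance in this sketch.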
Figure 4: In this dialogue, one party’s emotion changes under the influence of the other party’s behavior.
Figure 5: DialogueRNN architecture.
Figure 6: Update schemes for global, speaker, listener, and emotion states for an utterance in a dialogue. Here, one person is the speaker and the other persons are the listeners.
Figure 7: Illustration of the attention over emotion representations.
Figure 8: Comparison of attention scores over the utterance history of CMN and DialogueRNN.
Figure 9: An example of long-term dependency among utterances.
Figure 10: Histogram of distance between the target utterance and its context utterance based on attention scores.