This repo contains implementations of several architectures for emotion recognition in conversations.
Emotion detection in conversations is a necessary step for a number of applications, including opinion mining over chat history, social media threads, debates, argumentation mining, understanding consumer feedback in live conversations, etc. Currently, systems do not treat the parties in the conversation individually by adapting to the speaker of each utterance. In this paper, we describe a new method based on recurrent neural networks that keeps track of the individual party states throughout the conversation and uses this information for emotion classification. Our model outperforms the state of the art by a significant margin on two different datasets.
Emotion detection in conversation attracts increasing attention from the community due to its applications in many important tasks, such as opinion mining over chat history and social media threads on YouTube, Facebook, Twitter, etc. In this paper, we present a method based on recurrent neural networks (RNN) that can cater to these needs by processing the huge amount of available conversational data.
Current systems, including the state of the art [Hazarika et al.2018], do not distinguish different parties in a conversation in a meaningful way. They are not aware of the speaker of a given utterance. In contrast, we model each party with a party state that evolves as the conversation flows, based on the utterance, the context, and the current party state. Our model is based on the assumption that there are three major aspects relevant to the emotion in a conversation: the speaker, the context from the preceding utterances, and the emotion of the preceding utterances. These three aspects are not necessarily independent, but their separate modeling significantly outperforms the state of the art (Table 2). In dyadic conversations, the parties have distinct roles. Hence, to extract the context, it is crucial to consider the preceding turns of both speaker and listener at a given moment (Fig. 1).
Our DialogueRNN employs three gated recurrent units (GRU) [Chung et al.2014] to model these aspects. The incoming utterance is fed into two GRUs, called the global GRU and the party GRU, to update the context and the party state, respectively. The global GRU encodes the corresponding party information while encoding an utterance.
Attending over this GRU gives a contextual representation that has information on all preceding utterances by the different parties in the conversation. The speaker state depends on this context through attention and on the speaker’s previous state. This ensures that at time $t$, the speaker state directly gets information from the speaker’s previous state and from the global GRU, which has information on the preceding parties. Finally, the updated speaker state is fed into the emotion GRU to decode the emotion representation of the given utterance, which is used for emotion classification. At time $t$, the emotion GRU cell gets the emotion representation of $t-1$ and the speaker state of $t$.
The emotion GRU, along with the global GRU, plays a pivotal role in inter-party relation modeling. On the other hand, the party GRU models the relation between two sequential states of the same party. In DialogueRNN, all three types of GRUs are connected in a recurrent manner. We believe that DialogueRNN outperforms state-of-the-art contextual emotion classifiers such as [Hazarika et al.2018, Poria et al.2017] because of better context representation.
Emotion recognition has attracted attention in various fields such as natural language processing, psychology, cognitive science, and so on [Picard2010]. Ekman (1993) found correlation between emotion and facial cues.
Datcu and Rothkrantz (2008) fused acoustic information with visual cues for emotion recognition. Alm, Roth, and Sproat (2005) introduced text-based emotion recognition, developed further in the work of Strapparava and Mihalcea (2010). Wöllmer et al. (2010) used contextual information for emotion recognition in a multimodal setting. Recently, Poria et al. (2017) successfully used RNN-based deep networks for multimodal emotion recognition, which was followed by other works [Chen et al.2017, Zadeh et al.2018a, Zadeh et al.2018b].
Reproducing human interaction requires a deep understanding of conversation. Ruusuvuori (2013) states that emotion plays a pivotal role in conversations. It has been argued that emotional dynamics in a conversation is an inter-personal phenomenon [Richards, Butler, and Gross2003]. Hence, our model incorporates inter-personal interactions in an effective way. Further, since conversations are inherently temporal, we model this temporal nature through a recurrent network [Poria et al.2017].
Memory networks [Sukhbaatar et al.2015] have been successful in several NLP areas, including question answering [Sukhbaatar et al.2015, Kumar et al.2016], machine translation [Bahdanau, Cho, and Bengio2014], speech recognition [Graves, Wayne, and Danihelka2014], and so on. Thus, Hazarika et al. (2018) used memory networks for emotion recognition in dyadic conversations, where two distinct memory networks enabled inter-speaker interaction, yielding state-of-the-art performance.
Let there be $M$ parties/participants $p_1, p_2, \dots, p_M$ ($M = 2$ for the datasets we used) in a conversation. The task is to predict the emotion labels (happy, sad, neutral, angry, excited, and frustrated) of the constituent utterances $u_1, u_2, \dots, u_N$, where utterance $u_t$ is uttered by party $p_{s(u_t)}$, with $s$ being the mapping between an utterance and the index of its corresponding party. Also, $u_t \in \mathbb{R}^{D_m}$ is the utterance representation, obtained using the feature extractors described below.
For a fair comparison with the state-of-the-art method, conversational memory networks (CMN) [Hazarika et al.2018], we follow identical feature extraction procedures.
We employ convolutional neural networks (CNN) for textual feature extraction. Following Kim (2014), we obtain n-gram features from each utterance using three distinct convolution filters of sizes 3, 4, and 5 respectively, each having 50 feature-maps. Outputs are then subjected to max-pooling followed by rectified linear unit (ReLU) activation. These activations are concatenated and fed to a dense layer, whose output is regarded as the textual utterance representation. This network is trained at utterance level with the emotion labels.
Identical to Hazarika et al. (2018), we use 3D-CNN and openSMILE [Eyben, Wöllmer, and Schuller2010] for visual and acoustic feature extraction, respectively.
We assume that the emotion of an utterance in a conversation depends on three major factors:
the speaker.
the context given by the preceding utterances.
the emotion behind the preceding utterances.
Our model, DialogueRNN (implementation available at https://github.com/senticnet/conv-emotion), shown in Fig. 1(a), models these three factors as follows: each party is modeled using a party state, which changes as and when that party utters an utterance. This enables the model to track the parties’ emotion dynamics through the conversations, which is related to the emotion behind the utterances. Furthermore, the context of an utterance is modeled using a global state (called global because it is shared among the parties), where the preceding utterances and the party states are jointly encoded for context representation, necessary for accurate party state representation. Finally, the model infers the emotion representation from the party state of the speaker along with the preceding speakers’ states as context. This emotion representation is used for the final emotion classification.
We use GRU cells [Chung et al.2014] to update the states and representations. Each GRU cell computes a hidden state defined as $h_t = GRU_*(h_{t-1}, x_t)$, where $x_t$ is the current input and $h_{t-1}$ is the previous GRU state. $h_t$ also serves as the current GRU output. We provide the GRU computation details in the supplementary. GRUs are efficient networks with trainable parameters $W_{*,\{h,x\}}^{\{r,z,c\}}$ and $b_*^{\{r,z,c\}}$.
We model the emotion representation of the current utterance as a function of the emotion representation of the previous utterance and the state of the current speaker. Finally, this emotion representation is sent to a softmax layer for emotion classification.
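Global state aims to capture the context of a given utterance by jointly encoding the utterance and the speaker state. Each state also serves as a speaker-specific utterance representation. Attending on these states facilitates the inter-speaker and inter-utterance dependencies to produce improved context representation. The current utterance $u_t$ changes the speaker’s state from $q_{s(u_t),t-1}$ to $q_{s(u_t),t}$. We capture this change with the GRU cell $GRU_\mathcal{G}$ with output size $D_\mathcal{G}$, using $u_t$ and $q_{s(u_t),t-1}$:
$$g_t = GRU_\mathcal{G}(g_{t-1}, (u_t \oplus q_{s(u_t),t-1})), \tag{1}$$
where $D_\mathcal{G}$ is the size of the global state vector, $D_\mathcal{P}$ is the size of the party state vector, $W_{\mathcal{G},h}^{\{r,z,c\}} \in \mathbb{R}^{D_\mathcal{G} \times D_\mathcal{G}}$, $W_{\mathcal{G},x}^{\{r,z,c\}} \in \mathbb{R}^{D_\mathcal{G} \times (D_m + D_\mathcal{P})}$, $b_\mathcal{G}^{\{r,z,c\}} \in \mathbb{R}^{D_\mathcal{G}}$, $q_{s(u_t),t-1} \in \mathbb{R}^{D_\mathcal{P}}$, $g_t, g_{t-1} \in \mathbb{R}^{D_\mathcal{G}}$, and $\oplus$ represents concatenation.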
DialogueRNN keeps track of the state of individual speakers using fixed-size vectors $q_1, q_2, \dots, q_M$ throughout the conversation. These states are representative of the speakers’ state in the conversation, relevant to emotion classification. We update these states based on the current (at time $t$) role of a participant in the conversation, which is either speaker or listener, and the incoming utterance $u_t$. These state vectors are initialized with null vectors for all the participants. The main purpose of this module is to ensure that the model is aware of the speaker of each utterance and handles it accordingly.
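The speaker usually frames the response based on the context, which is the preceding utterances in the conversation. Hence, we capture the context $c_t$ relevant to the utterance $u_t$ as follows:
$$\alpha = \mathrm{softmax}(u_t^T W_\alpha [g_1, g_2, \dots, g_{t-1}]), \tag{2}$$
$$\mathrm{softmax}(x) = \left[ e^{x_1} / \textstyle\sum_i e^{x_i},\; e^{x_2} / \textstyle\sum_i e^{x_i},\; \dots \right], \tag{3}$$
$$c_t = \alpha [g_1, g_2, \dots, g_{t-1}]^T, \tag{4}$$
where $g_1, g_2, \dots, g_{t-1}$ are the preceding $t-1$ global states ($g_i \in \mathbb{R}^{D_\mathcal{G}}$), $W_\alpha \in \mathbb{R}^{D_m \times D_\mathcal{G}}$, $\alpha^T \in \mathbb{R}^{t-1}$, and $c_t \in \mathbb{R}^{D_\mathcal{G}}$. In Eq. 2, we calculate attention scores $\alpha$ over the previous global states representative of the previous utterances. This assigns higher attention scores to the utterances emotionally relevant to $u_t$. Finally, in Eq. 4 the context vector $c_t$ is calculated by pooling the previous global states with $\alpha$.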
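The attention pooling described for Eqs. 2 and 4 can be sketched in a few lines of plain Python. This is only an illustrative sketch under assumptions: the function names `softmax` and `context_vector`, the list-of-lists matrix layout, and the zero context returned when there is no history are choices made here and are not taken from the released implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(u_t, W_alpha, global_states):
    """Pool the preceding global states into a context vector.

    u_t: utterance vector (length D_m)
    W_alpha: D_m x D_G matrix given as a list of rows
    global_states: the preceding global states, each of length D_G
    Returns (c_t, alpha).
    """
    d_g = len(W_alpha[0])
    if not global_states:
        # Assumption: no history yet, so use a zero context vector.
        return [0.0] * d_g, []
    # u_t^T W_alpha, a vector of length D_G
    projected = [sum(u * row[j] for u, row in zip(u_t, W_alpha))
                 for j in range(d_g)]
    # One relevance score per preceding global state
    scores = [sum(p * x for p, x in zip(projected, g)) for g in global_states]
    alpha = softmax(scores)
    # Attention-weighted sum of the global states
    c_t = [sum(a * g[j] for a, g in zip(alpha, global_states))
           for j in range(d_g)]
    return c_t, alpha

identity = [[1.0, 0.0], [0.0, 1.0]]
c_t, alpha = context_vector([1.0, 0.0], identity, [[1.0, 0.0], [0.0, 1.0]])
```

In this toy call the first global state is aligned with the utterance, so it receives the larger attention weight and dominates the pooled context.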
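Now, we employ a GRU cell $GRU_\mathcal{P}$ of output size $D_\mathcal{P}$ to update the current speaker state $q_{s(u_t),t-1}$ to the new state $q_{s(u_t),t}$, based on the incoming utterance $u_t$ and the context $c_t$:
$$q_{s(u_t),t} = GRU_\mathcal{P}(q_{s(u_t),t-1}, (u_t \oplus c_t)), \tag{5}$$
where $W_{\mathcal{P},h}^{\{r,z,c\}} \in \mathbb{R}^{D_\mathcal{P} \times D_\mathcal{P}}$, $W_{\mathcal{P},x}^{\{r,z,c\}} \in \mathbb{R}^{D_\mathcal{P} \times (D_m + D_\mathcal{G})}$, $b_\mathcal{P}^{\{r,z,c\}} \in \mathbb{R}^{D_\mathcal{P}}$, and $q_{s(u_t),t}, q_{s(u_t),t-1} \in \mathbb{R}^{D_\mathcal{P}}$. This encodes the information on the current utterance along with its context from the global GRU into the speaker’s state $q_{s(u_t),t}$, which helps in emotion classification down the line.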
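Listener state models the listeners’ change of state due to the speaker’s utterance. We tried two listener state update mechanisms:
Simply keep the state of the listener unchanged, that is
$$\forall i \neq s(u_t),\quad q_{i,t} = q_{i,t-1}. \tag{6}$$
Employ another GRU cell $GRU_\mathcal{L}$ to update the listener state based on listener visual cues (facial expression) $v_{i,t}$ and its context $c_t$, as
$$\forall i \neq s(u_t),\quad q_{i,t} = GRU_\mathcal{L}(q_{i,t-1}, (v_{i,t} \oplus c_t)), \tag{7}$$
where $v_{i,t} \in \mathbb{R}^{D_\mathcal{V}}$, $W_{\mathcal{L},h}^{\{r,z,c\}} \in \mathbb{R}^{D_\mathcal{P} \times D_\mathcal{P}}$, $W_{\mathcal{L},x}^{\{r,z,c\}} \in \mathbb{R}^{D_\mathcal{P} \times (D_\mathcal{V} + D_\mathcal{G})}$, and $b_\mathcal{L}^{\{r,z,c\}} \in \mathbb{R}^{D_\mathcal{P}}$. Listener visual features $v_{i,t}$ of party $i$ at time $t$ are extracted using the model introduced by Arriaga, Valdenegro-Toro, and Plöger (2017), pretrained on the FER2013 dataset, with feature size $D_\mathcal{V}$.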
The simpler first approach turns out to be sufficient, since the second approach yields very similar results while increasing the number of parameters. This is due to the fact that a listener becomes relevant to the conversation only when he/she speaks. In other words, a silent party has no influence in a conversation. When a party speaks, we update his/her state with the context $c_t$, which contains relevant information on all the preceding utterances, rendering an explicit listener state update unnecessary. This is shown in Table 2.
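We infer the emotionally relevant representation $e_t$ of utterance $u_t$ from the speaker’s state $q_{s(u_t),t}$ and the emotion representation of the previous utterance $e_{t-1}$. Since context is important to the emotion of the incoming utterance $u_t$, $e_{t-1}$ feeds fine-tuned emotionally relevant contextual information from the other party states $q_{s(u_{<t}),<t}$ into the emotion representation $e_t$. This establishes a connection between the speaker state and the other party states. Hence, we model $e_t$ with a GRU cell ($GRU_\mathcal{E}$) with output size $D_\mathcal{E}$ as
$$e_t = GRU_\mathcal{E}(e_{t-1}, q_{s(u_t),t}), \tag{8}$$
where $D_\mathcal{E}$ is the size of the emotion representation vector, $e_t, e_{t-1} \in \mathbb{R}^{D_\mathcal{E}}$, $W_{\mathcal{E},h}^{\{r,z,c\}} \in \mathbb{R}^{D_\mathcal{E} \times D_\mathcal{E}}$, $W_{\mathcal{E},x}^{\{r,z,c\}} \in \mathbb{R}^{D_\mathcal{E} \times D_\mathcal{P}}$, and $b_\mathcal{E}^{\{r,z,c\}} \in \mathbb{R}^{D_\mathcal{E}}$.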
Since the speaker state gets information from the global states, which serve as speaker-specific utterance representations, one may claim that this way the model already has access to the information on other parties. However, as shown in the ablation study (Section 5.6), the emotion GRU helps to improve the performance by directly linking the states of preceding parties. Further, we believe that the speaker and global GRUs ($GRU_\mathcal{P}$, $GRU_\mathcal{G}$) jointly act similarly to an encoder, whereas the emotion GRU serves as a decoder.
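To make the data flow between the global, party, and emotion GRUs concrete, the following sketch wires one time step together with the simpler listener rule (listener states left unchanged). It is a hedged illustration, not the authors' code: `dialogue_rnn_step`, `attend`, the `state` dictionary layout, and the zero initial global state are assumptions, and `toy_cell` is a hypothetical stand-in that only mimics the `(previous state, input) -> new state` interface of a trained GRU cell.

```python
import math

def attend(u_t, W_alpha, history):
    """Softmax attention over the global states that precede this turn."""
    d_g = len(W_alpha[0])
    if not history:
        return [0.0] * d_g  # assumption: zero context at the first turn
    proj = [sum(u * row[j] for u, row in zip(u_t, W_alpha)) for j in range(d_g)]
    scores = [sum(p * x for p, x in zip(proj, g)) for g in history]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum((e / total) * g[j] for e, g in zip(exps, history))
            for j in range(d_g)]

def dialogue_rnn_step(u_t, speaker, state, cells, W_alpha):
    """Advance all states by one utterance and return the emotion vector."""
    q_prev = state["q"][speaker]
    g_prev = state["g"][-1] if state["g"] else [0.0] * len(W_alpha[0])
    # Global state: utterance concatenated with the speaker's previous state
    g_t = cells["G"](g_prev, u_t + q_prev)
    # Context: attention over global states from earlier turns only
    c_t = attend(u_t, W_alpha, state["g"])
    # Party state: only the speaker is updated; listeners stay unchanged
    state["q"][speaker] = cells["P"](q_prev, u_t + c_t)
    # Emotion representation: previous emotion plus updated speaker state
    state["e"] = cells["E"](state["e"], state["q"][speaker])
    state["g"].append(g_t)
    return state["e"]

def toy_cell(h_prev, x):
    """Hypothetical stand-in for a trained GRU cell (NOT a real GRU)."""
    mean_x = sum(x) / len(x)
    return [0.5 * h + 0.5 * math.tanh(mean_x) for h in h_prev]

cells = {"G": toy_cell, "P": toy_cell, "E": toy_cell}
state = {"q": {0: [0.0, 0.0], 1: [0.0, 0.0]}, "g": [], "e": [0.0, 0.0]}
W_alpha = [[1.0, 0.0], [0.0, 1.0]]
e_1 = dialogue_rnn_step([1.0, 1.0], 0, state, cells, W_alpha)
```

Passing the cells in as callables keeps the sketch independent of any particular GRU implementation, so the wiring can be inspected separately from the cell arithmetic.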
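We use categorical cross-entropy along with L2-regularization as the measure of loss ($L$) during training:
$$L = -\frac{1}{\sum_{s=1}^{N} c(s)} \sum_{i=1}^{N} \sum_{j=1}^{c(i)} \log P_{i,j}[y_{i,j}] + \lambda \lVert \theta \rVert_2, \tag{12}$$
where $N$ is the number of samples/dialogues, $c(i)$ is the number of utterances in sample $i$, $P_{i,j}$ is the probability distribution of emotion labels for utterance $j$ of dialogue $i$, $y_{i,j}$ is the expected class label of utterance $j$ of dialogue $i$, $\lambda$ is the L2-regularizer weight, and $\theta$ is the set of trainable parameters.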
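As a rough illustration of the training objective, the sketch below averages the negative log-probability of the expected label over all utterances of all dialogues and adds an L2 penalty. The function name `dialogue_loss`, the nested-list data layout, and the choice of an unsquared L2 norm over a flat parameter list are assumptions made for this example.

```python
import math

def dialogue_loss(probs, labels, params, lam):
    """Cross-entropy averaged over utterances, plus an L2 penalty.

    probs[i][j]: predicted label distribution for utterance j of dialogue i
    labels[i][j]: index of the expected class for that utterance
    params: flat list of trainable parameter values (assumed layout)
    lam: L2-regularizer weight
    """
    n_utterances = sum(len(dialogue) for dialogue in labels)
    nll = -sum(
        math.log(probs[i][j][labels[i][j]])
        for i in range(len(labels))
        for j in range(len(labels[i]))
    )
    l2_norm = math.sqrt(sum(p * p for p in params))
    return nll / n_utterances + lam * l2_norm

# Two dialogues with one and two utterances, two classes each
probs = [[[0.5, 0.5]], [[0.25, 0.75], [1.0, 0.0]]]
labels = [[0], [1, 0]]
loss = dialogue_loss(probs, labels, params=[3.0, 4.0], lam=0.1)
```

Note that the normalizer is the total number of utterances, not the number of dialogues, so long dialogues are not down-weighted per utterance.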
We use DialogueRNN (Section 3.3) as the basis for the following models:
DialogueRNN$_l$: This variant updates the listener state based on the resulting speaker state $q_{s(u_t),t}$, as described in Eq. 7.
Bidirectional DialogueRNN (BiDialogueRNN) is analogous to bidirectional RNNs, where two different RNNs are used for the forward and backward passes of the input sequence. Outputs from the RNNs are concatenated at the sequence level. Similarly, in BiDialogueRNN, the final emotion representation contains information from both past and future utterances in the dialogue through forward and backward DialogueRNNs respectively, which provides better context for emotion classification.
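The forward/backward concatenation can be sketched as follows. This is a minimal sketch under assumptions: `bidirectional` is a name chosen here, each model is any callable mapping a non-empty sequence of vectors to a same-length sequence of vectors, and `running_sum` is a hypothetical stand-in for a recurrent model, used only so the example runs.

```python
def bidirectional(forward_rnn, backward_rnn, utterances):
    """Concatenate per-utterance outputs of a forward and a backward pass."""
    fwd = forward_rnn(utterances)
    # Run the second model on the reversed sequence, then restore the order
    bwd = backward_rnn(utterances[::-1])[::-1]
    return [f + b for f, b in zip(fwd, bwd)]

def running_sum(seq):
    """Hypothetical stand-in for a recurrent model: cumulative sum."""
    outputs, total = [], [0.0] * len(seq[0])
    for x in seq:
        total = [a + b for a, b in zip(total, x)]
        outputs.append(total)
    return outputs

combined = bidirectional(running_sum, running_sum, [[1.0], [2.0], [3.0]])
```

Each output vector then carries a summary of the past in its first half and a summary of the future in its second half.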
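For each emotion representation $e_t$ of BiDialogueRNN, attention is applied over all the emotion representations in the dialogue to capture context from the other utterances in the dialogue:
$$\beta_t = \mathrm{softmax}(e_t^T W_\beta [e_1, e_2, \dots, e_N]), \tag{13}$$
$$\tilde{e}_t = \beta_t [e_1, e_2, \dots, e_N]^T, \tag{14}$$
where $N$ here denotes the number of utterances in the dialogue, $W_\beta$ is a trainable attention matrix, and $\tilde{e}_t$ is the attended emotion representation.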
We use two emotion detection datasets, IEMOCAP [Busso et al.2008] and AVEC [Schuller et al.2012], to evaluate DialogueRNN. We partition both datasets into train and test sets such that the partitions do not share any speaker. Table 1 shows the distribution of train and test samples for both datasets.
|Dataset|Partition|Utterance count|Dialogue count|
|IEMOCAP|train + val|5810|120|
|AVEC|train + val|4368|63|
IEMOCAP [Busso et al.2008] dataset contains videos of two-way conversations of ten unique speakers, where only the first eight speakers from session one to four belong to the train-set. Each video contains a single dyadic dialogue, segmented into utterances. The utterances are annotated with one of six emotion labels, which are happy, sad, neutral, angry, excited, and frustrated.
The AVEC dataset [Schuller et al.2012] contains interactions between humans and artificially intelligent agents. Each utterance of a dialogue is annotated with four real-valued affective attributes: valence, arousal, expectancy, and power. The annotations are available every 0.2 seconds in the original database. However, in order to adapt the annotations to our need for utterance-level annotation, we averaged the attributes over the span of an utterance.
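The averaging of frame-level annotations over an utterance span can be sketched as below. The function name `average_over_span`, the `(timestamp, value)` frame format, and the half-open interval convention are assumptions made for illustration; the text only states that attributes were averaged over the span of an utterance.

```python
def average_over_span(frames, start, end):
    """Average annotation values whose timestamps fall in [start, end).

    frames: list of (timestamp_in_seconds, value) pairs
    start, end: utterance boundaries in seconds
    """
    inside = [value for timestamp, value in frames if start <= timestamp < end]
    if not inside:
        raise ValueError("no annotation frames inside the utterance span")
    return sum(inside) / len(inside)

# Valence annotated every 0.2 seconds (toy values)
valence_frames = [(0.0, 0.1), (0.2, 0.3), (0.4, 0.5), (0.6, 0.9)]
utterance_valence = average_over_span(valence_frames, 0.2, 0.6)
```

The same helper would be applied once per attribute to obtain the four utterance-level targets.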
For a comprehensive evaluation of DialogueRNN, we compare our model with the following baseline methods:
Bidirectional LSTM [Hochreiter and Schmidhuber1997] is used to capture the context from the surrounding utterances to generate context-aware utterance representation. However, this model does not differentiate among the speakers.
TFN: This model is specific to the multimodal scenario. Tensor outer product is used to capture inter-modality and intra-modality interactions. This model does not capture context from surrounding utterances.
MFN: Specific to the multimodal scenario, this model utilizes multi-view learning by modeling view-specific and cross-view interactions. Similar to TFN, this model does not use contextual information.
CNN: This is identical to our textual feature extractor network (Section 3.2), and it does not use contextual information from the surrounding utterances.
Memnet: As described in Hazarika et al. (2018), the current utterance is fed to a memory network, where the memories correspond to preceding utterances. The output from the memory network is used as the final utterance representation for emotion classification.
CMN: This state-of-the-art method models utterance context from dialogue history using two distinct GRUs for two speakers. Finally, utterance representation is obtained by feeding the current utterance as query to two distinct memory networks for both speakers.
We evaluated our model primarily on textual modality. However, to substantiate efficacy of our model in multimodal scenario, we also experimented with multimodal features.
We compare DialogueRNN and its variants with the baselines for textual data in Table 2. As expected, on average DialogueRNN outperforms all the baseline methods, including the state-of-the-art CMN, on both of the datasets.
We compare the performance of DialogueRNN against the performance of the state-of-the-art CMN on IEMOCAP and AVEC datasets for textual modality.
As evidenced by Table 2, on the IEMOCAP dataset our model surpasses the state-of-the-art method CMN in both accuracy and f1-score on average. We think that this enhancement is caused by the fundamental differences between CMN and DialogueRNN.
Since we deal with six unbalanced emotion labels, we also explored the model performance for individual labels. DialogueRNN outperforms the state-of-the-art method CMN in five out of six emotion classes by a significant margin. For the frustrated class, DialogueRNN lags behind CMN in f1-score. We think that DialogueRNN may surpass CMN using a standalone classifier for the frustrated class. However, it can be observed in Table 2 that some of the other variants of DialogueRNN, such as BiDialogueRNN, already outperform CMN for the frustrated class.
DialogueRNN outperforms CMN for the valence, arousal, expectancy, and power attributes; see Table 2. It yields significantly lower mean absolute error ($MAE$) and higher Pearson correlation coefficient ($r$) for all four attributes. We believe this to be due to the incorporation of party state and emotion GRU, which are missing from CMN.
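For reference, the two regression metrics named above can be computed as in the following sketch. The function names are chosen here for illustration, and the code uses the standard textbook definitions of mean absolute error and the Pearson correlation coefficient; it is not taken from the evaluation scripts of the paper.

```python
import math

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between targets and predictions."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((a - mean_t) * (b - mean_p) for a, b in zip(y_true, y_pred))
    var_t = sum((a - mean_t) ** 2 for a in y_true)
    var_p = sum((b - mean_p) ** 2 for b in y_pred)
    if var_t == 0 or var_p == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_t * var_p)

targets = [1.0, 2.0, 3.0, 4.0]
predictions = [2.0, 4.0, 6.0, 8.0]
mae = mean_absolute_error(targets, predictions)
r = pearson_r(targets, predictions)
```

The toy data shows why both metrics are reported: predictions can be perfectly correlated with the targets while still having a large absolute error.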
We discuss the performance of different DialogueRNN variants on IEMOCAP and AVEC datasets for textual modality.
As shown in Table 2, using an explicit listener state update yields slightly worse performance than regular DialogueRNN. This is true for both IEMOCAP and AVEC datasets in general. The only exception to this trend is the happy emotion label for IEMOCAP, where DialogueRNN$_l$ outperforms DialogueRNN in f1-score. We surmise that this is due to the fact that a listener becomes relevant to the conversation only when he/she speaks. In DialogueRNN, when a party speaks, we update his/her state with the context, which contains relevant information on all the preceding utterances, rendering the explicit listener state update of DialogueRNN$_l$ unnecessary.
Since BiDialogueRNN captures context from the future utterances, we expect improved performance from it over DialogueRNN. This is confirmed in Table 2, where BiDialogueRNN outperforms DialogueRNN on average on both datasets.
DialogueRNN+Attn also uses information from the future utterances. However, here we take information from both past and future utterances by matching them with the current utterance and calculating attention scores over them. This gives more weight to emotionally important context utterances, yielding better performance than BiDialogueRNN. The improvement over BiDialogueRNN shows as a higher f1-score for IEMOCAP and as consistently lower $MAE$ and higher $r$ on AVEC.
Since this setting generates the final emotion representation by attending over the emotion representation from BiDialogueRNN, we expect better performance than both BiDialogueRNN and DialogueRNN+Attn. This is confirmed in Table 2, where this setting performs the best overall among the methods discussed, on both datasets. This setting yields a higher f1-score on average than both the state-of-the-art CMN and vanilla DialogueRNN on the IEMOCAP dataset. On the AVEC dataset as well, this setting gives the best performance across all four attributes.
As both the IEMOCAP and AVEC datasets contain multimodal information, we have evaluated DialogueRNN on multimodal features as used and provided by Hazarika et al. (2018). Following Hazarika et al. (2018), we use concatenation of the unimodal features as the fusion method, since the fusion mechanism is not a focus of this paper. As we can see in Table 3, DialogueRNN significantly outperforms the strong baselines and the state-of-the-art method CMN.
One of the crucial components of DialogueRNN is its attention module over the outputs of the global GRU ($GRU_\mathcal{G}$). Figure 2(b) shows the $\alpha$ attention vector (Eq. 2) over the history of a given test utterance compared with the attention vector from the CMN model. The attention of our model is more focused compared with CMN: the latter gives diluted attention scores, leading to misclassifications. We observe this trend of focused attention across cases and posit that it can be interpreted as a confidence indicator. Further, in this example, the test utterance (turn 44) involves a change in emotion from neutral to frustrated. DialogueRNN anticipates this correctly by attending to turns 41 and 42, which are spoken by different parties. These two utterances provide self and inter-party influences that trigger the emotional shift. CMN, however, fails to capture such dependencies and wrongly predicts neutral emotion.
Fig. 2(a) visualizes the attention of Eq. 13 over the emotion representations for a segment of a conversation between a couple. In the discussion, the woman is initially in a neutral state, whereas the man is angry throughout. The figure reveals that the emotional attention of the woman is localized to the duration of her neutral state (turns 1-16 approximately). For example, several of her turns in this span strongly attend to a single later turn. Interestingly, one turn attends to both past and future utterances. Similar trends across other utterances establish inter-dependence between the emotional states of future and past utterances.
The benefit of considering future utterances is also apparent in turns that focus on the distant future, where the man is in an enraged state, thus capturing emotional correlations across time. Although one of these turns is misclassified by our model, the model still manages to infer a related emotional state (anger) against the correct state (frustrated). We analyze more of this trend in section 5.5.
For all correct predictions in the IEMOCAP test set, Fig. 2(d) summarizes the distribution over the relative distance between a test utterance and its highest attended utterances, either in the history or in the future of the conversation. This reveals a decreasing trend, with the highest dependence being within the local context. However, a significant portion of the test utterances attend to utterances that are distant from themselves, which highlights the important role of long-term emotional dependencies. Such cases primarily occur in conversations that maintain a specific affective tone and do not incur frequent emotional shifts. Fig. 2(c) demonstrates a case of long-term context dependency. The presented conversation maintains a happy mood throughout the dialogue. Although the turn comprising the sentence "Horrible thing. I hated it." seems to be a negative expression, when seen with the global context it reveals the excitement present in the speaker. To disambiguate such cases, our model attends to distant utterances in the past, which serve as prototypes of the emotional tonality of the overall conversation.
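A distance distribution of this kind could be computed along the following lines. This is a sketch under assumptions: the function name, the `rank` parameter (0 for the highest attended utterance, 1 for the second highest), and the input layout of one attention vector per utterance over the whole dialogue are all chosen here for illustration.

```python
from collections import Counter

def attended_distance_histogram(attention_rows, correct, rank=0):
    """Histogram of |t - k| where k is the rank-th most attended utterance.

    attention_rows[t]: attention weights of utterance t over the dialogue
    correct[t]: True if utterance t was classified correctly
    rank: 0 for the highest attended utterance, 1 for the second highest
    """
    hist = Counter()
    for t, (row, ok) in enumerate(zip(attention_rows, correct)):
        if not ok or rank >= len(row):
            continue  # only correct predictions are counted
        order = sorted(range(len(row)), key=lambda k: row[k], reverse=True)
        hist[abs(order[rank] - t)] += 1
    return hist

rows = [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1], [0.2, 0.2, 0.6]]
hist = attended_distance_histogram(rows, [True, True, False])
```

A histogram concentrated at small distances indicates reliance on local context, while mass at large distances indicates long-term dependencies.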
A noticeable trend in the predictions is the high level of cross-predictions amongst related emotions. Most of the misclassifications by the model for happy emotion are for excited class. Also, anger and frustrated share misclassifications amongst each other. We suspect this is due to subtle difference between those emotion pairs, resulting in harder disambiguation. Another class with high rate of false-positives is the neutral class. Primary reason for this could be its majority in the class distribution over the considered emotions.
At the dialogue level, we observe that a significant number of errors occur at turns having a change of emotion from the previous turn of the same party. Across all the occurrences of these emotional shifts in the test set, our model predicts a smaller fraction of instances correctly than it does in regions with no emotional shift. Changes of emotion in a dialogue are a complex phenomenon governed by latent dynamics. Further improvement on these cases remains an open area of research.
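The notion of an emotional shift used in this analysis (a turn whose label differs from the previous turn of the same party) can be made concrete with a short sketch. The function names and the list-based inputs are assumptions for illustration and do not come from the paper's analysis code.

```python
def emotional_shift_turns(speakers, labels):
    """Indices of turns whose label differs from the same party's last turn."""
    last_label = {}
    shifts = []
    for t, (speaker, label) in enumerate(zip(speakers, labels)):
        if speaker in last_label and last_label[speaker] != label:
            shifts.append(t)
        last_label[speaker] = label
    return shifts

def shift_accuracy(speakers, gold, predicted):
    """Accuracy restricted to emotional-shift turns (None if there are none)."""
    shifts = emotional_shift_turns(speakers, gold)
    if not shifts:
        return None
    return sum(gold[t] == predicted[t] for t in shifts) / len(shifts)

speakers = ["A", "B", "A", "B", "A"]
gold = ["neutral", "angry", "frustrated", "neutral", "frustrated"]
predicted = ["neutral", "angry", "frustrated", "angry", "frustrated"]
accuracy_at_shifts = shift_accuracy(speakers, gold, predicted)
```

Comparing this restricted accuracy with the accuracy on the remaining turns gives the kind of contrast discussed above.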
The main novelty of our method is the introduction of party state and emotion GRU ($GRU_\mathcal{E}$). To comprehensively study the impact of these two components, we remove them one at a time and evaluate their impact on IEMOCAP.
As expected, Table 4 shows that party state is very important: without it, the performance falls. We suspect that party state helps in extracting useful contextual information relevant to the parties’ emotion.
Emotion GRU is also impactful, but less so than party state, as its absence causes a smaller fall in performance. We believe the reason to be the lack of context flow from the other parties’ states through the emotion representation of the preceding utterances.
We have presented an RNN-based neural architecture for emotion detection in a conversation. In contrast to the state-of-the-art method, CMN, our method treats each incoming utterance taking into account characteristics of the speaker, which gives finer context to the utterance. Our model outperforms the current state of the art on two distinct datasets in both textual and multimodal setting. Our method is designed to be scalable to multi-party setting with more than two speakers, though we could not test it due to unavailability of a multi-party conversation dataset with emotion labels. This is left to our future work.
Emotions from text: machine learning for text-based emotion prediction. In Proceedings of the conference on human language technology and empirical methods in natural language processing, 579–586. Association for Computational Linguistics.
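GRU cells are defined as $h_t = GRU_*(h_{t-1}, x_t)$, where:
$$z_t = \sigma(W_{*,x}^{z} x_t + W_{*,h}^{z} h_{t-1} + b_*^{z}),$$
$$r_t = \sigma(W_{*,x}^{r} x_t + W_{*,h}^{r} h_{t-1} + b_*^{r}),$$
$$c_t = \tanh(W_{*,x}^{c} x_t + W_{*,h}^{c} (r_t \odot h_{t-1}) + b_*^{c}),$$
$$h_t = (1 - z_t) \odot c_t + z_t \odot h_{t-1}.$$
$x_t$ is the current input, $h_{t-1}$ is the previous GRU output, and $h_t$ is the current GRU output. $W_{*,\{h,x\}}^{\{r,z,c\}}$ and $b_*^{\{r,z,c\}}$ are GRU parameters. $r_t$ and $z_t$ are the reset gate and update gate, respectively. $c_t$ is the candidate output. $\odot$ stands for the Hadamard product.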
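A single GRU step with a reset gate, an update gate, a candidate output, and Hadamard products can be sketched in plain Python as below. This follows one common GRU formulation, in which the update gate interpolates between the previous output and the candidate; that convention, the function names, and the parameter dictionary keys (`Wz_x`, `Wz_h`, `bz`, and so on) are assumptions of this sketch rather than details confirmed by the text.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def gate(Wx, Wh, b, x, h, activation):
    """Apply activation(Wx x + Wh h + b) elementwise."""
    return [activation(a + c + d)
            for a, c, d in zip(matvec(Wx, x), matvec(Wh, h), b)]

def gru_cell(h_prev, x, p):
    """One GRU step; p holds the (hypothetically named) parameters."""
    z = gate(p["Wz_x"], p["Wz_h"], p["bz"], x, h_prev, sigmoid)  # update gate
    r = gate(p["Wr_x"], p["Wr_h"], p["br"], x, h_prev, sigmoid)  # reset gate
    reset_h = [ri * hi for ri, hi in zip(r, h_prev)]             # r ⊙ h_prev
    c = gate(p["Wc_x"], p["Wc_h"], p["bc"], x, reset_h, math.tanh)  # candidate
    # Interpolate between the candidate and the previous output
    return [(1.0 - zi) * ci + zi * hi for zi, ci, hi in zip(z, c, h_prev)]

def zero_params(hidden, inputs):
    """All-zero parameters, handy for checking the arithmetic by hand."""
    p = {}
    for g in ("z", "r", "c"):
        p["W%s_x" % g] = [[0.0] * inputs for _ in range(hidden)]
        p["W%s_h" % g] = [[0.0] * hidden for _ in range(hidden)]
        p["b%s" % g] = [0.0] * hidden
    return p

h_next = gru_cell([1.0, -2.0], [0.3, 0.7, -0.1], zero_params(2, 3))
```

With all-zero parameters both gates equal 0.5 and the candidate is zero, so the output is exactly half of the previous state, which makes the cell easy to sanity-check.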