Knowledge-Enriched Transformer for Emotion Detection in Textual Conversations

09/24/2019 ∙ by Peixiang Zhong, et al. ∙ Nanyang Technological University

Messages in human conversations inherently convey emotions. The task of detecting emotions in textual conversations leads to a wide range of applications such as opinion mining in social networks. However, enabling machines to analyze emotions in conversations is challenging, partly because humans often rely on the context and commonsense knowledge to express emotions. In this paper, we address these challenges by proposing a Knowledge-Enriched Transformer (KET), where contextual utterances are interpreted using hierarchical self-attention and external commonsense knowledge is dynamically leveraged using a context-aware affective graph attention mechanism. Experiments on multiple textual conversation datasets demonstrate that both context and commonsense knowledge are consistently beneficial to the emotion detection performance. In addition, the experimental results show that our KET model outperforms the state-of-the-art models on most of the tested datasets in F1 score.




1 Introduction

Emotions are “generated states in humans that reflect evaluative judgments of the environment, the self and other social agents” Hudlicka (2011). Messages in human communications inherently convey emotions. With the prevalence of social media platforms such as Facebook Messenger, as well as conversational agents such as Amazon Alexa, there is an emerging need for machines to understand human emotions in natural conversations. This work addresses the task of detecting emotions (e.g., happy, sad, and angry) in textual conversations, where the emotion of an utterance is detected in the conversational context. Being able to effectively detect emotions in conversations leads to a wide range of applications, from opinion mining in social media platforms Chatterjee et al. (2019) to building emotion-aware conversational agents Zhou et al. (2018a).

However, enabling machines to analyze emotions in human conversations is challenging, partly because humans often rely on context and commonsense knowledge to express emotions, both of which are difficult for machines to capture. Figure 1 shows an example conversation demonstrating the importance of context and commonsense knowledge in understanding conversations and detecting implicit emotions.

Figure 1: An example conversation with annotated labels from the DailyDialog dataset Li et al. (2017). By referring to the context, “it” in the third utterance is linked to “birthday” in the first utterance. By leveraging an external knowledge base, the meaning of “friends” in the fourth utterance is enriched by associated knowledge entities, namely “socialize”, “party”, and “movie”. Thus, the implicit “happiness” emotion in the fourth utterance can be inferred more easily via its enriched meaning.

There are several recent studies that model contextual information to detect emotions in conversations. Poria et al. (2017) and Majumder et al. (2019) leveraged recurrent neural networks (RNN) to model the contextual utterances in sequence, where each utterance is represented by a feature vector extracted by convolutional neural networks (CNN) at an earlier stage. Similarly, Hazarika et al. (2018a, b) proposed to use extracted CNN features in memory networks to model contextual utterances. However, these methods require separate feature extraction and tuning, which may not be ideal for real-time applications. In addition, to the best of our knowledge, no attempts have been made in the literature to incorporate commonsense knowledge from external knowledge bases to detect emotions in textual conversations. Commonsense knowledge is fundamental to understanding conversations and generating appropriate responses Zhou et al. (2018b).

To this end, we propose a Knowledge-Enriched Transformer (KET) to effectively incorporate contextual information and external knowledge bases to address the aforementioned challenges. The Transformer Vaswani et al. (2017) has been shown to be a powerful representation learning model in many NLP tasks such as machine translation Vaswani et al. (2017) and language understanding Devlin et al. (2018). The self-attention Cheng et al. (2016) and cross-attention Bahdanau et al. (2014) modules in the Transformer capture the intra-sentence and inter-sentence correlations, respectively. The shorter path of information flow in these two modules compared to gated RNNs and CNNs allows KET to model contextual information more efficiently. In addition, we propose a hierarchical self-attention mechanism allowing KET to model the hierarchical structure of conversations. Our model separates context and response into the encoder and decoder, respectively, which is different from other Transformer-based models, e.g., BERT Devlin et al. (2018), which directly concatenate context and response, and then train language models using only the encoder part.

Moreover, to exploit commonsense knowledge, we leverage external knowledge bases to facilitate the understanding of each word in the utterances by referring to related knowledge entities. The referring process is dynamic and balances between relatedness and affectiveness of the retrieved knowledge entities using a context-aware affective graph attention mechanism.

In summary, our contributions are as follows:

  • For the first time, we apply the Transformer to analyze conversations and detect emotions. Our hierarchical self-attention and cross-attention modules allow our model to exploit contextual information more efficiently than existing gated RNNs and CNNs.

  • We derive dynamic, context-aware, and emotion-related commonsense knowledge from external knowledge bases and emotion lexicons to facilitate the emotion detection in conversations.

  • We conduct extensive experiments demonstrating that both contextual information and commonsense knowledge are beneficial to the emotion detection performance. In addition, our proposed KET model outperforms the state-of-the-art models on most of the tested datasets across different domains.

2 Related Work

Emotion Detection in Conversations: Early studies on emotion detection in conversations focus on call center dialogs using lexicon-based methods and audio features Lee and Narayanan (2005); Devillers and Vidrascu (2006). Devillers et al. (2002) annotated and detected emotions in call center dialogs using unigram topic modelling. In recent years, there is an emerging research trend on emotion detection in conversational videos and multi-turn Tweets using deep learning methods Hazarika et al. (2018b, a); Zahiri and Choi (2018); Chatterjee et al. (2019); Zhong and Miao (2019); Poria et al. (2019). Poria et al. (2017) proposed a long short-term memory network (LSTM) Hochreiter and Schmidhuber (1997) based model to capture contextual information for sentiment analysis in user-generated videos. Majumder et al. (2019) proposed the DialogueRNN model that uses three gated recurrent units (GRU) Cho et al. (2014) to model the speaker, the context from the preceding utterances, and the emotions of the preceding utterances, respectively. They achieved state-of-the-art performance on several conversational video datasets.

Knowledge Base in Conversations: Recently, a growing number of studies have incorporated knowledge bases into generative conversation systems, such as open-domain dialogue systems Han et al. (2015); Asghar et al. (2018); Ghazvininejad et al. (2018); Young et al. (2018); Parthasarathi and Pineau (2018); Liu et al. (2018); Moghe et al. (2018); Dinan et al. (2019); Zhong et al. (2019), task-oriented dialogue systems Madotto et al. (2018); Wu et al. (2019); He et al. (2019) and question answering systems Kiddon et al. (2016); Hao et al. (2017); Sun et al. (2018); Mihaylov and Frank (2018). Zhou et al. (2018b) adopted structured knowledge graphs to enrich the interpretation of input sentences and help generate knowledge-aware responses using graph attentions. The graph attention in the knowledge interpreter Zhou et al. (2018b) is static and only related to the recognized entity of interest. By contrast, our graph attention mechanism is dynamic and selects context-aware knowledge entities that balance relatedness and affectiveness.

Emotion Detection in Text: There is a trend moving from traditional machine learning methods Pang et al. (2002); Wang and Manning (2012); Seyeditabari et al. (2018) to deep learning methods Abdul-Mageed and Ungar (2017); Zhang et al. (2018b) for emotion detection in text. Khanpour and Caragea (2018) investigated emotion detection from health-related posts in online health communities using both deep learning features and lexicon-based features.

Incorporating Knowledge in Sentiment Analysis: Traditional lexicon-based methods detect emotions or sentiments from a piece of text based on the emotions or sentiments of words or phrases that compose it Hu et al. (2009); Taboada et al. (2011); Bandhakavi et al. (2017). Few studies investigated the usage of knowledge bases in deep learning methods. Kumar et al. (2018) proposed to use knowledge from WordNet Fellbaum (2012) to enrich the text representations produced by LSTM and obtained improved performance.

Transformer: The Transformer has been applied to many NLP tasks due to its rich representation and fast computation, e.g., document machine translation Zhang et al. (2018a), response matching in dialogue systems Zhou et al. (2018c), language modelling Dai et al. (2019) and understanding Radford et al. (2018). A very recent work Hajishirzi (2019) extends the Transformer to graph inputs and proposes a model for graph-to-text generation.

Figure 2: Overall architecture of our proposed KET model. The positional encoding, residual connection, and layer normalization are omitted in the illustration for brevity.

3 Our Proposed KET Model

In this section we present the task definition and our proposed KET model.

3.1 Task Definition

Let $\{X_j^i, Y_j^i\}$, $i = 1, \dots, N$, $j = 1, \dots, N_i$, be a collection of {utterance, label} pairs in a given dialogue dataset, where $N$ denotes the number of conversations and $N_i$ denotes the number of utterances in the $i$th conversation. The objective of the task is to maximize the following function:

$$\Phi = \prod_{i=1}^{N} \prod_{j=1}^{N_i} p(Y_j^i \mid X_j^i, X_{j-1}^i, \dots, X_1^i; \theta) \tag{1}$$

where $X_{j-1}^i, \dots, X_1^i$ denote contextual utterances and $\theta$ denotes the model parameters we want to optimize.

We limit the number of contextual utterances to $M$. Discarding early contextual utterances may cause information loss, but this loss is negligible because they contribute the least amount of information Su et al. (2018). This phenomenon can be further observed in our model analysis regarding context length (see Section 5.2). Similar to Poria et al. (2017), we clip and pad each utterance $X_j^i$ to a fixed number $m$ of tokens. The overall architecture of our KET model is illustrated in Figure 2.
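The context limit and the clip-and-pad step can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the padding token `<pad>`, right-side padding, clipping from the end of the utterance, and the function names are all assumptions made for the example.

```python
from typing import List, Tuple

PAD = "<pad>"  # hypothetical padding token, not taken from the paper


def clip_and_pad(tokens: List[str], m: int, pad: str = PAD) -> List[str]:
    """Keep at most the first m tokens, then right-pad to exactly m tokens."""
    clipped = tokens[:m]
    return clipped + [pad] * (m - len(clipped))


def build_example(
    conversation: List[List[str]], j: int, M: int, m: int
) -> Tuple[List[List[str]], List[str]]:
    """Return (context, response) for the utterance at 0-based index j.

    The context holds at most the M utterances that immediately precede j;
    earlier utterances are discarded.
    """
    context = conversation[max(0, j - M):j]
    response = conversation[j]
    return [clip_and_pad(u, m) for u in context], clip_and_pad(response, m)


conversation = [["a"], ["b", "c"], ["d", "e", "f", "g"], ["h"]]
context, response = build_example(conversation, j=3, M=2, m=3)
print(context)   # two context utterances, each with exactly 3 tokens
print(response)  # the response utterance, padded to 3 tokens
```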

3.2 Knowledge Retrieval

We use a commonsense knowledge base ConceptNet Speer et al. (2017) and an emotion lexicon NRC_VAD Mohammad (2018b) as knowledge sources in our model.

ConceptNet is a large-scale multilingual semantic graph that describes general human knowledge in natural language. The nodes in ConceptNet are concepts and the edges are relations. Each (concept1, relation, concept2) triplet is an assertion, and each assertion is associated with a confidence score. An example assertion is (friends, CausesDesire, socialize), with a confidence score of 3.46. Assertion confidence scores are usually in the $[1, 10]$ interval. Currently, for English, ConceptNet comprises 5.9M assertions, 3.1M concepts and 38 relations.

NRC_VAD is a list of English words and their VAD scores, i.e., valence (negative-positive), arousal (calm-excited), and dominance (submissive-dominant) scores in the $[0, 1]$ interval. The VAD measure of emotion is culture-independent and widely adopted in Psychology Mehrabian (1996). Currently NRC_VAD comprises around 20K words.

In general, for each non-stopword token $t$ in $X_j^i$, we retrieve a connected knowledge graph $g(t)$ comprising its immediate neighbors from ConceptNet. For each $g(t)$, we remove concepts that are stopwords or not in our vocabulary. We further remove concepts with confidence scores less than 1 to reduce annotation noise. For each concept, we retrieve its VAD values from NRC_VAD. The final knowledge representation for each token $t$ is a list of tuples: $(c_1, s_1, \mathrm{VAD}(c_1))$, $(c_2, s_2, \mathrm{VAD}(c_2))$, …, $(c_{|g(t)|}, s_{|g(t)|}, \mathrm{VAD}(c_{|g(t)|}))$, where $c_k \in g(t)$ denotes the $k$th connected concept, $s_k$ denotes the associated confidence score, and $\mathrm{VAD}(c_k)$ denotes the VAD values of $c_k$. The treatment of tokens that are not associated with any concept and of concepts that are not included in NRC_VAD is discussed in Section 3.4. We leave the treatment of relations as future work.
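The retrieval and filtering steps described above can be sketched over a toy in-memory graph. The data structures, the function name, and every number except the (friends, socialize, 3.46) assertion are hypothetical; a real pipeline would read from ConceptNet and NRC_VAD rather than from hand-written dictionaries.

```python
from typing import Dict, List, Optional, Set, Tuple

Vad = Tuple[float, float, float]  # (valence, arousal, dominance)
Neighbors = Dict[str, List[Tuple[str, float]]]  # token -> [(concept, score)]


def retrieve_knowledge(
    token: str,
    graph: Neighbors,
    vocab: Set[str],
    stopwords: Set[str],
    vad_lexicon: Dict[str, Vad],
    min_confidence: float = 1.0,
) -> List[Tuple[str, float, Optional[Vad]]]:
    """Return (concept, confidence, VAD) tuples for one token.

    Concepts that are stopwords, out of vocabulary, or below the confidence
    threshold are dropped. VAD is None when the concept is not in the lexicon.
    """
    if token in stopwords:
        return []
    kept = []
    for concept, score in graph.get(token, []):
        if concept in stopwords or concept not in vocab:
            continue
        if score < min_confidence:
            continue
        kept.append((concept, score, vad_lexicon.get(concept)))
    return kept


# Toy data: only the "socialize" score of 3.46 comes from the text above.
graph = {"friends": [("socialize", 3.46), ("the", 2.0), ("party", 2.0),
                     ("zzz", 5.0), ("movie", 0.5)]}
vocab = {"friends", "socialize", "party", "movie", "the"}
stopwords = {"the"}
vad_lexicon = {"socialize": (0.8, 0.6, 0.6)}  # made-up VAD values

knowledge = retrieve_knowledge("friends", graph, vocab, stopwords, vad_lexicon)
print(knowledge)
```

In this toy run, "the" is dropped as a stopword, "zzz" as out of vocabulary, and "movie" for its low confidence score.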

3.3 Embedding Layer

We use a word embedding layer to convert each token $t$ in $X^i$ into a vector representation $\mathbf{t} \in \mathbb{R}^d$, where $d$ denotes the size of word embedding. To encode positional information, the position encoding Vaswani et al. (2017) is added as follows:

$$\mathbf{t} = \mathrm{Embed}(t) + \mathrm{Pos}(t) \tag{2}$$

Similarly, we use a concept embedding layer to convert each concept $c$ into a vector representation $\mathbf{c} \in \mathbb{R}^d$ but without position encoding.

3.4 Dynamic Context-Aware Affective Graph Attention

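To enrich word embedding with concept representations, we propose a dynamic context-aware affective graph attention mechanism to compute the concept representation for each token. Specifically, the concept representation $\mathbf{c}(t) \in \mathbb{R}^d$ for token $t$ is computed as

$$\mathbf{c}(t) = \sum_{k=1}^{|g(t)|} \alpha_k \mathbf{c}_k \tag{3}$$

where $\mathbf{c}_k \in \mathbb{R}^d$ denotes the concept embedding of $c_k$ and $\alpha_k$ denotes its attention weight. If $|g(t)| = 0$, we set $\mathbf{c}(t)$ to the average of all concept embeddings. The attention $\alpha_k$ in Equation 3 is computed as

$$\alpha_k = \mathrm{softmax}(w_k) \tag{4}$$

where $w_k$ denotes the weight of $c_k$.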

The derivation of $w_k$ is crucial because it regulates the contribution of $\mathbf{c}_k$ towards enriching $\mathbf{t}$. A standard graph attention mechanism Veličković et al. (2018) computes $w_k$ by feeding $\mathbf{t}$ and $\mathbf{c}_k$ into a single-layer feedforward neural network. However, not all related concepts are equally important for detecting emotions given the conversational context. In our model, we make the assumption that important concepts are those that relate to the conversational context and have strong emotion intensity. To this end, we propose a context-aware affective graph attention mechanism by incorporating two factors when computing $w_k$, namely relatedness and affectiveness.

Relatedness: Relatedness measures the strength of the relation between $c_k$ and the conversational context. The relatedness factor in $w_k$ is computed as

$$\mathrm{rel}_k = \text{min-max}(s_k) \cdot \mathrm{abs}(\cos(\mathrm{CR}(X^i), \mathbf{c}_k)) \tag{5}$$

where $s_k$ is the confidence score introduced in Section 3.2, min-max denotes min-max scaling for each token $t$, abs denotes the absolute value function, cos denotes the cosine similarity function, and $\mathrm{CR}(X^i) \in \mathbb{R}^d$ denotes the context representation of the $i$th conversation $X^i$. Here we compute $\mathrm{CR}(X^i)$ as the average of all sentence representations in $X^i$ as follows:

$$\mathrm{CR}(X^i) = \mathrm{avg}(\mathrm{SR}(X_{j-1}^i), \dots, \mathrm{SR}(X_{j-M}^i)) \tag{6}$$

where $\mathrm{SR}(X_j^i) \in \mathbb{R}^d$ denotes the sentence representation of $X_j^i$. We compute $\mathrm{SR}(X_j^i)$ via hierarchical pooling Shen et al. (2018), where $n$-gram ($n \le 3$) representations in $X_j^i$ are first computed by max-pooling and then all $n$-gram representations are averaged. The hierarchical pooling mechanism preserves word order information to a certain degree and has demonstrated superior performance to average pooling or max-pooling on sentiment analysis tasks Shen et al. (2018).
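The pooling described above (max-pool within each n-gram window, then average over windows, then average over utterances) can be sketched in plain Python. Using a single window size n and treating an utterance shorter than n as one window are assumptions of this sketch; the function names are illustrative.

```python
from typing import List

Vector = List[float]


def max_pool(vectors: List[Vector]) -> Vector:
    """Element-wise maximum over a list of equal-length vectors."""
    return [max(column) for column in zip(*vectors)]


def mean_pool(vectors: List[Vector]) -> Vector:
    """Element-wise mean over a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def sentence_representation(token_vectors: List[Vector], n: int = 3) -> Vector:
    """Max-pool each sliding n-gram window, then average the window vectors."""
    if len(token_vectors) < n:
        windows = [token_vectors]  # assumption: short utterance = one window
    else:
        windows = [token_vectors[i:i + n]
                   for i in range(len(token_vectors) - n + 1)]
    return mean_pool([max_pool(window) for window in windows])


def context_representation(utterances: List[List[Vector]], n: int = 3) -> Vector:
    """Average the sentence representations of all contextual utterances."""
    return mean_pool([sentence_representation(u, n) for u in utterances])


utterance = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0], [2.0, 4.0]]
print(sentence_representation(utterance, n=3))
print(context_representation([utterance, [[1.0, 5.0]]], n=3))
```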

Affectiveness: Affectiveness measures the emotion intensity of $c_k$. The affectiveness factor in $w_k$ is computed as

$$\mathrm{aff}_k = \text{min-max}(\lVert [\mathrm{V}(c_k) - 1/2, \mathrm{A}(c_k)/2] \rVert_2) \tag{7}$$

where $\lVert \cdot \rVert_k$ denotes the $\ell_k$ norm, and $\mathrm{V}(c_k) \in [0, 1]$ and $\mathrm{A}(c_k) \in [0, 1]$ denote the valence and arousal values of $\mathrm{VAD}(c_k)$, respectively. Intuitively, $\mathrm{aff}_k$ considers the deviations of valence from neutral and the level of arousal from calm. There is no established method in the literature to compute the emotion intensity based on VAD values, but empirically we found that our method correlates better with an emotion intensity lexicon comprising 6K English words Mohammad (2018a) than other methods such as taking dominance into consideration or taking the $\ell_1$ norm. For a concept $c_k$ not in NRC_VAD, we set $\mathrm{aff}_k$ to the mid value of 0.5.
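A small sketch of the affectiveness factor follows. It assumes the intensity is the Euclidean norm of the pair (valence minus 0.5, arousal divided by 2) with valence and arousal in [0, 1], min-max scaled over the concepts of one token, and a default of 0.5 for concepts missing from the lexicon. Returning 0.5 when all intensities are equal is an extra assumption made here to avoid dividing by zero.

```python
import math
from typing import List, Optional, Tuple

Vad = Tuple[float, float, float]  # (valence, arousal, dominance)


def raw_intensity(valence: float, arousal: float) -> float:
    """Distance of (valence, arousal) from the neutral, calm point."""
    return math.hypot(valence - 0.5, arousal / 2)


def min_max(values: List[float]) -> List[float]:
    """Scale values to [0, 1]; a constant list maps to 0.5 (assumption)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def affectiveness(vads: List[Optional[Vad]], default: float = 0.5) -> List[float]:
    """Affectiveness per concept; None marks a concept without VAD values."""
    known = [raw_intensity(v[0], v[1]) for v in vads if v is not None]
    scaled = iter(min_max(known)) if known else iter(())
    return [next(scaled) if v is not None else default for v in vads]


# One neutral and calm concept, one extreme concept, one missing concept.
scores = affectiveness([(0.5, 0.0, 0.5), (1.0, 1.0, 0.5), None])
print(scores)
```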

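Combining both $\mathrm{rel}_k$ and $\mathrm{aff}_k$, we define the weight $w_k$ as follows:

$$w_k = \lambda_k \cdot \mathrm{rel}_k + (1 - \lambda_k) \cdot \mathrm{aff}_k \tag{8}$$

where $\lambda_k \in [0, 1]$ is a model parameter balancing the impacts of relatedness and affectiveness on computing concept representations. Parameter $\lambda_k$ can be fixed or learned during training. The analysis of $\lambda_k$ is discussed in Section 5.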

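Finally, the concept-enriched word representation $\hat{\mathbf{t}}$ can be obtained via a linear transformation:

$$\hat{\mathbf{t}} = \mathbf{W}[\mathbf{t}; \mathbf{c}(t)] \tag{9}$$

where $[;]$ denotes concatenation and $\mathbf{W} \in \mathbb{R}^{d \times 2d}$ denotes a model parameter. All $m$ tokens in each $X_j^i$ then form a concept-enriched utterance embedding $\hat{X}_j^i \in \mathbb{R}^{m \times d}$.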
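Putting the pieces of this section together, the sketch below computes relatedness, mixes it with affectiveness through a balance parameter, turns the weights into attention with a softmax, and forms the concept-enriched token vector. It uses plain Python lists instead of tensors, a single fixed `lam` shared by all concepts, and hypothetical function names; the fallback vector stands in for the average of all concept embeddings.

```python
import math
from typing import List

Vector = List[float]


def min_max(values: List[float]) -> List[float]:
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5] * len(values)  # assumption for the degenerate case
    return [(v - lo) / (hi - lo) for v in values]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def softmax(weights: List[float]) -> List[float]:
    top = max(weights)
    exps = [math.exp(w - top) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def relatedness(scores: List[float], context: Vector,
                concepts: List[Vector]) -> List[float]:
    """Scaled confidence times absolute cosine similarity to the context."""
    return [s * abs(cosine(context, c))
            for s, c in zip(min_max(scores), concepts)]


def concept_representation(concepts: List[Vector], rel: List[float],
                           aff: List[float], lam: float,
                           fallback: Vector) -> Vector:
    """Attention-weighted sum of concept embeddings for one token."""
    if not concepts:
        return list(fallback)  # token without any retrieved concept
    weights = [lam * r + (1 - lam) * a for r, a in zip(rel, aff)]
    alphas = softmax(weights)
    return [sum(a * c[i] for a, c in zip(alphas, concepts))
            for i in range(len(concepts[0]))]


def enrich(token: Vector, concept: Vector, W: List[Vector]) -> Vector:
    """Linear map of the concatenation [token; concept]; W has d rows, 2d columns."""
    concat = list(token) + list(concept)
    return [sum(w * x for w, x in zip(row, concat)) for row in W]


rel = relatedness([1.0, 3.0], [1.0, 0.0], [[0.0, 1.0], [-3.0, 0.0]])
c_t = concept_representation([[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5],
                             [0.5, 0.5], lam=0.5, fallback=[9.0, 9.0])
t_hat = enrich([1.0, 2.0], [3.0, 4.0], [[1, 0, 0, 0], [0, 0, 0, 1]])
print(rel, c_t, t_hat)
```

Setting `lam` to 1 keeps only relatedness and setting it to 0 keeps only affectiveness, which corresponds to the two extreme columns analyzed in Section 5.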

3.5 Hierarchical Self-Attention

| Dataset | Domain | #Conv. (Train/Val/Test) | #Utter. (Train/Val/Test) | #Classes | Evaluation |
|---|---|---|---|---|---|
| EC | Tweet | 30160/2755/5509 | 90480/8265/16527 | 4 | Micro-F1 |
| DailyDialog | Daily Communication | 11118/1000/1000 | 87170/8069/7740 | 7 | Micro-F1 |
| MELD | TV Show Scripts | 1038/114/280 | 9989/1109/2610 | 7 | Weighted-F1 |
| EmoryNLP | TV Show Scripts | 659/89/79 | 7551/954/984 | 7 | Weighted-F1 |
| IEMOCAP | Emotional Dialogues | 100/20/31 | 4810/1000/1523 | 6 | Weighted-F1 |

Table 1: Dataset descriptions.

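We propose a hierarchical self-attention mechanism to exploit the structural representation of conversations and learn a vector representation for the contextual utterances $X_{j-1}^i, \dots, X_{j-M}^i$. Specifically, the hierarchical self-attention follows two steps: 1) each utterance representation is computed using an utterance-level self-attention layer, and 2) a context representation is computed from learned utterance representations using a context-level self-attention layer.

At step 1, for each utterance $X_n^i$, $n = j-1, \dots, j-M$, its representation $\hat{X}_n^{\prime i} \in \mathbb{R}^{m \times d}$ is learned as follows:

$$\hat{X}_n^{\prime i} = \mathrm{FF}(L'(\mathrm{MH}(L(\hat{X}_n^i), L(\hat{X}_n^i), L(\hat{X}_n^i)))) \tag{10}$$

where $L(\hat{X}_n^i) \in \mathbb{R}^{m \times h \times d_s}$ is linearly transformed from $\hat{X}_n^i$ to form $h$ heads ($d_s = d/h$), $L'$ linearly transforms from $h$ heads back to 1 head, and

$$\mathrm{MH}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_s}}\right)V \tag{11}$$

$$\mathrm{FF}(x) = \max(0, xW_1 + b_1)W_2 + b_2 \tag{12}$$

where $Q$, $K$, and $V$ denote sets of queries, keys and values, respectively, $W_1 \in \mathbb{R}^{d \times p}$, $b_1 \in \mathbb{R}^p$, $W_2 \in \mathbb{R}^{p \times d}$ and $b_2 \in \mathbb{R}^d$ denote model parameters, and $p$ denotes the hidden size of the point-wise feedforward layer (FF) Vaswani et al. (2017). The multi-head self-attention layer (MH) enables our model to jointly attend to information from different representation subspaces Vaswani et al. (2017). The scaling factor $1/\sqrt{d_s}$ is added to ensure that the dot product of two vectors does not get overly large. Similar to Vaswani et al. (2017), both MH and FF layers are followed by residual connection and layer normalization, which are omitted in Equation 10 for brevity.

At step 2, to effectively combine all utterance representations in the context, the context-level self-attention layer is proposed to hierarchically learn the context-level representation $C^i \in \mathbb{R}^{M \times m \times d}$ as follows:

$$C^i = \mathrm{FF}(L'(\mathrm{MH}(L(\hat{X}^i), L(\hat{X}^i), L(\hat{X}^i)))) \tag{13}$$

where $\hat{X}^i$ denotes $[\hat{X}_{j-1}^{\prime i}; \dots; \hat{X}_{j-M}^{\prime i}]$, which is the concatenation of all learned utterance representations in the context.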
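The two-step structure can be illustrated with a stripped-down sketch: scaled dot-product self-attention is applied inside each utterance, the results are concatenated along the token axis, and a second self-attention pass runs over the whole context. To stay short, the sketch uses a single head and leaves out the learned linear projections, the feedforward layer, residual connections, and layer normalization, so it shows only the data flow and is not the KET layers themselves.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    top = max(row)
    exps = [math.exp(x - top) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q: Matrix, K: Matrix, V: Matrix) -> Matrix:
    """Scaled dot-product attention, softmax(Q K^T / sqrt(d)) V, one head."""
    scale = math.sqrt(len(K[0]))
    output = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in K]
        weights = softmax(scores)
        output.append([sum(w * v[c] for w, v in zip(weights, V))
                       for c in range(len(V[0]))])
    return output


def hierarchical_self_attention(utterances: List[Matrix]) -> Matrix:
    """Step 1: attend within each utterance. Step 2: attend across the context."""
    per_utterance = [attention(u, u, u) for u in utterances]
    flat = [token for u in per_utterance for token in u]  # (M*m) x d
    return attention(flat, flat, flat)


# Two contextual utterances (M = 2), two tokens each (m = 2), d = 2.
context = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [0.5, 0.5]]]
C = hierarchical_self_attention(context)
print(len(C), len(C[0]))  # number of context positions, embedding size
```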

3.6 Context-Response Cross-Attention

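Finally, a context-aware concept-enriched response representation $R^i \in \mathbb{R}^{m \times d}$ for conversation $X^i$ is learned by cross-attention Bahdanau et al. (2014), which selectively attends to the concept-enriched context representation as follows:

$$R^i = \mathrm{FF}(L'(\mathrm{MH}(L(\hat{X}_j^{\prime i}), L(C^i), L(C^i)))) \tag{14}$$

where the response utterance representation $\hat{X}_j^{\prime i} \in \mathbb{R}^{m \times d}$ is obtained via the MH layer:

$$\hat{X}_j^{\prime i} = L'(\mathrm{MH}(L(\hat{X}_j^i), L(\hat{X}_j^i), L(\hat{X}_j^i))) \tag{15}$$

The resulting representation $R^i$ is then fed into a max-pooling layer to learn discriminative features among the positions in the response and derive the final representation $O \in \mathbb{R}^d$:

$$O = \mathrm{max\_pool}(R^i) \tag{16}$$

The output probability $P$ is then computed as

$$P = \mathrm{softmax}(O W_3 + b_3) \tag{17}$$

where $W_3 \in \mathbb{R}^{d \times q}$ and $b_3 \in \mathbb{R}^q$ denote model parameters, and $q$ denotes the number of classes. The entire KET model is optimized in an end-to-end manner as defined in Equation 1. Our model is available online.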
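The final pooling and classification step can be sketched as below: a max over the response positions followed by a linear layer and a softmax. The weights `W3` and `b3` here are small hand-written placeholders rather than trained parameters, and the function names are illustrative.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(logits: List[float]) -> List[float]:
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def max_pool(R: Matrix) -> List[float]:
    """Maximum over the positions (rows) of the response representation."""
    return [max(column) for column in zip(*R)]


def classify(R: Matrix, W3: Matrix, b3: List[float]) -> List[float]:
    """Class probabilities from an m x d response matrix.

    W3 has d rows and q columns; b3 has q entries (q = number of classes).
    """
    O = max_pool(R)
    logits = [sum(o * W3[i][c] for i, o in enumerate(O)) + b3[c]
              for c in range(len(b3))]
    return softmax(logits)


R = [[1.0, -1.0], [0.0, 2.0]]   # m = 2 positions, d = 2
W3 = [[1.0, 0.0], [0.0, 1.0]]   # placeholder weights, q = 2 classes
b3 = [0.0, 0.0]
probs = classify(R, W3, b3)
print(max_pool(R), probs)
```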

| Model | EC | DailyDialog | MELD | EmoryNLP | IEMOCAP |
|---|---|---|---|---|---|
| cLSTM | 0.6913 | 0.4990 | 0.4972 | 0.2601 | 0.3484 |
| CNN Kim (2014) | 0.7056 | 0.4934 | 0.5586 | 0.3259 | 0.5218 |
| CNN+cLSTM Poria et al. (2017) | 0.7262 | 0.5024 | 0.5687 | 0.3289 | 0.5587 |
| BERT_BASE Devlin et al. (2018) | 0.6946 | 0.5312 | 0.5621 | 0.3315 | 0.6119 |
| DialogueRNN Majumder et al. (2019) | 0.7405 | 0.5065 | 0.5627 | 0.3170 | 0.6121 |
| KET_SingleSelfAttn (ours) | 0.7285 | 0.5192 | 0.5624 | 0.3251 | 0.5810 |
| KET_StdAttn (ours) | 0.7413 | 0.5254 | 0.5682 | 0.3353 | 0.5861 |
| KET (ours) | 0.7348 | 0.5337 | 0.5818 | 0.3439 | 0.5956 |

Table 2: Performance comparisons on the five test sets. Best values are highlighted in bold.
| Dataset | M | m | d | p | h |
|---|---|---|---|---|---|
| EC | 2 | 30 | 200 | 100 | 4 |
| DailyDialog | 6 | 30 | 300 | 400 | 4 |
| MELD | 6 | 30 | 200 | 100 | 4 |
| EmoryNLP | 6 | 30 | 100 | 200 | 4 |
| IEMOCAP | 6 | 30 | 300 | 400 | 4 |

Table 3: Hyper-parameter settings for KET. $M$: context length. $m$: number of tokens per utterance. $d$: word embedding size. $p$: hidden size in FF layer. $h$: number of heads.

4 Experimental Settings

In this section we present the datasets, evaluation metrics, baselines, our model variants, and other experimental settings.

4.1 Datasets and Evaluations

We evaluate our model on the following five emotion detection datasets of various sizes and domains. The statistics are reported in Table 1.

EC Chatterjee et al. (2019): Three-turn Tweets. The emotion labels include happiness, sadness, anger and other.

DailyDialog Li et al. (2017): Human written daily communications. The emotion labels include neutral and Ekman’s six basic emotions Ekman (1992), namely happiness, surprise, sadness, anger, disgust and fear.

MELD Poria et al. (2018): TV show scripts collected from Friends. The emotion labels are the same as the ones used in DailyDialog.

EmoryNLP Zahiri and Choi (2018): TV show scripts collected from Friends as well. However, its size and annotations are different from MELD. The emotion labels include neutral, sad, mad, scared, powerful, peaceful, and joyful.

IEMOCAP Busso et al. (2008): Emotional dialogues. The emotion labels include neutral, happiness, sadness, anger, frustrated, and excited.

In terms of the evaluation metric, for EC and DailyDialog, we follow Chatterjee et al. (2019) and use the micro-averaged F1 excluding the majority class (neutral), due to their extremely unbalanced labels (the percentage of the majority class in the test set is over 80%). For the remaining, relatively balanced datasets, we follow Majumder et al. (2019) and use the weighted macro-F1.
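As an illustration of the first metric, the sketch below computes a micro-averaged F1 that ignores one excluded class: true positives, false positives, and false negatives are pooled over all remaining classes. This is one common reading of "micro-F1 excluding the majority class" and is not taken from the authors' evaluation script; the labels are toy data.

```python
from typing import List


def micro_f1_excluding(y_true: List[str], y_pred: List[str],
                       excluded: str) -> float:
    """Micro-averaged F1 over all classes except `excluded`."""
    tp = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == truth and pred != excluded:
            tp += 1
        else:
            if pred != excluded and pred != truth:
                fp += 1  # wrongly predicted a counted class
            if truth != excluded and pred != truth:
                fn += 1  # missed a counted class
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


y_true = ["happy", "neutral", "sad", "angry", "neutral"]
y_pred = ["happy", "sad", "sad", "neutral", "neutral"]
score = micro_f1_excluding(y_true, y_pred, excluded="neutral")
print(round(score, 4))
```

In this toy example two predictions are correct on counted classes, one counted class is predicted wrongly, and one is missed, so precision and recall are both 2/3.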

4.2 Baselines and Model Variants

For a comprehensive performance evaluation, we compare our model with the following baselines:

cLSTM: A contextual LSTM model. An utterance-level bidirectional LSTM is used to encode each utterance. A context-level unidirectional LSTM is used to encode the context.

CNN Kim (2014): A single-layer CNN with strong empirical performance. This model is trained on the utterance-level without context.

CNN+cLSTM Poria et al. (2017): A CNN is used to extract utterance features. A cLSTM is then applied to learn context representations.

BERT_BASE Devlin et al. (2018): Base version of the state-of-the-art model for sentiment classification. We treat each utterance with its context as a single document. We limit the document length to the last 100 tokens to allow larger batch size. We do not experiment with the large version of BERT due to memory constraint of our GPU.

DialogueRNN Majumder et al. (2019): The state-of-the-art model for emotion detection in textual conversations. It models both context and speaker information. The CNN features used in DialogueRNN are extracted from the carefully tuned CNN model. For datasets without speaker information, i.e., EC and DailyDialog, we use two speakers only. For MELD and EmoryNLP, which have 260 and 255 speakers, respectively, we additionally experimented with clipping the number of speakers to the most frequent ones (6 main speakers + a universal speaker representing all other speakers) and reported the best results.

KET_SingleSelfAttn: We replace the hierarchical self-attention by a single self-attention layer to learn context representations. Contextual utterances are concatenated together prior to the single self-attention layer.

KET_StdAttn: We replace the dynamic context-aware affective graph attention by the standard graph attention Veličković et al. (2018).

4.3 Other Experimental Settings

We preprocessed all datasets by lower-casing and tokenization using spaCy. We keep all tokens in the vocabulary, except for DailyDialog, where we keep only tokens with a minimum frequency of 2 due to its large vocabulary size. We use the released code for BERT_BASE and DialogueRNN. For each dataset, all models are fine-tuned based on their performance on the validation set.

For our model in all datasets, we use Adam optimization Kingma and Ba (2014) with a batch size of 64 and a learning rate of 0.0001 throughout the training process. We use GloVe embeddings Pennington et al. (2014), obtained from Magnitude Medium, for initialization in the word and concept embedding layers. For the class weights in the cross-entropy loss for each dataset, we set them as the ratio of the class distribution in the validation set to the class distribution in the training set, which alleviates the problem of unbalanced datasets. The detailed hyper-parameter settings for KET are presented in Table 3.
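The class-weighting rule can be made concrete with a short sketch: each class weight is the class's share of the validation labels divided by its share of the training labels. The labels below are toy data, the function names are illustrative, and the sketch assumes every class occurs at least once in the training set.

```python
from collections import Counter
from typing import Dict, List


def class_distribution(labels: List[str], classes: List[str]) -> Dict[str, float]:
    """Fraction of examples belonging to each class."""
    counts = Counter(labels)
    return {c: counts[c] / len(labels) for c in classes}


def class_weights(train_labels: List[str], val_labels: List[str],
                  classes: List[str]) -> Dict[str, float]:
    """Validation share divided by training share, per class."""
    train = class_distribution(train_labels, classes)
    val = class_distribution(val_labels, classes)
    return {c: val[c] / train[c] for c in classes}


train_labels = ["neutral"] * 8 + ["happy"] * 2
val_labels = ["neutral"] * 5 + ["happy"] * 5
weights = class_weights(train_labels, val_labels, ["neutral", "happy"])
print(weights)
```

A class that is rarer in training than in validation receives a weight above 1, so its errors count more in the loss.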

5 Result Analysis

In this section we present model evaluation results, model analysis, and error analysis.

Figure 3: Validation performance by KET. Top: different context length ($M$). Bottom: different sizes of random fractions of ConceptNet.

5.1 Comparison with Baselines

We compare the performance of KET against that of the baseline models on the five afore-introduced datasets. The results are reported in Table 2. Note that our results for CNN, CNN+cLSTM and DialogueRNN on EC, MELD and IEMOCAP are slightly different from the reported results in Majumder et al. (2019); Poria et al. (2019).

cLSTM performs reasonably well on short conversations (i.e., EC and DailyDialog), but the worst on long conversations (i.e., MELD, EmoryNLP and IEMOCAP). One major reason is that learning long dependencies using gated RNNs may not be effective enough, because the gradients must propagate back through a huge number of utterances and tokens in sequence, which easily leads to the vanishing gradient problem Bengio et al. (1994). In contrast, when the utterance-level LSTM in cLSTM is replaced by features extracted by CNN, i.e., the CNN+cLSTM, the model performs significantly better than cLSTM on long conversations, which further validates that modelling long conversations using only RNN models may not be sufficient. BERT_BASE achieves very competitive performance on all datasets except EC due to its strong representational power via bi-directional context modelling using the Transformer. Note that BERT_BASE has considerably more parameters than other baselines and our model (110M for BERT_BASE versus 4M for our model), which can be a disadvantage when deployed to devices with limited computing power and memory. The state-of-the-art DialogueRNN model performs the best overall among all baselines. In particular, DialogueRNN performs better than our model on IEMOCAP, which may be attributed to its detailed speaker information for modelling the emotion dynamics in each speaker as the conversation flows.

It is encouraging to see that our KET model outperforms the baselines on most of the datasets tested. This finding indicates that our model is robust across datasets with varying training sizes, context lengths and domains. Our KET variants KET_SingleSelfAttn and KET_StdAttn perform comparably with the best baselines on all datasets except IEMOCAP. However, both variants perform noticeably worse than KET on all datasets except EC, validating the importance of our proposed hierarchical self-attention and dynamic context-aware affective graph attention mechanism. One observation worth mentioning is that these two variants perform on a par with the KET model on EC. Possible explanations are that 1) hierarchical self-attention may not be critical for modelling short conversations in EC, and 2) the informal linguistic styles of Tweets in EC, e.g., misspelled words and slang, hinder the context representation learning in our graph attention mechanism.

5.2 Model Analysis

We analyze the impact of different settings on the validation performance of KET. All results in this section are averaged over 5 random seeds.

Analysis of context length: We vary the context length $M$ and plot model performance in Figure 3 (top portion). Note that EC has a maximum of only 2 contextual utterances. It is clear that incorporating context into KET improves performance on all datasets. However, adding more context yields diminishing performance gains and even hurts performance on some datasets. This phenomenon has been observed in a prior study Su et al. (2018). One possible explanation is that incorporating long contextual information may introduce additional noise, e.g., polysemes expressing different meanings in different utterances of the same context. More thorough investigation of this diminishing return phenomenon is a worthwhile direction in the future.

Analysis of the size of ConceptNet: We vary the size of ConceptNet by randomly keeping only a fraction of the concepts in ConceptNet when training and evaluating our model. The results are illustrated in Figure 3 (bottom portion). Adding more concepts consistently improves model performance before reaching a plateau, validating the importance of commonsense knowledge in detecting emotions. We may expect the performance of our KET model to improve with the growing size of ConceptNet in the future.

| Dataset | $\lambda_k = 0$ | $\lambda_k = 0.3$ | $\lambda_k = 0.7$ | $\lambda_k = 1$ |
|---|---|---|---|---|
| EC | 0.7345 | 0.7397 | 0.7426 | 0.7363 |
| DailyDialog | 0.5365 | 0.5432 | 0.5451 | 0.5383 |
| MELD | 0.5321 | 0.5395 | 0.5366 | 0.5306 |
| EmoryNLP | 0.3528 | 0.3624 | 0.3571 | 0.3488 |
| IEMOCAP | 0.5344 | 0.5367 | 0.5314 | 0.5251 |

Table 4: Analysis of the relatedness-affectiveness tradeoff on the validation sets. Each column corresponds to a fixed $\lambda_k$ for all concepts (see Equation 8).
| Dataset | KET | -context | -knowledge |
|---|---|---|---|
| EC | 0.7451 | 0.7343 | 0.7359 |
| DailyDialog | 0.5544 | 0.5282 | 0.5402 |
| MELD | 0.5401 | 0.5177 | 0.5248 |
| EmoryNLP | 0.3712 | 0.3564 | 0.3553 |
| IEMOCAP | 0.5389 | 0.4976 | 0.5217 |

Table 5: Ablation study for KET on the validation sets.

Analysis of the relatedness-affectiveness tradeoff: We experiment with different values of $\lambda_k$ (see Equation 8), fixed for all concepts, and report the results in Table 4. It is clear that $\lambda_k$ makes a noticeable impact on the model performance. Discarding relatedness or affectiveness completely causes a significant performance drop on all datasets, with the one exception of IEMOCAP. One possible reason is that conversations in IEMOCAP are emotional dialogues; therefore, the affectiveness factor in our proposed graph attention mechanism can provide more discriminative power.

Ablation Study: We conduct an ablation study to investigate the contribution of context and knowledge, as reported in Table 5. It is clear that both context and knowledge are essential to the strong performance of KET on all datasets. Note that removing context has a greater impact on long conversations than on short conversations, which is expected because more contextual information is lost in long conversations.

5.3 Error Analysis

Despite the strong performance of our model, it still fails to detect certain emotions on certain datasets. We rank the F1 score of each emotion per dataset and investigate the emotions with the worst scores. We found that disgust and fear are generally difficult to detect and differentiate. For example, the F1 score of fear emotion in MELD is as low as 0.0667. One possible cause is that these two emotions are intrinsically similar. The VAD values of both emotions have low valence, high arousal and low dominance Mehrabian (1996). Another cause is the small amount of data available for these two emotions. How to differentiate intrinsically similar emotions and how to effectively detect emotions using limited data are two challenging directions in this field.

6 Conclusion

We present a knowledge-enriched transformer to detect emotions in textual conversations. Our model learns structured conversation representations via hierarchical self-attention and dynamically refers to external, context-aware, and emotion-related knowledge entities from knowledge bases. Experimental analysis demonstrates that both contextual information and commonsense knowledge are beneficial to model performance. The tradeoff between relatedness and affectiveness plays an important role as well. In addition, our model outperforms the state-of-the-art models on most of the tested datasets of varying sizes and domains.

Given that there are similar emotion lexicons to NRC_VAD in other languages and ConceptNet is a multilingual knowledge base, our model can be easily adapted to other languages. In addition, given that NRC_VAD is the only emotion-specific component, our model can be adapted as a generic model for conversation analysis.


Acknowledgments

The authors would like to thank the anonymous reviewers for their valuable comments. This research is supported, in part, by the National Research Foundation, Prime Minister’s Office, Singapore under its AI Singapore Programme (Award Number: AISG-GC-2019-003) and under its NRF Investigatorship Programme (NRFI Award No. NRF-NRFI05-2019-0002). This research is also supported, in part, by the Alibaba-NTU Singapore Joint Research Institute, Nanyang Technological University, Singapore.


References

  • M. Abdul-Mageed and L. Ungar (2017) Emonet: fine-grained emotion detection with gated recurrent neural networks. In ACL, Vol. 1, pp. 718–728. Cited by: §2.
  • N. Asghar, P. Poupart, J. Hoey, X. Jiang, and L. Mou (2018) Affective neural response generation. In ECIR, pp. 154–166. Cited by: §2.
  • D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §1, §3.6.
  • A. Bandhakavi, N. Wiratunga, S. Massie, and D. Padmanabhan (2017) Lexicon generation for emotion detection from text. IEEE Intelligent Systems 32 (1), pp. 102–108. Cited by: §2.
  • Y. Bengio, P. Simard, P. Frasconi, et al. (1994) Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks 5 (2), pp. 157–166. Cited by: §5.1.
  • C. Busso, M. Bulut, C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan (2008) IEMOCAP: interactive emotional dyadic motion capture database. Language Resources and Evaluation 42 (4), pp. 335. Cited by: §4.1.
  • A. Chatterjee, U. Gupta, M. K. Chinnakotla, R. Srikanth, M. Galley, and P. Agrawal (2019) Understanding emotions in text using deep learning and big data. Computers in Human Behavior 93, pp. 309–317. Cited by: §1, §2, §4.1, §4.1.
  • J. Cheng, L. Dong, and M. Lapata (2016) Long short-term memory-networks for machine reading. In EMNLP, pp. 551–561. Cited by: §1.
  • K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using rnn encoder–decoder for statistical machine translation. In EMNLP, pp. 1724–1734. Cited by: §2.
  • Z. Dai, Z. Yang, Y. Yang, W. W. Cohen, J. Carbonell, Q. V. Le, and R. Salakhutdinov (2019) Transformer-xl: attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860. Cited by: §2.
  • L. Devillers, I. Vasilescu, and L. Lamel (2002) Annotation and detection of emotion in a task-oriented human-human dialog corpus. In Proceedings of ISLE Workshop, Cited by: §2.
  • L. Devillers and L. Vidrascu (2006) Real-life emotions detection with lexical and paralinguistic cues on human-human call center dialogs. In Ninth International Conference on Spoken Language Processing, Cited by: §2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1, Table 2, §4.2.
  • E. Dinan, S. Roller, K. Shuster, A. Fan, M. Auli, and J. Weston (2019) Wizard of wikipedia: knowledge-powered conversational agents. In ICLR, Cited by: §2.
  • P. Ekman (1992) An argument for basic emotions. Cognition & emotion 6 (3-4), pp. 169–200. Cited by: §4.1.
  • C. Fellbaum (2012) WordNet. The Encyclopedia of Applied Linguistics. Cited by: §2.
  • M. Ghazvininejad, C. Brockett, M. Chang, B. Dolan, J. Gao, W. Yih, and M. Galley (2018) A knowledge-grounded neural conversation model. In AAAI, Cited by: §2.
  • H. Hajishirzi (2019) Text generation from knowledge graphs with graph transformers. In NAACL, Cited by: §2.
  • S. Han, J. Bang, S. Ryu, and G. G. Lee (2015) Exploiting knowledge base to generate responses for natural language dialog listening agents. In Proceedings of the 16th SIGDIAL, pp. 129–133. Cited by: §2.
  • Y. Hao, Y. Zhang, K. Liu, S. He, Z. Liu, H. Wu, and J. Zhao (2017) An end-to-end model for question answering over knowledge base with cross-attention combining global knowledge. In ACL, pp. 221–231. Cited by: §2.
  • D. Hazarika, S. Poria, R. Mihalcea, E. Cambria, and R. Zimmermann (2018a) ICON: interactive conversational memory network for multimodal emotion detection. In EMNLP, pp. 2594–2604. Cited by: §1, §2.
  • D. Hazarika, S. Poria, A. Zadeh, E. Cambria, L. Morency, and R. Zimmermann (2018b) Conversational memory network for emotion recognition in dyadic dialogue videos. In NAACL, Vol. 1, pp. 2122–2132. Cited by: §1, §2.
  • J. He, B. Wang, M. Fu, T. Yang, and X. Zhao (2019) Hierarchical attention and knowledge matching networks with information enhancement for end-to-end task-oriented dialog systems. IEEE Access 7, pp. 18871–18883. Cited by: §2.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780. Cited by: §2.
  • Y. Hu, X. Chen, and D. Yang (2009) Lyric-based song emotion detection with affective lexicon and fuzzy clustering method. In ISMIR, pp. 123–128. Cited by: §2.
  • E. Hudlicka (2011) Guidelines for designing computational models of emotions. International Journal of Synthetic Emotions 2 (1), pp. 26–79. Cited by: §1.
  • H. Khanpour and C. Caragea (2018) Fine-grained emotion detection in health-related online posts. In EMNLP, pp. 1160–1166. Cited by: §2.
  • C. Kiddon, L. Zettlemoyer, and Y. Choi (2016) Globally coherent text generation with neural checklist models. In EMNLP, pp. 329–339. Cited by: §2.
  • Y. Kim (2014) Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882. Cited by: Table 2, §4.2.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.3.
  • A. Kumar, D. Kawahara, and S. Kurohashi (2018) Knowledge-enriched two-layered attention network for sentiment analysis. In NAACL, Vol. 2, pp. 253–258. Cited by: §2.
  • C. M. Lee and S. S. Narayanan (2005) Toward detecting emotions in spoken dialogs. IEEE Transactions on Speech and Audio Processing 13 (2), pp. 293–303. Cited by: §2.
  • Y. Li, H. Su, X. Shen, W. Li, Z. Cao, and S. Niu (2017) DailyDialog: a manually labelled multi-turn dialogue dataset. In IJCNLP, Vol. 1, pp. 986–995. Cited by: Figure 1, §4.1.
  • S. Liu, H. Chen, Z. Ren, Y. Feng, Q. Liu, and D. Yin (2018) Knowledge diffusion for neural dialogue generation. In ACL, pp. 1489–1498. Cited by: §2.
  • A. Madotto, C. Wu, and P. Fung (2018) Mem2Seq: effectively incorporating knowledge bases into end-to-end task-oriented dialog systems. In ACL, Vol. 1, pp. 1468–1478. Cited by: §2.
  • N. Majumder, S. Poria, D. Hazarika, R. Mihalcea, A. Gelbukh, and E. Cambria (2019) Dialoguernn: an attentive rnn for emotion detection in conversations. In AAAI, Cited by: §1, §2, Table 2, §4.1, §4.2, §5.1.
  • A. Mehrabian (1996) Pleasure-arousal-dominance: a general framework for describing and measuring individual differences in temperament. Current Psychology 14 (4), pp. 261–292. Cited by: §3.2, §5.3.
  • T. Mihaylov and A. Frank (2018) Knowledgeable reader: enhancing cloze-style reading comprehension with external commonsense knowledge. In ACL, pp. 821–832. Cited by: §2.
  • N. Moghe, S. Arora, S. Banerjee, and M. M. Khapra (2018) Towards exploiting background knowledge for building conversation systems. In EMNLP, pp. 2322–2332. Cited by: §2.
  • S. M. Mohammad (2018a) Word affect intensities. In LREC, Cited by: §3.4.
  • S. Mohammad (2018b) Obtaining reliable human ratings of valence, arousal, and dominance for 20,000 english words. In ACL, pp. 174–184. Cited by: §3.2.
  • B. Pang, L. Lee, and S. Vaithyanathan (2002) Thumbs up?: sentiment classification using machine learning techniques. In EMNLP, pp. 79–86. Cited by: §2.
  • P. Parthasarathi and J. Pineau (2018) Extending neural generative conversational model using external knowledge sources. In EMNLP, pp. 690–695. Cited by: §2.
  • J. Pennington, R. Socher, and C. D. Manning (2014) GloVe: global vectors for word representation. In EMNLP, pp. 1532–1543. Cited by: §4.3.
  • S. Poria, E. Cambria, D. Hazarika, N. Majumder, A. Zadeh, and L. Morency (2017) Context-dependent sentiment analysis in user-generated videos. In ACL, Vol. 1, pp. 873–883. Cited by: §1, §2, §3.1, Table 2, §4.2.
  • S. Poria, D. Hazarika, N. Majumder, G. Naik, E. Cambria, and R. Mihalcea (2018) MELD: a multimodal multi-party dataset for emotion recognition in conversations. arXiv preprint arXiv:1810.02508. Cited by: §4.1.
  • S. Poria, N. Majumder, R. Mihalcea, and E. Hovy (2019) Emotion recognition in conversation: research challenges, datasets, and recent advances. arXiv preprint arXiv:1905.02947. Cited by: §2, §5.1.
  • A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training. Cited by: §2.
  • A. Seyeditabari, N. Tabari, and W. Zadrozny (2018) Emotion detection in text: a review. arXiv preprint arXiv:1806.00674. Cited by: §2.
  • D. Shen, G. Wang, W. Wang, M. R. Min, Q. Su, Y. Zhang, C. Li, R. Henao, and L. Carin (2018) Baseline needs more love: on simple word-embedding-based models and associated pooling mechanisms. In ACL, pp. 440–450. Cited by: §3.4.
  • R. Speer, J. Chin, and C. Havasi (2017) Conceptnet 5.5: an open multilingual graph of general knowledge. In AAAI, Cited by: §3.2.
  • S. Su, P. Yuan, and Y. Chen (2018) How time matters: learning time-decay attention for contextual spoken language understanding in dialogues. In NAACL, Vol. 1, pp. 2133–2142. Cited by: §3.1, §5.2.
  • H. Sun, B. Dhingra, M. Zaheer, K. Mazaitis, R. Salakhutdinov, and W. Cohen (2018) Open domain question answering using early fusion of knowledge bases and text. In EMNLP, pp. 4231–4242. Cited by: §2.
  • M. Taboada, J. Brooke, M. Tofiloski, K. Voll, and M. Stede (2011) Lexicon-based methods for sentiment analysis. Computational Linguistics 37 (2), pp. 267–307. Cited by: §2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NIPS, pp. 5998–6008. Cited by: §1, §3.3, §3.5.
  • S. Wang and C. D. Manning (2012) Baselines and bigrams: simple, good sentiment and topic classification. In ACL, pp. 90–94. Cited by: §2.
  • C. Wu, R. Socher, and C. Xiong (2019) Global-to-local memory pointer networks for task-oriented dialogue. In ICLR, Cited by: §2.
  • T. Young, E. Cambria, I. Chaturvedi, H. Zhou, S. Biswas, and M. Huang (2018) Augmenting end-to-end dialogue systems with commonsense knowledge. In AAAI, Cited by: §2.
  • S. M. Zahiri and J. D. Choi (2018) Emotion detection on tv show transcripts with sequence-based convolutional neural networks. In Workshops at AAAI, Cited by: §2, §4.1.
  • J. Zhang, H. Luan, M. Sun, F. Zhai, J. Xu, M. Zhang, and Y. Liu (2018a) Improving the transformer translation model with document-level context. In EMNLP, pp. 533–542. Cited by: §2.
  • Y. Zhang, J. Fu, D. She, Y. Zhang, S. Wang, and J. Yang (2018b) Text emotion distribution learning via multi-task convolutional neural network. In IJCAI, pp. 4595–4601. Cited by: §2.
  • P. Zhong and C. Miao (2019) Ntuer at SemEval-2019 task 3: emotion classification with word and sentence representations in RCNN. In SemEval, pp. 282–286. Cited by: §2.
  • P. Zhong, D. Wang, and C. Miao (2019) An affect-rich neural conversational model with biased attention and weighted cross-entropy loss. In AAAI, pp. 7492–7500. Cited by: §2.
  • H. Zhou, M. Huang, T. Zhang, X. Zhu, and B. Liu (2018a) Emotional chatting machine: emotional conversation generation with internal and external memory. In AAAI, Cited by: §1.
  • H. Zhou, T. Young, M. Huang, H. Zhao, J. Xu, and X. Zhu (2018b) Commonsense knowledge aware conversation generation with graph attention. In IJCAI, pp. 4623–4629. Cited by: §1, §2.
  • X. Zhou, L. Li, D. Dong, Y. Liu, Y. Chen, W. X. Zhao, D. Yu, and H. Wu (2018c) Multi-turn response selection for chatbots with deep attention matching network. In ACL, Vol. 1, pp. 1118–1127. Cited by: §2.