BiERU: Bidirectional Emotional Recurrent Unit for Conversational Sentiment Analysis

05/31/2020 ∙ by Wei Li, et al. ∙ Aalto University, Nanyang Technological University

Sentiment analysis in conversations has gained increasing attention in recent years for the growing amount of applications it can serve, e.g., sentiment analysis, recommender systems, and human-robot interaction. The main difference between conversational sentiment analysis and single sentence sentiment analysis is the existence of context information which may influence the sentiment of an utterance in a dialogue. How to effectively encode contextual information in dialogues, however, remains a challenge. Existing approaches employ complicated deep learning structures to distinguish different parties in a conversation and then model the context information. In this paper, we propose a fast, compact and parameter-efficient party-ignorant framework named bidirectional emotional recurrent unit for conversational sentiment analysis. In our system, a generalized neural tensor block followed by a two-channel classifier is designed to perform context compositionality and sentiment classification, respectively. Extensive experiments on three standard datasets demonstrate that our model outperforms the state of the art in most cases.




I Introduction

Sentiment analysis is of vital importance in dialogue systems and has recently gained increasing attention [1]. It can be applied to many scenarios, such as mining the opinions of speakers in conversations and improving the feedback of robot agents. Moreover, sentiment analysis in live conversations can be used to generate talks with certain sentiments to improve human-machine interaction. Existing approaches to conversational sentiment analysis can be divided into party-dependent approaches, like DialogueRNN [2], and party-ignorant approaches, such as AGHMN [3]. Party-dependent methods distinguish different parties in a conversation while party-ignorant methods do not. In this paper, we propose a fast, compact and parameter-efficient party-ignorant framework based on the emotional recurrent unit (ERU), a recurrent neural network that contains a generalized neural tensor block (GNTB) and an emotion feature extractor (EFE), to tackle conversational sentiment analysis.

Context information is the main difference between dialogue sentiment analysis and single-sentence sentiment analysis. It sometimes enhances, weakens, or reverses the raw sentiment of an utterance (Fig. 1). There are three main steps for sentiment analysis in a conversation: obtaining the context information; capturing the influence of the context information on an utterance; and extracting emotional features for classification. Existing dialogue sentiment analysis methods like c-LSTM [4], CMN [5], DialogueRNN [2], and DialogueGCN [6] make use of complicated deep neural network structures to capture context information and describe its influence on an utterance.

We redefine the formulation of conversational sentiment analysis and provide a compact structure to better encode the context information, capture its influence on an utterance, and extract emotional features. To this end, we design GNTB to perform context compositionality, which obtains context information and incorporates the context into the utterance representation simultaneously, and then employ EFE to extract emotional features. In this way, we convert the previous three-step task into a two-step task. Meanwhile, the compact structure reduces computational cost. To the best of our knowledge, our proposed model is the first to perform context compositionality in conversational sentiment analysis.

The GNTB takes the context and current utterance as inputs, and is capable of modeling conversations with arbitrary turns. It outputs a new representation of current utterance with context information incorporated (named as ‘contextual utterance vector’ in this paper). Then, the contextual utterance vector is further fed into EFE to extract emotional features. Here, we employ a simple two-channel model for emotion feature extraction.

The long short-term memory (LSTM) unit [7] and the one-dimensional convolutional neural network (CNN) [8] are utilized for extracting features from the contextual utterance vector. Extensive experiments on three standard datasets demonstrate that our model outperforms state-of-the-art methods with fewer parameters. To summarize, the main contributions of this paper are as follows:

  • We propose a fast, compact and parameter-efficient party-ignorant framework based on an emotional recurrent unit.

  • We design a generalized neural tensor block, which is suitable for different structures, to perform context compositionality.

  • Experiments on three standard benchmarks indicate that our model outperforms the state of the art with fewer parameters.

The remainder of the paper is organized as follows: related work is introduced in Section II; the mechanism of our model is explained in Section III; results of the experiments are discussed in Section IV; finally, concluding remarks are provided in Section V.

Fig. 1: Illustration of a dialogue system and the interaction between talkers.

II Related Work

Sentiment analysis is one of the key NLP tasks that has drawn great attention from the research community [9]. It is helpful for understanding people’s intention [10] and proactive social care for mental health issues such as depression [11] and suicidal intention [12]. Previous studies extensively explored sentiment classification on general text [13] and various specific domains such as tourism [14], finance [15], and marketing [16]. Bandhakavi et al. [17] proposed a lexicon-based method to enhance feature extraction. SenticNet 5 [18] learns conceptual primitives for sentiment analysis by coupling symbolic and subsymbolic AI. Akhtar et al. [19] predicted intensities of emotions and sentiments using a stacked ensemble.

Recently, sentiment analysis in dialogues has become a new trend. Poria et al. [4] proposed a context-dependent LSTM network to capture contextual information for identifying sentiment over video sequences, and Ragheb et al. [20] utilized self-attention to prioritize important utterances. Memory networks [21], which introduce an external memory module, were applied to modeling historical utterances in conversations. For example, CMN [5] modeled dialogue histories into memory cells, ICON [22] proposed global memories for bridging self- and inter-speaker emotional influences, and AGHMN [3] proposed a hierarchical memory network as an utterance reader. Recent advances in deep learning were also introduced to conversational sentiment analysis, such as attentive RNNs [2], adversarial training [23], and graph convolutional networks [6].

Another related work is the Neural Tensor Network (NTN) [24], which was first proposed for reasoning over relational data. It was further extended to capture semantic compositionality for sentiment analysis [25]. The authors proposed a tensor-based composition function to learn sentence representations recursively. It solves the issue of words functioning as operators that change the meaning of another word. NTN has also been used for modeling relationships across multiple tasks. Majumder et al. [26] applied NTN for inter-task fusion between sentiment and sarcasm classification.

III Method

III-A Problem Definition

Given a multi-turn conversation $[u_1, u_2, \ldots, u_n]$, the task is to predict the sentiment labels or sentiment intensities of the constituent utterances $u_t$. Taking the interactive emotional database IEMOCAP [27] as an example, emotion labels include frustrated, excited, angry, neutral, sad and happy.

In general, the task is formulated as a multi-class classification problem over sequential utterances; in some scenarios, it is instead regarded as a regression problem given continuous sentiment intensity. In this paper, utterances are pre-processed and represented as feature vectors using the extractor described below.

III-B Textual Feature Extraction

Following the tradition of DialogueRNN [2], utterances are first embedded into vector space and then fed into CNNs [8] for feature extraction. N-gram features are obtained from each utterance by applying three different convolution filters of sizes 3, 4 and 5, respectively. Each filter has 50 feature maps. We then use max-pooling followed by rectified linear unit (ReLU) activation [28] to process the outputs of the convolution operation.

These activation values are concatenated and fed to a 100 dimensional fully connected layer whose outputs serve as the textual utterance representation. This CNN-based feature extraction network is trained at utterance level supervised by the sentiment labels.
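As an illustration, the convolution-and-pooling step above can be sketched in plain Python; the single all-ones filter per width and the toy 2-dimensional embeddings are our own illustrative stand-ins for the paper's 50 learned feature maps per width and 100-dimensional setup:

```python
def conv_pool_relu(embeddings, filt):
    """Slide one filter over n-gram windows, max-pool over time, then ReLU."""
    width = len(filt)                # filter width in tokens (3, 4 or 5 in the paper)
    dim = len(embeddings[0])         # embedding dimension
    scores = [
        sum(filt[i][d] * embeddings[s + i][d]
            for i in range(width) for d in range(dim))
        for s in range(len(embeddings) - width + 1)
    ]
    return max(0.0, max(scores))     # max-pooling followed by ReLU

# toy utterance: 6 tokens with 2-dim embeddings (illustrative values)
utt = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5], [0.4, 0.1], [-0.2, 0.3], [0.2, 0.0]]
# one all-ones filter per n-gram width; the paper uses 50 feature maps per width
filters = [[[1.0, 1.0]] * w for w in (3, 4, 5)]
features = [conv_pool_relu(utt, f) for f in filters]
```

In the full model these pooled values are concatenated across all feature maps and passed through the 100-dimensional fully connected layer described above.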

III-C Our Model

Fig. 2: (a) Architecture of BiERU with global context. (b) Architecture of BiERU with local context. The forward and backward passes each apply GNTB followed by EFE to produce forward and backward contextual utterance vectors, whose features are combined into the predicted probability vector of sentiment labels. A, T, V are audio, textual, and visual modalities, respectively. In our model, we only focus on the textual modality. The detailed structures of GNTB and EFE are shown in Fig. 3.

Our ERU, illustrated in Note 1 of Fig. 2, consists of two components: GNTB and EFE. As mentioned in the introduction, there are three main steps for conversational sentiment analysis, namely obtaining the context representation; incorporating the influence of the context information into an utterance; and extracting emotional features for classification. In this paper, the ERU is employed in a bidirectional manner (BiERU) to conduct the above sentiment analysis task, reducing some expensive computations and converting the previous three-step task into a two-step task as shown in Fig. 2.

Similar to bidirectional LSTM (BiLSTM) [29], two ERUs are utilized for forward and backward passing the input utterances. Outputs from the forward and backward ERUs are concatenated for sentiment classification or regression. More concretely, the GNTB is applied to encoding the context information and incorporating it into an utterance simultaneously; while EFE takes the output of GNTB as input and is used to obtain emotional features for classification or regression.

III-C1 Generalized Neural Tensor Block

The utterance vector with the context information incorporated is named the contextual utterance vector $p_t \in \mathbb{R}^d$ in this paper, where $d$ is the dimension of $p_t$ and of the utterance vector $u_t$. At time $t$, GNTB (Fig. 3: (a)) takes $p_{t-1}$ and $u_t$ as inputs and then outputs $p_t$, a contextual utterance vector. In this process, GNTB first extracts the context information from $p_{t-1}$; then it incorporates the context information into $u_t$; finally the contextual utterance vector $p_t$ is obtained. The first step captures the context information and the second step integrates the context information into the current utterance. The combination of these two steps is regarded as context compositionality in this paper. To the best of our knowledge, this is the first work to perform context compositionality in conversational sentiment analysis. GNTB is the core part that achieves the context compositionality. The formulation of GNTB is described below:

$$p_t = f\left(v_t^{\top} T^{[1:m]} v_t + W v_t\right), \quad v_t = [p_{t-1}; u_t], \qquad (1)$$

where $v_t \in \mathbb{R}^{2d}$ is the concatenation of $p_{t-1}$ and $u_t$; $f$ is an activation function, such as $\tanh$, sigmoid, ReLU, and so on; the tensor $T^{[1:m]} \in \mathbb{R}^{2d \times 2d \times m}$ and the matrix $W \in \mathbb{R}^{m \times 2d}$ are the parameters used to calculate $p_t$. Each slice $T^{[i]} \in \mathbb{R}^{2d \times 2d}$ can be interpreted as capturing a specific type of context compositionality: it maps the contextual utterance vector and the utterance vector into the context compositionality space. Here we have $m$ different context compositionality types, which constitute an $m$-dimensional context compositionality space. The main advantage over the previous neural tensor network (NTN) [24], which is a special case of the GNTB, is that GNTB is suitable for different structures rather than only the recursive structure, and its space complexity is lower than that of NTN. In order to further reduce the number of parameters, we employ the following low-rank matrix approximation for each slice $T^{[i]}$:

$$T^{[i]} \approx T^{[i]}_{a}\, T^{[i]}_{b}, \qquad (2)$$

where $T^{[i]}_{a} \in \mathbb{R}^{2d \times r}$, $T^{[i]}_{b} \in \mathbb{R}^{r \times 2d}$, and $r \ll 2d$ is the rank.
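To make the tensor composition concrete, here is a minimal plain-Python sketch of the GNTB step with low-rank slices; the toy dimensions ($d = 2$, rank $r = 1$), the constant weights, and the choice of $\tanh$ are our own illustrative assumptions, whereas the real model learns these parameters at a much larger $d$:

```python
import math

def low_rank_slice(A, B):
    """Rebuild one tensor slice T[i] = A @ B from its rank-r factors."""
    r = len(B)
    return [[sum(A[a][k] * B[k][b] for k in range(r)) for b in range(len(B[0]))]
            for a in range(len(A))]

def gntb(p_prev, u, T, W):
    """One GNTB step: p_t = f(v^T T v + W v), with v = [p_{t-1}; u] and f = tanh."""
    v = p_prev + u                       # concatenation, length 2d
    n, d = len(v), len(p_prev)
    out = []
    for i in range(d):                   # one tensor slice per output dimension
        bilinear = sum(v[a] * T[i][a][b] * v[b]
                       for a in range(n) for b in range(n))
        linear = sum(W[i][a] * v[a] for a in range(n))
        out.append(math.tanh(bilinear + linear))
    return out

d = 2
# rank-1 factors for each slice (illustrative constants)
T = [low_rank_slice([[0.2]] * (2 * d), [[0.5] * (2 * d)]) for _ in range(d)]
W = [[0.05] * (2 * d) for _ in range(d)]
p_t = gntb([0.1, -0.2], [0.3, 0.4], T, W)
```

The bilinear term is where context and utterance interact multiplicatively; the linear term $Wv_t$ plays the role of a standard affine recurrence.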

III-C2 Emotion Feature Extractor

We utilize EFE to refine the emotion features from the contextual utterance vector $p_t$. As shown in Fig. 3: (b), the EFE is a two-channel model, including an LSTM cell [7] branch and a one-dimensional CNN [8] branch. The two branches receive the same contextual utterance vector and produce outputs independently.

At time $t$, the LSTM cell takes the hidden state $h_{t-1}$, the cell state $c_{t-1}$ and the contextual utterance vector $p_t$ as inputs, where $h_{t-1}$ and $c_{t-1}$ are obtained from the last time step $t-1$. The outputs of the LSTM cell are the updated hidden state $h_t$ and cell state $c_t$. The hidden state $h_t$ is regarded as an emotion feature vector. The CNN receives $p_t$ as input and outputs the emotion feature vector $e^{c}_t$. Finally, the outputs of the LSTM cell branch and the CNN branch are concatenated into an emotion feature vector $e_t$, which is also the output of ERU. The formulas of EFE are as follows:

$$h_t, c_t = \mathrm{LSTM}(h_{t-1}, c_{t-1}, p_t), \quad e^{c}_t = \mathrm{CNN}(p_t), \quad e_t = [h_t; e^{c}_t]. \qquad (3)$$
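The two-channel computation can be sketched as follows; the single-unit LSTM cell, the width-2 convolution kernel, and the mean-collapse of $p_t$ into a scalar cell input are simplifications for illustration only, not the model's actual dimensions:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_cell(h_prev, c_prev, x, w):
    """Minimal single-unit LSTM cell; w maps gate name -> (w_x, w_h, b)."""
    z = {g: w[g][0] * x + w[g][1] * h_prev + w[g][2] for g in ("i", "f", "o", "g")}
    i, f, o = sigmoid(z["i"]), sigmoid(z["f"]), sigmoid(z["o"])
    g = math.tanh(z["g"])
    c = f * c_prev + i * g           # updated cell state
    h = o * math.tanh(c)             # updated hidden state = emotion feature
    return h, c

def conv1d_feature(p, kernel):
    """One 1-D convolution over the contextual utterance vector, max-pooled."""
    k = len(kernel)
    return max(sum(kernel[j] * p[s + j] for j in range(k))
               for s in range(len(p) - k + 1))

# toy contextual utterance vector p_t and tiny illustrative weights
p_t = [0.2, -0.1, 0.4, 0.3]
w = {g: (0.5, 0.1, 0.0) for g in ("i", "f", "o", "g")}
x = sum(p_t) / len(p_t)              # collapse p_t to a scalar for the 1-unit cell
h, c = lstm_cell(0.0, 0.0, x, w)     # LSTM branch
e_cnn = conv1d_feature(p_t, [1.0, 1.0])   # CNN branch
e_t = [h, e_cnn]                     # concatenation of the two channels
```

The concatenation at the end mirrors equation (3): both branches see the same input but summarize it differently, one recurrently and one locally.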
III-C3 Sentiment Classification & Regression

Taking the emotion feature $e_t$ as input, we use a linear neural network followed by a softmax layer to predict the sentiment labels, where $k$ is the number of sentiment labels. We thus obtain the probability distribution $P_t \in \mathbb{R}^{k}$ over the sentiment labels. Finally, we take the most probable sentiment class as the sentiment label $\hat{y}_t$ of the utterance $u_t$:

$$P_t = \mathrm{softmax}(W_c\, e_t + b_c), \qquad \hat{y}_t = \operatorname*{arg\,max}_{i} P_t[i], \qquad (4)$$

where $W_c$ and $b_c$ are the weight and bias of the linear layer.
For the sentiment regression task, we use a linear neural network to predict the sentiment intensity. We thus obtain the predicted sentiment intensity $\hat{z}_t$:

$$\hat{z}_t = W_r\, e_t + b_r, \qquad (5)$$

where $W_r$ and $b_r$ are the weight and bias of the linear layer, and $\hat{z}_t$ is a scalar, the predicted sentiment intensity for utterance $u_t$.
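A minimal sketch of the classification head, with a toy 2-label weight matrix standing in for the learned parameters of the linear layer:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(e_t, W, b):
    """Linear layer + softmax over sentiment labels; argmax picks the label."""
    scores = [sum(w_i * x for w_i, x in zip(row, e_t)) + b_i
              for row, b_i in zip(W, b)]
    probs = softmax(scores)
    return probs, max(range(len(probs)), key=probs.__getitem__)

e_t = [0.5, -0.2, 0.1]                   # toy emotion feature vector
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # 2 labels x 3 features (illustrative)
b = [0.0, 0.0]
probs, label = classify(e_t, W, b)
```

The regression head is the same linear map with a single output row and no softmax.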

III-C4 Training

For the classification task, we choose cross-entropy as the measure of loss and use L2-regularization to relieve overfitting. The loss function is:

$$\mathcal{L} = -\frac{1}{\sum_{i=1}^{N} c(i)} \sum_{i=1}^{N} \sum_{j=1}^{c(i)} \log P_{i,j}[y_{i,j}] + \lambda \|\theta\|_2^2. \qquad (6)$$

For the regression task, we choose mean square error (MSE) to measure loss, and L2-regularization to relieve overfitting. The loss function is:

$$\mathcal{L} = \frac{1}{\sum_{i=1}^{N} c(i)} \sum_{i=1}^{N} \sum_{j=1}^{c(i)} \left(\hat{z}_{i,j} - z_{i,j}\right)^2 + \lambda \|\theta\|_2^2, \qquad (7)$$

where $N$ is the number of samples/conversations, $P_{i,j}$ is the probability distribution of sentiment labels for utterance $j$ of conversation $i$, $y_{i,j}$ is the expected class label of utterance $j$ of conversation $i$, $\hat{z}_{i,j}$ is the predicted sentiment intensity of utterance $j$ of conversation $i$, $z_{i,j}$ is the expected sentiment intensity of utterance $j$ of conversation $i$, $c(i)$ is the number of utterances in sample $i$, $\lambda$ is the L2-regularization weight, and $\theta$ is the set of trainable parameters. We employ the stochastic-gradient-descent-based Adam [30] optimizer to train our network.
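The two objectives (cross-entropy and MSE, each with L2 regularization) can be sketched directly; the probability distributions, targets, and parameter vector below are toy values:

```python
import math

def loss_classification(prob_dists, gold_labels, theta, lam):
    """Average negative log-likelihood over utterances plus L2 regularization."""
    n_utt = len(gold_labels)
    nll = -sum(math.log(prob_dists[j][gold_labels[j]]) for j in range(n_utt)) / n_utt
    return nll + lam * sum(p * p for p in theta)

def loss_regression(preds, golds, theta, lam):
    """Mean squared error over utterances plus L2 regularization."""
    mse = sum((p - g) ** 2 for p, g in zip(preds, golds)) / len(golds)
    return mse + lam * sum(p * p for p in theta)

probs = [[0.7, 0.3], [0.2, 0.8]]     # predicted distributions for two utterances
golds = [0, 1]                       # gold class labels
theta = [0.1, -0.2]                  # toy parameter vector
lc = loss_classification(probs, golds, theta, 0.01)
lr = loss_regression([0.5, 0.9], [0.4, 1.0], theta, 0.01)
```

In the full model the averages run over all utterances of all conversations and $\theta$ collects every trainable parameter of GNTB, EFE and the output layers.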

III-D Bidirectional Emotional Recurrent Unit Variants

Our model has two different forms according to the source of context information, namely the bidirectional emotional recurrent unit with global context (BiERU-gc) and the bidirectional emotional recurrent unit with local context (BiERU-lc).

Fig. 3: (a) GNTB. (b) EFE. The input of the LSTM and CNN is the contextual utterance vector $p_t$, and the output is the emotion feature $e_t$.

III-D1 BiERU-gc

According to equation (1), GNTB extracts the context information from $p_{t-1}$, integrates the context information into $u_t$, and thus obtains the contextual utterance vector $p_t$. Based on the definition of the contextual utterance vector, $p_{t-1}$ is the utterance vector that contains information of $u_{t-1}$ and $p_{t-2}$. In this case, the contextual utterance vector $p_t$ holds the context information from all the preceding utterances in a recurrent manner. As shown in Fig. 2: (a), bidirectional ERUs enable $p_t$ to capture not only the context information from preceding utterances, but also the context information from future utterances. The BiERU in Fig. 2: (a) is named BiERU-gc.

III-D2 BiERU-lc

Following equation (1), GNTB extracts the context information from the contextual utterance vector $p_{t-1}$, and $p_{t-1}$ contains the context information of all the preceding utterances as mentioned above. If we replace $p_{t-1}$ with $u_{t-1}$ in equations (1) and (2), $p_t$ contains the information of $u_{t-1}$ and $u_t$. In other words, $u_{t-1}$ is not only an utterance vector but also works as the context of $u_t$. As shown in Fig. 2: (b), the bidirectional ERU makes $p_t$ obtain the future information from $u_{t+1}$. In this case, GNTB extracts the context information from $u_{t-1}$ and $u_{t+1}$, which are the adjacent utterances of $u_t$. We name this model BiERU-lc.
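The difference between the two variants boils down to what is fed back into GNTB at each step. A toy sketch, with an element-wise average standing in for the learned GNTB, shows how global context lets information from the first utterance persist while local context forgets it after one step:

```python
def bieru_pass(utterances, gntb, mode="gc"):
    """One forward ERU pass. 'gc' feeds the previous contextual vector back into
    GNTB (global context); 'lc' feeds the previous raw utterance instead
    (local context)."""
    outputs, prev = [], utterances[0]     # initialize context with the first utterance
    for u in utterances:
        p = gntb(prev, u)                 # context compositionality step
        outputs.append(p)
        prev = p if mode == "gc" else u   # global vs. local context
    return outputs

# stand-in GNTB: element-wise average of context and utterance (illustrative)
avg = lambda a, b: [(x + y) / 2 for x, y in zip(a, b)]
utts = [[1.0], [0.0], [0.0]]
p_gc = bieru_pass(utts, avg, mode="gc")   # distant info decays but persists
p_lc = bieru_pass(utts, avg, mode="lc")   # only the neighboring utterance matters
```

With global context the first utterance still influences the third output; with local context that influence vanishes after one step, which is exactly the redundancy trade-off discussed in Section IV-C2.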

IV Experiments

In this section, we conduct a series of comparative experiments to evaluate the performance of our proposed model (code will be made available on our GitHub) and perform a thorough analysis.

IV-A Datasets

We use three datasets for our experiments, i.e., AVEC [31], IEMOCAP [27] and MELD [32], which are also used by representative models such as DialogueRNN [2] and DialogueGCN [6]. We adopt the standard data partitions (details in Table I).

Dataset Partition  Utterance Count  Dialogue Count
IEMOCAP train + val 5810 120
test 1623 31
AVEC train + val 4368 63
test 1430 32
MELD train + val 11098 1153
test 2610 280
TABLE I: Statistical information and data partition of datasets used in this paper.

Originally, these three datasets are multimodal datasets. Here, we focus on the task of textual conversational sentiment analysis, and only use the textual modality to conduct our experiments.


The IEMOCAP [27] is a dataset of two-way conversations involving ten distinct participants. It is recorded as videos where every video clip contains a single dyadic dialogue, and each dialogue is further segmented into utterances. Each utterance is labeled with one of six sentiment labels, i.e., happy, sad, neutral, angry, excited and frustrated. The dataset includes three modalities: audio, textual and visual. Here we only use the textual modality in our experiments.


The AVEC dataset [31] is a modified version of the SEMAINE database [33] that contains interactions between human speakers and robots. Unlike IEMOCAP, each utterance in the AVEC dataset is annotated every 0.2 seconds with one of four real-valued attributes, i.e., valence, arousal, expectancy, and power. Our experiments use the processed utterance-level annotations [2], and treat the four affective attributes as four subsets for evaluation.


The MELD [32] is a multimodal and multiparty sentiment analysis/classification database. It contains textual, acoustic, and visual information for more than 13000 utterances from the Friends TV series. The sentiment label of each utterance in a dialogue is one of the following seven sentiment classes: fear, neutral, anger, surprise, sadness, joy and disgust.

IV-B Baselines and Settings

To evaluate the performance of our model, we choose the following models as strong baselines, including the state-of-the-art methods.

c-LSTM [4]

The c-LSTM uses bidirectional LSTM [7] to learn contextual representation from the surrounding utterances. When combined with the attention mechanism, it becomes the c-LSTM+Att.

CMN [5]

This model utilizes a memory network and two distinct GRUs [34], one per speaker, to learn representations of the utterance context from the dialogue history.

DialogueRNN [2]

It distinguishes different parties in a conversation interactively, with three GRUs representing the speaker states, context, and emotion. It has several variants, including DialogueRNN+Att with an attention mechanism and the bidirectional BiDialogueRNN.

DialogueGCN [6]

This model employs a graph-neural-network-based approach, through which the context propagation issue can be addressed, to detect sentiment in conversations.

AGHMN [3]

It utilizes hierarchical memory networks with BiGRUs for utterance reading and fusion, and an attention mechanism for memory summarizing.


All the experiments are performed using CNN-extracted features as described in the Method section. For fair comparison with the state-of-the-art DialogueRNN model, we use their utterance representations directly (the extracted features of the two datasets are available online).

To alleviate over-fitting, we employ Dropout [35] over the outputs of GNTB and EFE. For the nonlinear activation function, we choose the sigmoid function for sentiment classification and the ReLU function for sentiment regression. Our model is optimized by an Adam optimizer [30]. Hyper-parameters are tuned manually. Batch size is set to 1. The rank $r$ is fixed across all the experiments. Our model is implemented using PyTorch.


Methods | Happy | Sad | Neutral | Angry | Excited | Frustrated | Average (IEMOCAP) | Average (MELD)
        | Acc. F1 | Acc. F1 | Acc. F1 | Acc. F1 | Acc. F1 | Acc. F1 | Acc. F1 | Acc.
c-LSTM 30.56 35.63 56.73 62.90 57.55 53.00 59.41 59.24 52.84 58.85 65.88 59.41 56.32 56.19 57.50
CMN 25.00 30.38 55.92 62.41 52.86 52.39 61.76 59.83 55.52 60.25 71.13 60.69 56.56 56.13 -
DialogueRNN 25.69 33.18 75.10 78.80 58.59 59.21 64.71 65.28 80.27 71.86 61.15 58.91 63.40 62.75 56.10
DialogueGCN 40.62 42.75 89.14 84.54 61.92 63.54 67.53 64.19 65.46 63.08 64.18 66.99 65.25 64.18 -
AGHMN 48.30 52.1 68.30 73.3 61.60 58.4 57.50 61.9 68.10 69.7 67.10 62.3 63.50 63.50 60.30
BiERU-gc 54.84 33.01 80.80 81.62 63.06 61.02 71.43 66.25 62.06 73.00 60.48 60.16 65.37 64.20 60.61
BiERU-lc 54.24 31.53 80.60 84.21 64.67 60.17 67.92 65.65 62.79 74.07 61.93 61.27 66.11 64.65 60.84
TABLE II: Comparison with baselines on IEMOCAP and MELD datasets using textual modality. Average score of accuracy and f1-score are weighted. “-” represents no results reported in original paper.
Methods | Valence (r) | Arousal (r) | Expectancy (r) | Power (r)
c-LSTM 0.16 0.25 0.24 0.10
CMN 0.23 0.29 0.26 -0.02
DialogueRNN 0.35 0.59 0.37 0.37
BiERU-gc 0.30 0.63 0.36 0.36
BiERU-lc 0.36 0.64 0.38 0.37
TABLE III: Comparison with baselines on the AVEC dataset using textual modality. r stands for the Pearson correlation coefficient.

IV-C Results

We compare our model with the baselines on the textual modality using three standard benchmarks. Overall, our model outperforms all the baseline methods, including state-of-the-art models like DialogueRNN, DialogueGCN and AGHMN, on these datasets, and markedly exceeds them on some indicators, as shown in Table II.

For the IEMOCAP dataset, treated as a classification problem, we use the accuracy for each class and the weighted averages of accuracy and f1-score for measuring the overall performance. For the AVEC dataset, the standard metric for regression tasks, the Pearson correlation coefficient (r), is used for evaluation. We use the weighted average of accuracy as the measure of performance on the MELD dataset.

IV-C1 Comparison with the State of the Art

We first compare our proposed BiERU with the state-of-the-art methods DialogueGCN, DialogueRNN and AGHMN on IEMOCAP, AVEC and MELD, respectively.


As shown in Table II, our proposed BiERU-gc model exceeds the best model DialogueGCN by 0.12% and 0.02% in terms of weighted average accuracy and f1-score, respectively, and the BiERU-lc model pushes up the state-of-the-art results by 0.86% and 0.47% for weighted average accuracy and f1-score, respectively. For all 14 indicators on the IEMOCAP dataset, our models are best on 7 indicators and have a more balanced performance over the six classes. In particular, the accuracy on “happy” of our proposed BiERU-gc is higher than the result of DialogueGCN by 14.22%. The experimental results indicate that the BiERU model can effectively capture the contextual information and extract rich emotion features to boost the overall performance and achieve relatively balanced results.


Among these four attributes, our model outperforms DialogueRNN on the “valence”, “arousal” and “expectancy” attributes and matches it on the “power” attribute. The Pearson correlation coefficient of BiERU-gc is 0.04 higher than its counterpart in terms of “arousal” (Table III); the BiERU-lc model is 0.05 higher. For the attributes “expectancy” and “valence”, the BiERU-lc model is 0.01 higher. As for the attribute “power”, although our best model does not outperform the state-of-the-art method, it surpasses most of the other baseline methods, including CMN and c-LSTM. Overall, the BiERU-lc model works well on all the attributes, considering the benchmark performances are already very high.


Three factors make it considerably harder to model sentiment on MELD in comparison with the IEMOCAP and AVEC datasets. First, the average number of turns in a MELD conversation is 10, while it is close to 50 on IEMOCAP. Second, there are more than 5 speakers in most MELD conversations, which means most speakers only utter one or two utterances per conversation. Moreover, sentiment expressions rarely exist in MELD utterances, and the average length of MELD utterances is much shorter than in the IEMOCAP and AVEC datasets. For a party-dependent model like DialogueRNN, it is hard to model the inter-dependency between speakers. We find that the performances of party-ignorant models such as c-LSTM and AGHMN are slightly better than party-dependent models on this dataset. Our BiERU models utilize GNTB to perform context compositionality and achieve a state-of-the-art average accuracy of 60.84%, outperforming AGHMN by 0.54% and DialogueRNN by 4.74%.

IV-C2 Comparison between BiERU-gc and BiERU-lc

The proposed two variants take different context inputs. The BiERU-gc model takes the output of GNTB at the last time step and the current utterance as the input of GNTB at the current time step, while the BiERU-lc model uses the last utterance and the current utterance as the input of GNTB at the current time step. According to the experimental results in Tables II and III, the overall performance of BiERU-lc is better than that of BiERU-gc.

For the IEMOCAP dataset, the BiERU-lc model surpasses the BiERU-gc model by 0.74% and 0.45% in terms of weighted average accuracy and f1-score, respectively. For the AVEC and MELD datasets, BiERU-lc also outperforms its counterpart. One possible explanation is that the context information of a contextual utterance vector in BiERU-gc comes from all utterances in the current conversation, whereas in BiERU-lc the context information comes from neighboring utterances. In this case, the context information of BiERU-gc contains redundant information and thus has a negative impact on emotion feature extraction.

Fig. 4:

Heat map of confusion matrix of BiERU-lc.

IV-D Case Study

Fig. 5 illustrates a conversation snippet classified by the BiERU-lc method. In this snippet, person A is initially in a frustrated state while person B acts as a listener in the beginning. Then, person A changes his/her focus and questions person B about his/her job state. Person B tries to use his/her own experience to help person A get rid of the frustrated state. This snippet reveals that the sentiment of a speaker is relatively steady and that the interaction between speakers may change a speaker's sentiment. Our BiERU-lc method shows good ability in capturing the speaker's sentiment (turns 9, 11, 12, 14) and the interaction between speakers (turn 10). The sentiment in turn 13 is very subtle. Turn 13 contains a little bit of frustration since he/she is not satisfied with his/her job state. However, considering that person B attempts to help person A, turn 13 is more likely to be in a neutral stance.

Fig. 5: Illustration of a conversation snippet classified by BiERU-lc.

IV-E Visualization

We use visualization to provide some insights into the proposed model. First, we visualize the confusion matrix in the form of a heat map to describe the performance of our BiERU-lc model. The heat map of BiERU-lc on the IEMOCAP dataset is shown in Fig. 4. Our model has a balanced performance over all the sentiment classes.

Second, we perform a deeper analysis of our proposed model and DialogueRNN by visualizing the learned emotion feature representations on IEMOCAP, as shown in Fig. 6. Vectors fed into the last dense layer followed by softmax for classification are regarded as the emotion feature representations of utterances. We use principal component analysis [37] to reduce the emotion representations from our model (BiERU-lc) and DialogueRNN to three dimensions. In Fig. 6, each color represents a predicted sentiment label, and the same color means the same sentiment label. The figures show that our model performs better at extracting emotion features of utterances labeled “happy”, which is consistent with the results in Table II.

Fig. 6: Visualization of learned emotion features via dimensionality reduction.

IV-F Efficiency Analysis

Our proposed model has advantages over DialogueRNN, the only competitive method with public source code, in terms of convergence, number of trainable parameters, and training time. From the training curve in Fig. 7(a), our model shows comparable convergence speed with its counterpart, but DialogueRNN turns out to be more prone to overfitting. Furthermore, BiERU with low-rank matrix approximation has fewer trainable parameters. For the 100D feature input of the IEMOCAP dataset, it has about 0.5M parameters, while DialogueRNN requires around 1M. For the 600D MELD dataset, DialogueRNN has 2.9M parameters, while our model only has 0.6M. With much fewer parameters, our model consequently trains faster than its counterpart, as shown in Fig. 7(b), where training time is logged on a single NVIDIA GeForce GTX 965M. Our model is more parameter-efficient and less time-consuming to train.
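A back-of-envelope parameter count illustrates why the low-rank factorization helps; the slice shapes assumed here (d tensor slices of size 2d x 2d plus a d x 2d linear map) are our reading of the GNTB definition, not the released code's exact layer inventory:

```python
def gntb_params(d, rank=None):
    """Rough GNTB parameter count: d tensor slices of size 2d x 2d plus a
    d x 2d linear map W. With a rank-r factorization, each slice costs
    2 * (2d * r) parameters instead of (2d)^2."""
    if rank is None:
        tensor = d * (2 * d) ** 2          # full tensor T
    else:
        tensor = d * 2 * (2 * d) * rank    # two low-rank factors per slice
    return tensor + d * 2 * d              # plus W

full = gntb_params(100)            # 100-dim utterance features (IEMOCAP setting)
low = gntb_params(100, rank=1)     # rank-1 approximation
```

Under these assumptions, the full tensor alone would dwarf the rest of the model, while the low-rank version shrinks it by roughly two orders of magnitude, consistent with the ~0.5M total reported above.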

(a) Training curve
(b) Time consumption
Fig. 7: Training curve and time consumption. dRNN is the abbreviation of DialogueRNN.

IV-G Ablation Study

To further explore our proposed BiERU model, we perform ablation study on its two main components, i.e., GNTB and EFE. We conduct experiments on the IEMOCAP dataset with individual GNTB and EFE module separately, and their combination, i.e., the complete BiERU. Experimental results on the IEMOCAP dataset are illustrated in Table IV.

The performance of GNTB or EFE alone is low in terms of accuracy and f1-score. The reason is that the outputs of GNTB mainly contain context information while the outputs of EFE lack context information. However, when the two modules are combined into the BiERU model, the accuracy and f1-score increase dramatically, which proves the effectiveness of our BiERU model. More importantly, the GNTB and EFE modules couple remarkably well to enhance the performance.

GNTB EFE Accuracy F1-score
- + 55.45 55.17
+ - 49.85 49.42
+ + 65.93 64.63
TABLE IV: Results of ablated BiERU on the IEMOCAP dataset. Accuracy and F1-score are weighted average.

V Conclusion

In this paper, we proposed a fast, compact and parameter-efficient party-ignorant framework, the bidirectional emotional recurrent unit (BiERU), for sentiment analysis in conversations. Our proposed generalized neural tensor block (GNTB) performs context compositionality, reduces the number of parameters, and is suitable for different structures. Additionally, our EFE is capable of extracting high-quality emotion features for sentiment analysis. We showed that it is feasible to simplify the model structure and improve performance simultaneously.

Our model outperforms current state-of-the-art models on three standard datasets in most cases. In addition, our method has the ability to model conversations with arbitrary turns and speakers, which we plan to study further in the future.


This research is supported by the Agency for Science, Technology and Research (A*STAR) under its AME Programmatic Funding Scheme (Project #A18A2b0046).


  • [1] Y. Ma, K. L. Nguyen, F. Xing, and E. Cambria, “A survey on empathetic dialogue systems,” Information Fusion, 2020.
  • [2] N. Majumder, S. Poria, D. Hazarika, R. Mihalcea, A. Gelbukh, and E. Cambria, “DialogueRNN: An attentive RNN for emotion detection in conversations,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 6818–6825.
  • [3] W. Jiao, M. R. Lyu, and I. King, “Real-time emotion recognition via attention gated hierarchical memory network,” arXiv preprint arXiv:1911.09075, 2019.
  • [4] S. Poria, E. Cambria, D. Hazarika, N. Majumder, A. Zadeh, and L.-P. Morency, “Context-dependent sentiment analysis in user-generated videos,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017, pp. 873–883.
  • [5] D. Hazarika, S. Poria, A. Zadeh, E. Cambria, L.-P. Morency, and R. Zimmermann, “Conversational memory network for emotion recognition in dyadic dialogue videos,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018, pp. 2122–2132.
  • [6] D. Ghosal, N. Majumder, S. Poria, N. Chhaya, and A. Gelbukh, “DialogueGCN: A graph convolutional neural network for emotion recognition in conversation,” arXiv preprint arXiv:1908.11540, 2019.
  • [7] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [8] Y. Kim, “Convolutional neural networks for sentence classification,” arXiv preprint arXiv:1408.5882, 2014.
  • [9] Z. Wang, S. Ho, and E. Cambria, “A review of emotion sensing: Categorization models and algorithms,” Multimedia Tools and Applications, 2020.
  • [10] N. Howard and E. Cambria, “Intention awareness: Improving upon situation awareness in human-centric environments,” Human-centric Computing and Information Sciences, vol. 3, no. 9, 2013.
  • [11] X. Chen, M. D. Sykora, T. W. Jackson, and S. Elayan, “What about mood swings: Identifying depression on twitter with temporal measures of emotions,” in Companion Proceedings of the The Web Conference 2018.   International World Wide Web Conferences Steering Committee, 2018, pp. 1653–1660.
  • [12] S. Ji, S. Pan, X. Li, E. Cambria, G. Long, and Z. Huang, “Suicidal ideation detection: A review of machine learning methods and applications,” arXiv preprint arXiv:1910.12611, 2020.
  • [13] E. Cambria, “Affective computing and sentiment analysis,” IEEE Intelligent Systems, vol. 31, no. 2, pp. 102–107, 2016.
  • [14] W. Li, K. Guo, Y. Shi, L. Zhu, and Y. Zheng, “Dwwp: Domain-specific new words detection and word propagation system for sentiment analysis in the tourism domain,” Knowledge-Based Systems, vol. 146, pp. 203–214, 2018.
  • [15] F. Xing, E. Cambria, and R. Welsch, “Natural language based financial forecasting: A survey,” Artificial Intelligence Review, vol. 50, no. 1, pp. 49–73, 2018.
  • [16] E. Cambria, M. Grassi, A. Hussain, and C. Havasi, “Sentic computing for social media marketing,” Multimedia Tools and Applications, vol. 59, no. 2, pp. 557–577, 2012.
  • [17] A. Bandhakavi, N. Wiratunga, D. Padmanabhan, and S. Massie, “Lexicon based feature extraction for emotion text classification,” Pattern recognition letters, vol. 93, pp. 133–142, 2017.
  • [18] E. Cambria, S. Poria, D. Hazarika, and K. Kwok, “SenticNet 5: Discovering conceptual primitives for sentiment analysis by means of context embeddings,” in AAAI, 2018, pp. 1795–1802.
  • [19] M. S. Akhtar, A. Ekbal, and E. Cambria, “How intense are you? predicting intensities of emotions and sentiments using stacked ensemble,” IEEE Computational Intelligence Magazine, vol. 15, no. 1, pp. 64–75, 2020.
  • [20] W. Ragheb, J. Azé, S. Bringay, and M. Servajean, “Attention-based modeling for emotion detection and classification in textual conversations,” arXiv preprint arXiv:1906.07020, 2019.
  • [21] S. Sukhbaatar, J. Weston, R. Fergus et al., “End-to-end memory networks,” in Advances in neural information processing systems, 2015, pp. 2440–2448.
  • [22] D. Hazarika, S. Poria, R. Mihalcea, E. Cambria, and R. Zimmermann, “ICON: Interactive conversational memory network for multimodal emotion detection,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 2594–2604.
  • [23] S. Wang, G. Peng, Z. Zheng, and Z. Xu, “Capturing emotion distribution for multimedia emotion tagging,” IEEE Transactions on Affective Computing, 2019.
  • [24] R. Socher, D. Chen, C. D. Manning, and A. Ng, “Reasoning with neural tensor networks for knowledge base completion,” in Advances in neural information processing systems, 2013, pp. 926–934.
  • [25] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and C. Potts, “Recursive deep models for semantic compositionality over a sentiment treebank,” in Proceedings of the 2013 conference on empirical methods in natural language processing, 2013, pp. 1631–1642.
  • [26] N. Majumder, S. Poria, H. Peng, N. Chhaya, E. Cambria, and A. Gelbukh, “Sentiment and sarcasm classification with multitask learning,” IEEE Intelligent Systems, vol. 34, no. 3, pp. 38–43, 2019.
  • [27] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, “IEMOCAP: Interactive emotional dyadic motion capture database,” Language resources and evaluation, vol. 42, no. 4, p. 335, 2008.
  • [28] V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807–814.
  • [29] A. Graves and J. Schmidhuber, “Framewise phoneme classification with bidirectional lstm and other neural network architectures,” Neural networks, vol. 18, no. 5-6, pp. 602–610, 2005.
  • [30] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [31] B. Schuller, M. Valster, F. Eyben, R. Cowie, and M. Pantic, “AVEC 2012: the continuous audio/visual emotion challenge,” in Proceedings of the 14th ACM international conference on Multimodal interaction.   ACM, 2012, pp. 449–456.
  • [32] S. Poria, D. Hazarika, N. Majumder, G. Naik, E. Cambria, and R. Mihalcea, “MELD: A multimodal multi-party dataset for emotion recognition in conversations,” in ACL, 2019, pp. 527–536.
  • [33] G. McKeown, M. Valstar, R. Cowie, M. Pantic, and M. Schroder, “The semaine database: Annotated multimodal records of emotionally colored conversations between a person and a limited agent,” IEEE Transactions on Affective Computing, vol. 3, no. 1, pp. 5–17, 2011.
  • [34] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder–decoder for statistical machine translation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1724–1734.
  • [35] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The journal of machine learning research, vol. 15, no. 1, pp. 1929–1958, 2014.
  • [36] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “Pytorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems, 2019, pp. 8024–8035.
  • [37] S. Wold, K. Esbensen, and P. Geladi, “Principal component analysis,” Chemometrics and intelligent laboratory systems, vol. 2, no. 1-3, pp. 37–52, 1987.