Multimodal Sentiment Analysis using Hierarchical Fusion with Context Modeling

06/16/2018 ∙ by N. Majumder, et al. ∙ 0

Multimodal sentiment analysis is a very actively growing field of research. A promising area of opportunity in this field is to improve the multimodal fusion mechanism. We present a novel feature fusion strategy that proceeds in a hierarchical fashion, first fusing the modalities two by two and only then fusing all three modalities. On multimodal sentiment analysis of individual utterances, our strategy outperforms conventional concatenation of features by 1%. On multimodal sentiment analysis of multi-utterance video clips, for which current state-of-the-art techniques incorporate contextual information from other utterances of the same clip, our hierarchical fusion gives an improvement of up to 2.4% (almost 10% error rate reduction). The implementation of our method is publicly available in the form of open-source code.




1 Introduction

On numerous social media platforms, such as YouTube, Facebook, or Instagram, people share their opinions on all kinds of topics in the form of posts, images, and video clips. With the proliferation of smartphones and tablets, which has greatly boosted content sharing, people increasingly share their opinions on newly released products or on other topics in the form of video reviews or comments. This is an excellent opportunity for large companies to capitalize on, by extracting user sentiment, suggestions, and complaints on their products from these video reviews. This information also opens new horizons to improving our quality of life by making informed decisions on the choice of products we buy, services we use, places we visit, or movies we watch, based on the experience and opinions of other users.

Videos convey information through three channels: audio, video, and text (in the form of speech). Mining opinions from this plethora of multimodal data calls for a solid multimodal sentiment analysis technology. One of the major problems faced in multimodal sentiment analysis is the fusion of features pertaining to different modalities. For this, the majority of the recent works in multimodal sentiment analysis have simply concatenated the feature vectors of different modalities. However, this does not take into account that different modalities may carry conflicting information. We hypothesize that the fusion method we present in this paper deals with this issue better, and present experimental evidence showing improvement over simple concatenation of feature vectors. Also, following the state of the art (Poria et al., 2017a), we employ a recurrent neural network (RNN) to propagate contextual information between utterances in a video clip, which significantly improves the classification results and outperforms the state of the art by a significant margin of 1–2% for all the modality combinations.

In our method, we first obtain unimodal features for each utterance for all three modalities. Then, using an RNN, we extract context-aware utterance features. Next, we transform the context-aware utterance vectors to vectors of the same dimensionality. We assume that these transformed vectors contain abstract features representing the attributes relevant to sentiment classification. We then compare and combine each bimodal combination of these abstract features using fully-connected layers. This yields fused bimodal feature vectors. Similarly to the unimodal case, we use an RNN to generate context-aware features. Finally, we combine these bimodal vectors into a trimodal vector using, again, fully-connected layers, and use an RNN to pass contextual information between them. We empirically show that the feature vectors obtained in this manner are more useful for the sentiment classification task.

The implementation of our method is publicly available in the form of open-source code.

This paper is structured as follows: Section 2 briefly discusses important previous work in multimodal feature fusion; Section 3 describes our method in detail; Section 4 reports the results of our experiments and discusses their implications; finally, Section 5 concludes the paper and discusses future work.

2 Related Work

In recent years, sentiment analysis has become increasingly popular for processing social media data on online communities, blogs, wikis, microblogging platforms, and other online collaborative media (Cambria, 2016). Sentiment analysis is a branch of affective computing research (Poria et al., 2017b) that aims to classify text – but sometimes also audio and video (Hazarika et al., 2018) – into either positive or negative – but sometimes also neutral (Chaturvedi et al., 2018a). Most of the literature is on English language but recently an increasing number of works are tackling the multilinguality issue (Lo et al., 2017; Dashtipour et al., 2016), especially in booming online languages such as Chinese (Peng et al., 2018). Sentiment analysis techniques can be broadly categorized into symbolic and sub-symbolic approaches: the former include the use of lexicons (Bandhakavi et al., 2017), ontologies (Dragoni et al., 2018), and semantic networks (Cambria et al., 2018) to encode the polarity associated with words and multiword expressions; the latter consist of supervised (Oneto et al., 2016), semi-supervised (Hussain and Cambria, 2018) and unsupervised (Li et al., 2017) machine learning techniques that perform sentiment classification based on word co-occurrence frequencies. Among these, the most popular recently are algorithms based on deep neural networks (Young et al., 2018a) and generative adversarial networks (Li et al., 2018).

While most works approach it as a simple categorization problem, sentiment analysis is actually a suitcase research problem (Cambria et al., 2017) that requires tackling many NLP tasks, including word polarity disambiguation (Xia et al., 2015), subjectivity detection (Chaturvedi et al., 2018b), personality recognition (Majumder et al., 2017), microtext normalization (Satapathy et al., 2017), concept extraction (Rajagopal et al., 2013), time tagging (Zhong et al., 2017), and aspect extraction (Ma et al., 2018).

Sentiment analysis has raised growing interest both within the scientific community, leading to many exciting open challenges, as well as in the business world, due to the remarkable benefits to be had from financial (Xing et al., 2018) and political (Ebrahimi et al., 2017) forecasting, e-health (Cambria et al., 2010) and e-tourism (Valdivia et al., 2017), user profiling (Mihalcea and Garimella, 2016) and community detection (Cavallari et al., 2017), manufacturing and supply chain applications (Xu et al., 2017), human communication comprehension (Zadeh et al., 2018) and dialogue systems (Young et al., 2018b), etc.

In the field of emotion recognition, early works by De Silva et al. (1997) and Chen et al. (1998) showed that fusion of audio and visual systems, creating a bimodal signal, yielded a higher accuracy than any unimodal system. Such fusion has been analyzed at both feature level (Kessous et al., 2010) and decision level (Schuller, 2011).

Although much work has been done on audio-visual fusion for emotion recognition, the contribution of text along with the audio and visual modalities in multimodal emotion detection has been little explored. Wollmer et al. (2013) and Rozgic et al. (2012) fused information from audio, visual and textual modalities to extract emotion and sentiment. Metallinou et al. (2008) and Eyben et al. (2010a) fused audio and textual modalities for emotion recognition. Both approaches relied on feature-level fusion. Wu and Liang (2011) fused audio and textual clues at decision level. Poria et al. (2016) use a convolutional neural network (CNN) to extract features from the modalities and then employ multiple-kernel learning (MKL) for sentiment analysis. The current state of the art, set forth by Poria et al. (2017a), extracts contextual information from the surrounding utterances using long short-term memory (LSTM). Poria et al. (2017b) fuse different modalities with deep learning-based tools. Zadeh et al. (2017) use tensor fusion. Poria et al. (2017c) further extend the ensemble of CNN and MKL.

Unlike existing approaches, which use simple concatenation based early fusion (Poria et al., 2016, 2015) and non-trainable tensors based fusion (Zadeh et al., 2017), this work proposes a hierarchical fusion capable of learning the bimodal and trimodal correlations for data fusion using deep neural networks. The method is end-to-end and, in order to accomplish the fusion, it can be plugged into any deep neural network based multimodal sentiment analysis framework.

3 Our Method

In this section, we discuss our novel methodology for solving the sentiment classification problem. First we give an overview of our method, and then we discuss the whole method in detail, step by step.

3.1 Overview

3.1.1 Unimodal Feature Extraction

We extract utterance-level features for three modalities. This step is discussed in Section 3.2.

3.1.2 Multimodal Fusion

Problems of early fusion

The majority of the work on multimodal data uses concatenation, or early fusion (Fig. 1), as its fusion strategy. The problem with this simplistic approach is that it cannot filter out conflicting or redundant information obtained from different modalities. To address this major issue, we devise a hierarchical approach which proceeds from unimodal to bimodal vectors and then from bimodal to trimodal vectors.

Figure 1: Utterance-level early fusion, or simple concatenation

Bimodal fusion

We fuse the utterance feature vectors for each bimodal combination, i.e., T+V, T+A, and A+V. This step is depicted in Fig. 2 and discussed in detail in Section 3.4. We use the output of the penultimate layer in Fig. 2 as bimodal features.

Figure 2: Utterance-level bimodal fusion

Trimodal fusion

We fuse the three bimodal features to obtain the trimodal feature, as depicted in Fig. 3. This step is discussed in detail in Section 3.4.

Figure 3: Utterance-level trimodal hierarchical fusion. (Figure adapted from Majumder (2017) with permission.)

Addition of context

We also improve the quality of feature vectors (both unimodal and multimodal) by incorporating information from surrounding utterances using an RNN. We model the context using a gated recurrent unit (GRU), as depicted in Fig. 4. The details of context modeling are discussed in Section 3.3 and the following subsections.

Figure 4: Context-aware hierarchical fusion

We classify the feature vectors using a softmax layer.

3.2 Unimodal Feature Extraction

In this section, we discuss the method of feature extraction for three different modalities: audio, video, and text.

3.2.1 Textual Feature Extraction

The textual data is obtained from the transcripts of the videos. We apply a deep convolutional neural network (CNN) (Karpathy et al., 2014) on each utterance to extract textual features. Each utterance in the text is represented as an array of pre-trained 300-dimensional word2vec vectors (Mikolov et al., 2013). Further, the utterances are truncated or padded with null vectors to have exactly 50 words.
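The truncation and padding step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `pad_or_truncate` and the toy inputs are hypothetical, while the constants 300 and 50 come from the text above.

```python
from typing import List

EMBED_DIM = 300  # word2vec dimensionality stated in the text
MAX_WORDS = 50   # fixed utterance length stated in the text


def pad_or_truncate(utterance: List[List[float]],
                    max_words: int = MAX_WORDS,
                    dim: int = EMBED_DIM) -> List[List[float]]:
    """Return exactly `max_words` word vectors.

    Long utterances are cut after `max_words` words; short ones are
    padded at the end with null (all-zero) vectors.
    """
    clipped = utterance[:max_words]
    padding = [[0.0] * dim for _ in range(max_words - len(clipped))]
    return clipped + padding


# Hypothetical toy utterances: one with 3 words, one with 70 words.
short_utt = [[0.1] * EMBED_DIM for _ in range(3)]
long_utt = [[0.2] * EMBED_DIM for _ in range(70)]
print(len(pad_or_truncate(short_utt)), len(pad_or_truncate(long_utt)))
```

Padding at the end (rather than the beginning) is an assumption of this sketch; the text only states that null vectors are used.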

Next, these utterances, as arrays of vectors, are passed through two different convolutional layers: the first layer has two filters of size 3 and 4, respectively, with 50 feature maps each, and the second layer has a filter of size 2 with 100 feature maps. Each convolutional layer is followed by a max-pooling layer.

The output of the second max-pooling layer is fed to a fully-connected layer with 500 neurons with a rectified linear unit (ReLU) (Teh and Hinton, 2001) activation, followed by softmax output. The output of the penultimate fully-connected layer is used as the textual feature. The translation of the convolution filter over the input makes the CNN learn abstract features, and with each subsequent layer the context of the features expands further.

3.2.2 Audio Feature Extraction

The audio feature extraction process is performed at 30 Hz frame rate with 100 ms sliding window. We use openSMILE (Eyben et al., 2010b), which is capable of automatic pitch and voice intensity extraction, for audio feature extraction. Prior to feature extraction, audio signals are processed with voice intensity thresholding and voice normalization. Specifically, we use Z-standardization for voice normalization. In order to filter out audio segments without voice, we threshold voice intensity. OpenSMILE is used to perform both these steps. Using openSMILE we extract several Low Level Descriptors (LLD) (e.g., pitch, voice intensity) and various statistical functionals of them (e.g., amplitude mean, arithmetic mean, root quadratic mean, standard deviation, flatness, skewness, kurtosis, quartiles, inter-quartile ranges, and linear regression slope). The “IS13-ComParE” configuration file of openSMILE is used for our purposes. In total, we extract 6392 features from each input audio segment.
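To make the two preprocessing operations concrete, the sketch below shows Z-standardization and intensity thresholding on plain Python lists. In the paper both steps are performed by openSMILE; the function names `z_standardize` and `keep_voiced`, the threshold value, and the toy numbers are assumptions made only for illustration.

```python
from statistics import mean, pstdev
from typing import List


def z_standardize(values: List[float]) -> List[float]:
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant signal carries no variation; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def keep_voiced(intensities: List[float], threshold: float) -> List[float]:
    """Drop frames whose voice intensity falls below the threshold."""
    return [x for x in intensities if x >= threshold]


# Hypothetical per-frame intensity values.
frames = [0.1, 0.9, 0.5, 0.05, 0.7]
voiced = keep_voiced(frames, threshold=0.5)
print(voiced)
print(z_standardize(voiced))
```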

3.2.3 Visual Feature Extraction

To extract visual features, we focus not only on feature extraction from each video frame but also try to model temporal features across frames. To achieve this, we use a 3D-CNN on the video. 3D-CNNs have been successful in the past, especially in the field of object classification on 3D data (Ji et al., 2013). Their state-of-the-art performance on such tasks motivates their use in this paper.

Let the video be called $\mathit{vid} \in \mathbb{R}^{3 \times f \times h \times w}$, where $3$ represents the three RGB channels of an image and $f$, $h$, and $w$ denote the cardinality, height, and width of the frames, respectively. A 3D convolutional filter, named $f_{lt} \in \mathbb{R}^{f_m \times 3 \times f_d \times f_h \times f_w}$, is applied to this video, where, similar to a 2D-CNN, the filter translates across the video and generates the convolution output $\mathit{conv}_{out} \in \mathbb{R}^{f_m \times 3 \times (f - f_d + 1) \times (h - f_h + 1) \times (w - f_w + 1)}$. Here, $f_m$, $f_d$, $f_h$, and $f_w$ denote the number of feature maps, depth of the filter, height of the filter, and width of the filter, respectively. Finally, we apply a max-pooling operation to $\mathit{conv}_{out}$, which selects the most relevant features. This operation is applied only to the last three dimensions of $\mathit{conv}_{out}$. This is followed by a dense layer and softmax computation. The activations of the dense layer are used as the overall video features for each utterance video.
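The shape bookkeeping of the 3D convolution and pooling can be checked with a few lines of arithmetic. This sketch assumes stride 1 and no padding for the convolution, and non-overlapping pooling windows with floor division; the clip size, filter size, and window size below are hypothetical values chosen for the example, not the settings used in the experiments.

```python
from typing import Tuple

Shape3 = Tuple[int, int, int]  # (frames, height, width)


def conv3d_valid_shape(video: Shape3, filt: Shape3) -> Shape3:
    """Output size of a stride-1, unpadded 3D convolution along each axis."""
    return tuple(v - k + 1 for v, k in zip(video, filt))


def maxpool3d_shape(shape: Shape3, window: Shape3) -> Shape3:
    """Output size of non-overlapping max-pooling (partial windows dropped)."""
    return tuple(s // w for s, w in zip(shape, window))


# Hypothetical clip: 16 frames of 64x64 pixels, 3x3x3 filter, 2x2x2 pooling.
conv_shape = conv3d_valid_shape((16, 64, 64), (3, 3, 3))
pool_shape = maxpool3d_shape(conv_shape, (2, 2, 2))
print(conv_shape, pool_shape)
```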

In our experiments, we receive the best results with filter dimensions and . Also, for the max-pooling, we set the window size as and the succeeding dense layer with neurons.

3.3 Context Modeling

Utterances in the videos are semantically dependent on each other. In other words, the complete meaning of an utterance may be determined by taking preceding utterances into consideration. We call this the context of an utterance. Following Poria et al. (2017a), we use an RNN, specifically a GRU (LSTM does not perform well), to model semantic dependency among the utterances in a video.

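Let the following items represent unimodal features:

$f_A \in \mathbb{R}^{N \times d_A}$ (acoustic features),
$f_V \in \mathbb{R}^{N \times d_V}$ (visual features),
$f_T \in \mathbb{R}^{N \times d_T}$ (textual features),

where $N$ = maximum number of utterances in a video. We pad the shorter videos with dummy utterances represented by null vectors of corresponding length. For each modality, we feed the unimodal utterance features $f_m$ (where $m \in \{A, V, T\}$) (discussed in Section 3.2) of a video to $GRU_m$ with output size $D_m$, which is defined as

$$z_m = \sigma(f_{mt} U^{mz} + s_{m(t-1)} W^{mz}),$$
$$r_m = \sigma(f_{mt} U^{mr} + s_{m(t-1)} W^{mr}),$$
$$h_{mt} = \tanh(f_{mt} U^{mh} + (s_{m(t-1)} * r_m) W^{mh}),$$
$$F_{mt} = \tanh(h_{mt} U^{mx} + u^{mx}),$$
$$s_{mt} = (1 - z_m) * F_{mt} + z_m * s_{m(t-1)},$$

where $U^{mz} \in \mathbb{R}^{d_m \times D_m}$, $W^{mz} \in \mathbb{R}^{D_m \times D_m}$, $U^{mr} \in \mathbb{R}^{d_m \times D_m}$, $W^{mr} \in \mathbb{R}^{D_m \times D_m}$, $U^{mh} \in \mathbb{R}^{d_m \times D_m}$, $W^{mh} \in \mathbb{R}^{D_m \times D_m}$, $U^{mx} \in \mathbb{R}^{D_m \times D_m}$, $u^{mx} \in \mathbb{R}^{D_m}$, $z_m \in \mathbb{R}^{D_m}$, $r_m \in \mathbb{R}^{D_m}$, $h_{mt} \in \mathbb{R}^{D_m}$, $F_{mt} \in \mathbb{R}^{D_m}$, and $s_{mt} \in \mathbb{R}^{D_m}$. This yields hidden outputs $F_{mt}$ as context-aware unimodal features for each modality. Hence, we define $F_m = GRU_m(f_m)$, where $F_m \in \mathbb{R}^{N \times D_m}$. Thus, the context-aware multimodal features can be defined as

$$F_A = GRU_A(f_A), \quad F_V = GRU_V(f_V), \quad F_T = GRU_T(f_T).$$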
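The idea of passing context from earlier utterances to later ones can be illustrated with a small recurrent cell. The sketch below implements a textbook GRU cell (update gate, reset gate, candidate state) in pure Python; it is not claimed to match the exact variant used in the paper, and the class name `GRUCell`, the random initialization range, and the toy dimensions are all assumptions.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[Vector]


def matvec(x: Vector, W: Matrix) -> Vector:
    """Row vector x (length n) times matrix W (n x m)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def rand_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class GRUCell:
    """Textbook GRU cell with input size d_in and state size d_out."""

    def __init__(self, d_in: int, d_out: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.d_out = d_out
        self.Uz, self.Ur, self.Uh = (rand_matrix(d_in, d_out, rng) for _ in range(3))
        self.Wz, self.Wr, self.Wh = (rand_matrix(d_out, d_out, rng) for _ in range(3))

    def step(self, x: Vector, s_prev: Vector) -> Vector:
        z = [sigmoid(a + b) for a, b in zip(matvec(x, self.Uz), matvec(s_prev, self.Wz))]
        r = [sigmoid(a + b) for a, b in zip(matvec(x, self.Ur), matvec(s_prev, self.Wr))]
        gated = [s * g for s, g in zip(s_prev, r)]
        h = [math.tanh(a + b) for a, b in zip(matvec(x, self.Uh), matvec(gated, self.Wh))]
        # Convex combination of the candidate state and the previous state.
        return [(1 - zi) * hi + zi * si for zi, hi, si in zip(z, h, s_prev)]


def contextualize(utterances: Matrix, cell: GRUCell) -> Matrix:
    """Run the cell over the utterances of one video, in order."""
    state = [0.0] * cell.d_out
    outputs = []
    for x in utterances:
        state = cell.step(x, state)
        outputs.append(state)
    return outputs


# Hypothetical video with N = 3 utterances and 4-dimensional unimodal features.
video = [[0.5, -0.1, 0.3, 0.0], [0.2, 0.4, -0.6, 0.1], [0.0, 0.0, 0.0, 0.0]]
cell = GRUCell(d_in=4, d_out=3)
context_aware = contextualize(video, cell)
print(len(context_aware), len(context_aware[0]))
```

Note that the last utterance in the toy video is a null vector, as used for padding, yet its output is non-zero because it inherits state from the preceding utterances.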

3.4 Multimodal Fusion

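In this section, we map the context-aware unimodal features $F_A$, $F_V$, and $F_T$ to a unified feature space.

The unimodal features may have different dimensions, i.e., $D_A \neq D_V \neq D_T$. Thus, we map them to the same dimension, say $D$ (a hyperparameter that we tuned for best results), using a fully-connected layer as follows:

$$g_A = \tanh(F_A W_A + b_A),$$
$$g_V = \tanh(F_V W_V + b_V),$$
$$g_T = \tanh(F_T W_T + b_T),$$

where $W_A \in \mathbb{R}^{D_A \times D}$, $b_A \in \mathbb{R}^{D}$, $W_V \in \mathbb{R}^{D_V \times D}$, $b_V \in \mathbb{R}^{D}$, $W_T \in \mathbb{R}^{D_T \times D}$, and $b_T \in \mathbb{R}^{D}$. We can represent the mapping for each dimension as

$$g_x = \begin{bmatrix} c^x_{11} & c^x_{21} & \cdots & c^x_{D1} \\ c^x_{12} & c^x_{22} & \cdots & c^x_{D2} \\ \vdots & \vdots & \ddots & \vdots \\ c^x_{1N} & c^x_{2N} & \cdots & c^x_{DN} \end{bmatrix},$$

where $x \in \{V, A, T\}$ and $c^x_{lt}$ are scalars for all $l = 1, 2, \ldots, D$ and $t = 1, 2, \ldots, N$. Also, in $g_x$ the rows represent the utterances and the columns the feature values. We can see these values as more abstract feature values derived from fundamental feature values (which are the components of $f_A$, $f_V$, and $f_T$). For example, an abstract feature can be the angriness of a speaker in a video. We can infer the degree of angriness from visual features ($f_V$; facial muscle movements), acoustic features ($f_A$, such as pitch and raised voice), or textual features ($f_T$, such as the language and choice of words). Therefore, the degree of angriness can be represented by $c^x_{lt}$, where $x$ is $V$, $A$, or $T$, $l$ is some fixed integer between $1$ and $D$, and $t$ is some fixed integer between $1$ and $N$.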

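Now, the evaluation of abstract feature values from all the modalities may not have the same merit or may even contradict each other. Hence, we need the network to make comparison among the feature values derived from different modalities to make a more refined evaluation of the degree of anger. To this end, we take each bimodal combination (which are audio–video, audio–text, and video–text) at a time and compare and combine each of their respective abstract feature values (i.e. $c^V_{lt}$ with $c^A_{lt}$, $c^A_{lt}$ with $c^T_{lt}$, and $c^V_{lt}$ with $c^T_{lt}$) using fully-connected layers as follows:

$$i^{VA}_{lt} = \tanh\left(w^{VA}_l \cdot [c^V_{lt}, c^A_{lt}]^\top + b^{VA}_l\right), \quad (1)$$
$$i^{AT}_{lt} = \tanh\left(w^{AT}_l \cdot [c^A_{lt}, c^T_{lt}]^\top + b^{AT}_l\right), \quad (2)$$
$$i^{VT}_{lt} = \tanh\left(w^{VT}_l \cdot [c^V_{lt}, c^T_{lt}]^\top + b^{VT}_l\right), \quad (3)$$

where $w^{VA}_l \in \mathbb{R}^2$, $b^{VA}_l$ is scalar, $w^{AT}_l \in \mathbb{R}^2$, $b^{AT}_l$ is scalar, $w^{VT}_l \in \mathbb{R}^2$, and $b^{VT}_l$ is scalar, for all $l = 1, 2, \ldots, D$ and $t = 1, 2, \ldots, N$. We hypothesize that it will enable the network to compare the decisions from each modality against the others and help achieve a better fusion of modalities.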
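The per-feature comparison described above can be sketched as a small function: for every utterance and every abstract feature index, the two modality values are combined by a two-weight unit with a tanh activation. The function name `bimodal_fusion`, the variable names, and all numeric values are hypothetical; in the model the weights are learned, whereas here they are fixed by hand for illustration.

```python
import math
from typing import List

Matrix = List[List[float]]  # N utterances (rows) x D abstract features (columns)


def bimodal_fusion(g_x: Matrix, g_y: Matrix,
                   w: List[List[float]], b: List[float]) -> Matrix:
    """Fuse two modalities feature by feature.

    w[l] holds the two weights and b[l] the bias of abstract feature l;
    the same weights are shared by all utterances.
    """
    fused = []
    for row_x, row_y in zip(g_x, g_y):
        fused.append([
            math.tanh(w[l][0] * cx + w[l][1] * cy + b[l])
            for l, (cx, cy) in enumerate(zip(row_x, row_y))
        ])
    return fused


# Hypothetical toy values: N = 2 utterances, D = 3 abstract features.
g_V = [[0.9, -0.2, 0.1], [0.4, 0.0, -0.7]]
g_A = [[-0.8, -0.1, 0.3], [0.5, 0.2, -0.6]]
w_VA = [[1.0, 0.2], [0.5, 0.5], [0.3, 0.9]]
b_VA = [0.0, 0.0, 0.0]
f_VA = bimodal_fusion(g_V, g_A, w_VA, b_VA)
print(f_VA)
```

In the first feature of the first utterance the two modalities disagree (0.9 versus -0.8); with the hand-picked weights (1.0, 0.2) the video value dominates the fused output, which is the kind of arbitration between modalities that the learned weights are meant to provide.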

Bimodal fusion

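Eqs. 1 to 3 are used for bimodal fusion. The bimodal fused features for video–audio, audio–text, video–text are defined as

$$f_{VA} = (f_{VA1}, f_{VA2}, \ldots, f_{VAN}), \text{ where } f_{VAt} = (i^{VA}_{1t}, i^{VA}_{2t}, \ldots, i^{VA}_{Dt}),$$
$$f_{AT} = (f_{AT1}, f_{AT2}, \ldots, f_{ATN}), \text{ where } f_{ATt} = (i^{AT}_{1t}, i^{AT}_{2t}, \ldots, i^{AT}_{Dt}),$$
$$f_{VT} = (f_{VT1}, f_{VT2}, \ldots, f_{VTN}), \text{ where } f_{VTt} = (i^{VT}_{1t}, i^{VT}_{2t}, \ldots, i^{VT}_{Dt}).$$

We further employ $GRU_m$ (Section 3.3) ($m \in \{VA, AT, VT\}$) to incorporate contextual information among the utterances in a video with

$$F_{VA} = (F_{VA1}, F_{VA2}, \ldots, F_{VAN}) = GRU_{VA}(f_{VA}),$$
$$F_{AT} = (F_{AT1}, F_{AT2}, \ldots, F_{ATN}) = GRU_{AT}(f_{AT}),$$
$$F_{VT} = (F_{VT1}, F_{VT2}, \ldots, F_{VTN}) = GRU_{VT}(f_{VT}),$$

where $F_{mt} = (I^{m}_{1t}, I^{m}_{2t}, \ldots, I^{m}_{D_2 t})$. $F_{VA}$, $F_{AT}$, and $F_{VT}$ are context-aware bimodal features represented as vectors and $I^{m}_{nt}$ is scalar for $n = 1, 2, \ldots, D_2$, $t = 1, 2, \ldots, N$, and $m \in \{VA, AT, VT\}$, with $D_2$ the output size of the bimodal GRUs.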

Trimodal fusion

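We combine all three modalities using fully-connected layers as follows:

$$z_{lt} = \tanh\left(w^{AVT}_l \cdot [I^{VA}_{lt}, I^{AT}_{lt}, I^{VT}_{lt}]^\top + b^{AVT}_l\right),$$

where $w^{AVT}_l \in \mathbb{R}^3$ and $b^{AVT}_l$ is a scalar for all $l = 1, 2, \ldots, D_2$ and $t = 1, 2, \ldots, N$. So, we define the fused features as

$$f_{AVT} = (f_{AVT1}, f_{AVT2}, \ldots, f_{AVTN}),$$

where $f_{AVTt} = (z_{1t}, z_{2t}, \ldots, z_{D_2 t})$, $z_{nt}$ is scalar for $n = 1, 2, \ldots, D_2$ and $t = 1, 2, \ldots, N$.

Similarly to bimodal fusion (Section 3.4), after trimodal fusion we pass the fused features through $GRU_{AVT}$ to incorporate contextual information in them, which yields

$$F_{AVT} = (F_{AVT1}, F_{AVT2}, \ldots, F_{AVTN}) = GRU_{AVT}(f_{AVT}),$$

where $F_{AVTt} = (Z_{1t}, Z_{2t}, \ldots, Z_{D_3 t})$, $Z_{nt}$ is scalar for $n = 1, 2, \ldots, D_3$ and $t = 1, 2, \ldots, N$, $D_3$ is the output size of $GRU_{AVT}$, and $F_{AVT}$ is the context-aware trimodal feature vector.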
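The trimodal step follows the same pattern as the bimodal one, with three inputs per unit instead of two. The sketch below is illustrative only: the function name `trimodal_fusion`, the equal weights, and the toy feature values are assumptions, and the GRU pass that follows fusion in the model is omitted.

```python
import math
from typing import List

Matrix = List[List[float]]  # N utterances (rows) x feature values (columns)


def trimodal_fusion(F_va: Matrix, F_at: Matrix, F_vt: Matrix,
                    w: List[List[float]], b: List[float]) -> Matrix:
    """Combine three bimodal feature matrices feature by feature.

    w[l] holds the three weights and b[l] the bias of feature index l.
    """
    fused = []
    for r_va, r_at, r_vt in zip(F_va, F_at, F_vt):
        fused.append([
            math.tanh(w[l][0] * a + w[l][1] * c + w[l][2] * d + b[l])
            for l, (a, c, d) in enumerate(zip(r_va, r_at, r_vt))
        ])
    return fused


# Hypothetical toy values: one utterance, two features, equal weights.
F_va = [[0.3, -0.6]]
F_at = [[0.6, 0.0]]
F_vt = [[0.9, 0.6]]
w_avt = [[1 / 3, 1 / 3, 1 / 3], [1 / 3, 1 / 3, 1 / 3]]
b_avt = [0.0, 0.0]
f_avt = trimodal_fusion(F_va, F_at, F_vt, w_avt, b_avt)
print(f_avt)
```

With equal weights each fused value is simply the tanh of the mean of the three bimodal values; in the second feature the conflicting values -0.6 and 0.6 cancel out.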

3.5 Classification

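In order to perform classification, we feed the fused features $F_{mt}$ (where $m = AV, VT, TA$, or $AVT$ and $t = 1, 2, \ldots, N$) to a softmax layer with $C = 2$ outputs. The classifier can be described as follows:

$$\mathcal{P} = \text{softmax}(W_{softmax} F_{mt} + b_{softmax}),$$
$$\hat{y} = \underset{j}{\text{argmax}}(\mathcal{P}[j]),$$

where $W_{softmax} \in \mathbb{R}^{C \times D_F}$ (with $D_F$ the dimensionality of $F_{mt}$), $b_{softmax} \in \mathbb{R}^{C}$, $\mathcal{P} \in \mathbb{R}^{C}$, $j$ = class value ($0$ or $1$), and $\hat{y}$ = estimated class value.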
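A softmax classifier over a fused feature vector can be sketched in a few lines. The function names `softmax` and `classify`, the weight matrix, and the feature vector below are hypothetical toy values; only the structure (linear layer, softmax, argmax) follows the description above.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax (subtracts the maximum before exponentiating)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(feature: List[float], W: List[List[float]],
             b: List[float]) -> Tuple[int, List[float]]:
    """Return the estimated class index and the class probabilities."""
    logits = [sum(w * f for w, f in zip(row, feature)) + bias
              for row, bias in zip(W, b)]
    probs = softmax(logits)
    y_hat = max(range(len(probs)), key=probs.__getitem__)
    return y_hat, probs


# Hypothetical 3-dimensional fused feature and a 2-class weight matrix.
feature = [0.5, -0.2, 0.1]
W = [[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]]
b = [0.0, 0.0]
y_hat, probs = classify(feature, W, b)
print(y_hat, probs)
```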

3.6 Training

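We employ categorical cross-entropy as loss function ($J$) for training,

$$J = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=0}^{C-1} y_{ij} \log \mathcal{P}_i[j],$$

where $N$ = number of samples, $i$ = index of a sample, $j$ = class value, and $y_{ij} = 1$ if the expected class value of sample $i$ is $j$, and $0$ otherwise.

Adam (Kingma and Ba, 2014) is used as optimizer due to its ability to adapt the learning rate for each parameter individually. We train the network for 200 epochs with early stopping, where we optimize the parameter set $\theta$, which comprises the weights and biases of the GRUs (Section 3.3), of the fully-connected fusion layers (Section 3.4), and of the softmax layer (Section 3.5). Algorithm 1 summarizes our method.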

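procedure TrainAndTestModel(U, V)    ▷ U = train set, V = test set
    Unimodal feature extraction:
    for i:[1,N] do    ▷ extract baseline features
        f_A^i ← AudioFeatures(u_i)
        f_V^i ← VideoFeatures(u_i)
        f_T^i ← TextFeatures(u_i)
    for m ∈ {A, V, T} do
        F_m = GRU_m(f_m)
    Fusion:
    g_A ← MapToSpace(F_A)    ▷ dimensionality equalization
    g_V ← MapToSpace(F_V)
    g_T ← MapToSpace(F_T)
    f_VA ← BimodalFusion(g_V, g_A)    ▷ bimodal fusion
    f_AT ← BimodalFusion(g_A, g_T)
    f_VT ← BimodalFusion(g_V, g_T)
    for m ∈ {VA, AT, VT} do
        F_m = GRU_m(f_m)
    f_AVT ← TrimodalFusion(F_VA, F_AT, F_VT)    ▷ trimodal fusion
    F_AVT = GRU_AVT(f_AVT)
    for i:[1,N] do    ▷ softmax classification
        ŷ^i = argmax_j(softmax(F_AVT^i)[j])
    TestModel(V)

procedure MapToSpace(x_z)    ▷ for modality z
    g_z ← tanh(x_z W_z + b_z)
    return g_z

procedure BimodalFusion(g_z1, g_z2)    ▷ for modality z1 and z2, where z1 ≠ z2
    for i:[1,D] do
        f_z1z2^i ← tanh(w_i^z1z2 · [g_z1^i, g_z2^i]ᵀ + b_i^z1z2)
    f_z1z2 ← (f_z1z2^1, f_z1z2^2, …, f_z1z2^D)
    return f_z1z2

procedure TrimodalFusion(f_z1, f_z2, f_z3)    ▷ for modality combination z1, z2, and z3, where z1 ≠ z2 ≠ z3
    for i:[1,D] do
        f_z1z2z3^i ← tanh(w_i · [f_z1^i, f_z2^i, f_z3^i]ᵀ + b_i)
    f_z1z2z3 ← (f_z1z2z3^1, f_z1z2z3^2, …, f_z1z2z3^D)
    return f_z1z2z3

procedure TestModel(V)
    Similarly to the training phase, V is passed through the learnt models to get the features and classification outputs. Section 3.6 mentions the trainable parameters (θ).

Algorithm 1 Context-Aware Hierarchical Fusion Algorithm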
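The categorical cross-entropy loss used for training (Section 3.6) can be sketched as follows. Because the target is one-hot, the inner sum over classes reduces to the log-probability of the expected class. The function name `categorical_cross_entropy` and the toy probabilities are assumptions made for illustration.

```python
import math
from typing import List


def categorical_cross_entropy(probs: List[List[float]],
                              labels: List[int]) -> float:
    """Mean negative log-probability assigned to the expected class.

    probs[i][j] is the predicted probability of class j for sample i;
    labels[i] is the expected class index of sample i.
    """
    n = len(labels)
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / n


# Hypothetical predictions for three samples in a two-class problem.
predictions = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
expected = [0, 1, 0]
print(categorical_cross_entropy(predictions, expected))
```

A confident correct prediction contributes little to the loss, while the undecided third sample contributes log 2.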

4 Experiments

4.1 Dataset Details

Most research works in multimodal sentiment analysis are performed on datasets where train and test splits may share certain speakers. Since each individual has a unique way of expressing emotions and sentiments, finding generic and person-independent features for sentiment analysis is crucial. Table 1 shows the train and test split for the datasets used.

Dataset    Train                                      Test
           pos.   neg.   happy   anger   sad   neu.   pos.   neg.   happy   anger   sad   neu.
MOSI       709    738    -       -       -     -      467    285    -       -       -     -
IEMOCAP    -      -      1194    933     839   1324   -      -      433     157     238   380

pos. = positive, neg. = negative, neu. = neutral
Table 1: Class distribution of datasets in both train and test splits.

4.1.1 CMU-MOSI

The CMU-MOSI dataset (Zadeh et al., 2016) is rich in sentimental expressions, where 89 people review various topics in English. The videos are segmented into utterances, where each utterance is annotated with scores between −3 (strongly negative) and +3 (strongly positive) by five annotators. We took the average of these five annotations as the sentiment polarity and considered only two classes (positive and negative). Given every individual’s unique way of expressing sentiments, real-world applications should be able to model generic person-independent features and be robust to person variance. To this end, we perform person-independent experiments to emulate unseen conditions. Our train/test splits of the dataset are completely disjoint with respect to speakers. The train/validation set consists of the first 62 individuals in the dataset. The test set contains opinionated videos by the remaining 31 speakers. In particular, 1447 and 752 utterances are used for training and testing, respectively.
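A speaker-disjoint split of this kind can be sketched as below. The function name `speaker_independent_split`, the tuple layout, and the speaker identifiers are hypothetical; the only property taken from the text is that the first speakers go to training and no speaker appears on both sides.

```python
from typing import List, Tuple

Utterance = Tuple[str, str]  # (speaker_id, utterance_id)


def speaker_independent_split(utterances: List[Utterance],
                              n_train_speakers: int
                              ) -> Tuple[List[Utterance], List[Utterance]]:
    """Assign the first `n_train_speakers` speakers (in order of first
    appearance) to the train set and all remaining speakers to the test set."""
    speakers: List[str] = []
    for speaker, _ in utterances:
        if speaker not in speakers:
            speakers.append(speaker)
    train_ids = set(speakers[:n_train_speakers])
    train = [u for u in utterances if u[0] in train_ids]
    test = [u for u in utterances if u[0] not in train_ids]
    return train, test


# Hypothetical data: three speakers with a few utterances each.
data = [("s1", "u1"), ("s1", "u2"), ("s2", "u3"), ("s3", "u4"), ("s2", "u5")]
train_set, test_set = speaker_independent_split(data, n_train_speakers=2)
print(train_set, test_set)
```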

4.1.2 IEMOCAP

IEMOCAP (Busso et al., 2008) contains two-way conversations among ten speakers, segmented into utterances. The utterances are tagged with the labels anger, happiness, sadness, neutral, excitement, frustration, fear, surprise, and other. We consider the first four to compare with the state of the art (Poria et al., 2017a) and other works. It contains 1083 angry, 1630 happy, 1083 sad, and 1683 neutral videos. Only the videos by the first eight speakers are considered for training.

4.2 Baselines

We compare our method with the following strong baselines.

Early fusion

We extract unimodal features (Section 3.2) and simply concatenate them to produce multimodal features. A support vector machine (SVM) is then applied to this feature vector for the final sentiment classification.
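For contrast with hierarchical fusion, the feature-building step of the early fusion baseline amounts to a plain concatenation, sketched below. The function name `early_fusion` and the toy vectors are hypothetical, and the SVM classifier applied afterwards is not shown.

```python
from typing import List


def early_fusion(text_vec: List[float], audio_vec: List[float],
                 video_vec: List[float]) -> List[float]:
    """Concatenate the unimodal feature vectors of one utterance."""
    return list(text_vec) + list(audio_vec) + list(video_vec)


# Hypothetical unimodal features of different sizes.
t = [0.2, 0.7]
a = [0.1, 0.0, -0.3]
v = [0.5]
fused = early_fusion(t, a, v)
print(fused)
```

The concatenated vector keeps every unimodal value unchanged, so any conflict between modalities is left for the downstream classifier to resolve.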

Method from (Poria et al., 2016)

We have implemented and compared our method with the approach proposed by Poria et al. (2016). In their approach, they extracted visual features using CLM-Z, audio features using openSMILE, and textual features using CNN. MKL was then applied to the features obtained from concatenation of the unimodal features. However, they did not conduct speaker independent experiments.

In order to perform a fair comparison with (Poria et al., 2016), we employ our fusion method on the features extracted by Poria et al. (2016).

Method from (Poria et al., 2017a)

We have compared our method with Poria et al. (2017a), which takes advantage of contextual information obtained from the surrounding utterances. This context modeling is achieved using LSTM. We reran the experiments of Poria et al. (2017a) without using SVM for classification, since using SVM with neural networks is usually discouraged. This provides a fair comparison with our model, which does not use SVM.

Method from (Zadeh et al., 2017)

Zadeh et al. (2017) proposed a trimodal fusion method based on tensors. We have also compared our method with theirs. In particular, their dataset configuration was different from ours, so we adapted their publicly available code and employed it on our dataset.

4.3 Experimental Setting

We considered two variants of experimental setup while evaluating our model.


Hierarchical Fusion (HFusion)

In this setup, we evaluated hierarchical fusion without context-aware features on the CMU-MOSI dataset. We removed all the GRUs from the model described in Sections 3.3 and 3.4 and forwarded utterance-specific features directly to the next layer. This setup is depicted in Fig. 3.

Context-Aware Hierarchical Fusion (CHFusion)

This setup is exactly the model described in Section 3.

4.4 Results and Discussion

We discuss the results for the different experimental settings discussed in Section 4.3.

Modality       (Poria et al., 2016) feature set                 Our feature set
Combination    Poria et al.   Zadeh et al.   HFusion            Early fusion   Zadeh et al.   HFusion
               (2016)         (2017)                                           (2017)
T              N/A                                              75.0%
V              N/A                                              55.3%
A              N/A                                              56.9%
T+V            73.2%          73.8%          74.4%              77.1%          77.4%          77.8%
T+A            73.2%          73.5%          74.2%              77.1%          76.3%          77.3%
A+V            55.7%          56.2%          57.5%              56.5%          56.1%          56.8%
A+V+T          73.5%          71.2%          74.6%              77.0%          77.3%          77.9%

Table 2: Comparison in terms of accuracy of Hierarchical Fusion (HFusion) with other fusion methods for the CMU-MOSI dataset, where T stands for text, V for video, and A for audio. Unimodal rows report a single accuracy per feature set.

4.4.1 Hierarchical Fusion (HFusion)

The results of our experiments are presented in Table 2. We evaluated this setup with CMU-MOSI dataset (Section 4.1.1) and two feature sets: the feature set used in (Poria et al., 2016) and the set of unimodal features discussed in Section 3.2.

Our model outperformed Poria et al. (2016), which employed MKL, for all bimodal and trimodal scenarios by a margin of 1–1.8%. This leads us to present two observations. First, the features used in (Poria et al., 2016) are inferior to the features extracted in our approach. Second, our hierarchical fusion method is better than their fusion method.

It is already established in the literature (Poria et al., 2016; Pérez-Rosas et al., 2013) that multimodal analysis outperforms unimodal analysis. We observe the same trend in our experiments, where trimodal and bimodal classifiers outperform unimodal classifiers. The textual modality performed best among the three, with a unimodal classification accuracy of 75%. Although the other modalities contribute to improving the performance of multimodal classifiers, that contribution is small compared to the textual modality.

On the other hand, we compared our model with early fusion (Section 4.2) for the aforementioned feature sets (Section 3.2). Our fusion mechanism consistently outperforms early fusion for all combinations of modalities. This supports our hypothesis that our hierarchical fusion method captures the inter-relation among the modalities and yields better performance than early fusion. Text is the strongest individual modality, and we observe that the text modality paired with the remaining two modalities results in consistent performance improvement.

Overall, the results give a strong indication that the comparison among the abstract feature values dampens the effect of less important modalities, which was our hypothesis. For example, we can notice that for early fusion T+V and T+A both yield the same performance. However, with our method text with video performs better than text with audio, which is more aligned with our expectations, since facial muscle movements usually carry more emotional nuances than voice.

In particular, we observe that our model outperformed all the strong baselines mentioned above. The method by Poria et al. (2016) is only able to fuse using concatenation. Our proposed method outperformed their approach by a significant margin, thanks to the power of hierarchical fusion, which demonstrates the capability of our method in modeling bimodal and trimodal correlations. The method by Zadeh et al. (2017), on the other hand, is capable of fusing the modalities using a tensor. Interestingly, our method also outperformed theirs; we think the reason is that our method learns the bimodal fusion and then uses it for trimodal fusion. The tensor fusion network is incapable of learning the weights of the bimodal and trimodal correlations in the fusion: tensor fusion is mathematically formed by an outer product and has no learnable parameters. In contrast, our method learns the weights automatically using a neural network (Eqs. 1 to 3).
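The point that an outer product has no learnable parameters can be seen in a small sketch. Appending a constant 1 to each vector before taking the outer product is how tensor fusion is commonly described for Zadeh et al. (2017); treat that detail, the function name `outer_with_bias`, and the toy vectors as assumptions of this illustration rather than a reproduction of their code.

```python
from typing import List


def outer_with_bias(u: List[float], v: List[float]) -> List[List[float]]:
    """Outer product of [u; 1] and [v; 1].

    The result contains every pairwise product u_i * v_j, plus copies of
    u and v themselves in the last column and last row. No weights are
    involved, so nothing in this operation can be trained.
    """
    u1 = list(u) + [1.0]
    v1 = list(v) + [1.0]
    return [[a * b for b in v1] for a in u1]


# Hypothetical 2-dimensional and 1-dimensional modality vectors.
fused = outer_with_bias([0.5, -1.0], [2.0])
for row in fused:
    print(row)
```

By comparison, each fusion unit in Eqs. 1 to 3 has its own weights and bias, so the relative influence of each modality on each abstract feature is learned from data.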

Modality    CMU-MOSI                                  IEMOCAP
            Poria     Zadeh     CHFusion              Poria     Zadeh     CHFusion     CHFusion
            et al.    et al.    (accuracy)            et al.    et al.    (accuracy)   (f-score)
T           76.5%                                     73.6%                            -
V           54.9%                                     53.3%                            -
A           55.3%                                     57.1%                            -
T+V         77.8%     77.1%     79.3%                 74.1%     73.7%     75.9%        75.6%
T+A         77.3%     77.0%     79.1%                 73.7%     71.1%     76.1%        76.0%
A+V         57.9%     56.5%     58.8%                 68.4%     67.4%     69.5%        69.6%
A+V+T       78.7%     77.2%     80.0%                 74.1%     73.6%     76.5%        76.8%

Table 3: Comparison of Context-Aware Hierarchical Fusion (CHFusion) in terms of accuracy and f-score (for IEMOCAP) with the state of the art for the CMU-MOSI and IEMOCAP datasets, where T stands for text, V for video, and A for audio. Poria et al. = (Poria et al., 2016), Zadeh et al. = (Zadeh et al., 2017). Unimodal rows report a single accuracy per dataset.

4.4.2 Context-Aware Hierarchical Fusion (CHFusion)

The results of this experiment are shown in Table 3. This setting fully utilizes the model described in Section 3. We applied this experimental setting to two datasets, namely CMU-MOSI (Section 4.1.1) and IEMOCAP (Section 4.1.2). We used the feature set discussed in Section 3.2, which was also used by Poria et al. (2017a). As expected, our method outperformed the simple early fusion by Poria et al. (2016) and the tensor fusion by Zadeh et al. (2017). The method by Poria et al. (2017a) used a scheme to learn contextual features from the surrounding features. However, as a method of fusion they adopted the simple concatenation-based fusion method by Poria et al. (2016). As discussed in Section 3.3, we employed their contextual feature extraction framework and integrated our proposed fusion method into it. This has helped us to outperform Poria et al. (2017a) by a significant margin thanks to the hierarchical fusion (HFusion).


CMU-MOSI

We achieve 1–2% performance improvement over the state of the art (Poria et al., 2017a) for all the modality combinations having a textual component. For the A+V modality combination we achieve better but similar performance to the state of the art. We suspect that this is due to both the audio and video modalities being significantly less informative than the textual modality. This is evident from the unimodal performance, where we observe that the textual modality on its own performs around 21% better than both the audio and video modalities. Also, the audio and video modalities perform close to the majority baseline. On the other hand, it is important to notice that with all modalities combined we achieve about 3.5% higher accuracy than text alone.

For example, consider the following utterance: “so overall new moon even with the bigger better budgets huh it was still too long”. The speaker discusses her opinion on the movie Twilight: New Moon. Textually, the utterance is abundant with positive words; however, the audio and video contain a frown, which is picked up by the hierarchical fusion based model.


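IEMOCAP

As the IEMOCAP dataset contains four distinct emotion categories, in the last layer of the network we used a softmax classifier whose output dimension is set to 4. In order to perform classification on the IEMOCAP dataset, we feed the fused features $F_{mt}$ (where $m = AV, VT, TA$, or $AVT$ and $t = 1, 2, \ldots, N$) to a softmax layer with $C = 4$ outputs. The classifier can be described as follows:

$$\mathcal{P} = \text{softmax}(W_{softmax} F_{mt} + b_{softmax}),$$
$$\hat{y} = \underset{j}{\text{argmax}}(\mathcal{P}[j]),$$

where $W_{softmax} \in \mathbb{R}^{C \times D_F}$ (with $D_F$ the dimensionality of $F_{mt}$), $b_{softmax} \in \mathbb{R}^{C}$, $\mathcal{P} \in \mathbb{R}^{C}$, $j$ = class value ($0$ or $1$ or $2$ or $3$), and $\hat{y}$ = estimated class value.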

Metrics     Classes
            Happy   Sad    Neutral   Anger
Accuracy    74.3    75.6   78.4      79.6
F-Score     81.4    77.0   71.2      77.6

Table 4: Class-wise accuracy and f-score for IEMOCAP dataset for trimodal scenario.

Here as well, we achieve performance improvement consistent with CMU-MOSI. This method performs 1–2.4% better than the state of the art for all the modality combinations. Also, trimodal accuracy is 3% higher than that of the textual modality. Since the IEMOCAP dataset is imbalanced, we also present the f-score for each modality combination for a better evaluation. One key observation for the IEMOCAP dataset is that its A+V modality combination performs significantly better than that of the CMU-MOSI dataset. We think that this is due to the audio and video modalities of IEMOCAP being richer than those of CMU-MOSI. The performance difference with another strong baseline (Zadeh et al., 2017) is even larger, ranging from 2.1% to 3% on the CMU-MOSI dataset and from 2.2% to 5% on the IEMOCAP dataset. This again confirms the superiority of the hierarchical fusion compared to (Zadeh et al., 2017). We think this is mainly because our method learns the weights of the bimodal and trimodal correlations (representing the degree of correlation) at the time of fusion, while the Tensor Fusion Network (TFN) relies on the non-trainable outer product of tensors to model such correlations for fusion. Additionally, we present class-wise accuracy and f-score for IEMOCAP for the trimodal (A+V+T) scenario in Table 4.
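Why the f-score is informative on an imbalanced dataset can be seen from a small worked example. The sketch below computes a per-class f-score from scratch; the function name `per_class_f_scores`, the labels, and the predictions are hypothetical toy data, not results from the experiments.

```python
from typing import Dict, List


def per_class_f_scores(y_true: List[str], y_pred: List[str],
                       classes: List[str]) -> Dict[str, float]:
    """F-score (harmonic mean of precision and recall) for each class."""
    scores = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores[c] = 2 * precision * recall / total if total else 0.0
    return scores


# Hypothetical imbalanced data: a classifier that always predicts "neutral".
y_true = ["neutral"] * 8 + ["sad"] * 2
y_pred = ["neutral"] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, per_class_f_scores(y_true, y_pred, ["neutral", "sad"]))
```

The trivial classifier reaches 80% accuracy on this toy data, yet its f-score for the minority class is zero, which is the kind of failure that accuracy alone would hide.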

4.4.3 HFusion vs. CHFusion

We compare the HFusion and CHFusion models on the CMU-MOSI dataset. We observe that CHFusion performs 1–2% better than HFusion for all modality combinations. This performance boost comes from adding utterance-level contextual information to the HFusion model by means of GRUs at different levels of the fusion hierarchy.
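To illustrate how a GRU injects utterance-level context, the sketch below runs a single unidirectional GRU over the feature vectors of the utterances in one clip, so that the output for each utterance depends on the utterances before it. It is a simplified stand-in under stated assumptions: the parameters are random rather than trained, the sizes are toy values, the gate convention shown is one of two common ones, and the names `gru_step`, `contextualize`, and `random_params` are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, x, U, h, b):
    """Compute W x + U h + b with plain lists."""
    return [sum(wi * xi for wi, xi in zip(w_row, x))
            + sum(ui * hi for ui, hi in zip(u_row, h)) + bi
            for w_row, u_row, bi in zip(W, U, b)]

def gru_step(x, h, p):
    """One GRU update: gates decide how much of the previous state to keep."""
    z = [sigmoid(s) for s in affine(p["Wz"], x, p["Uz"], h, p["bz"])]  # update gate
    r = [sigmoid(s) for s in affine(p["Wr"], x, p["Ur"], h, p["br"])]  # reset gate
    rh = [ri * hi for ri, hi in zip(r, h)]
    cand = [math.tanh(s) for s in affine(p["Wh"], x, p["Uh"], rh, p["bh"])]
    # Blend previous state and candidate state (conventions differ on which
    # of the two is weighted by z; this is one common choice).
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]

def contextualize(utterances, p, hidden_size):
    """Run the GRU across a clip; output t depends on utterances 1..t."""
    h = [0.0] * hidden_size
    outputs = []
    for x in utterances:
        h = gru_step(x, h, p)
        outputs.append(h)
    return outputs

def random_params(input_size, hidden_size, seed=0):
    """Untrained parameters, for illustration only."""
    rng = random.Random(seed)
    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    p = {}
    for gate in "zrh":
        p["W" + gate] = mat(hidden_size, input_size)
        p["U" + gate] = mat(hidden_size, hidden_size)
        p["b" + gate] = [0.0] * hidden_size
    return p

# Three utterances of one clip; the first and third have identical features.
clip = [[0.1, 0.3, -0.2], [0.0, 0.5, 0.4], [0.1, 0.3, -0.2]]
params = random_params(input_size=3, hidden_size=2)
context_features = contextualize(clip, params, hidden_size=2)
```

Although the first and third utterances share the same input features, their context-aware outputs differ because the hidden state carries information from the preceding utterances.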

5 Conclusion

The multimodal fusion strategy is an important issue in multimodal sentiment analysis. However, little work has been done so far in this direction. In this paper, we have presented a novel and comprehensive fusion strategy. Our method outperforms the widely used early fusion on both datasets typically used to test multimodal sentiment analysis methods. Moreover, with the addition of context modeling with GRUs, our method outperforms the state of the art in multimodal sentiment analysis and emotion detection by a significant margin.

In our future work, we plan to improve the quality of unimodal features, especially textual features, which will further improve the accuracy of classification. We will also experiment with more sophisticated network architectures.


The work was partially supported by the Instituto Politécnico Nacional via grant SIP 20172008 to A. Gelbukh.


  • Poria et al. (2017a) S. Poria, E. Cambria, D. Hazarika, N. Mazumder, A. Zadeh, L.-P. Morency, Context-dependent sentiment analysis in user-generated videos, in: ACL, 873–883, 2017a.
  • Cambria (2016) E. Cambria, Affective Computing and Sentiment Analysis, IEEE Intelligent Systems 31 (2) (2016) 102–107.
  • Poria et al. (2017b) S. Poria, E. Cambria, R. Bajpai, A. Hussain, A review of affective computing: From unimodal analysis to multimodal fusion, Information Fusion 37 (2017b) 98–125.
  • Hazarika et al. (2018) D. Hazarika, S. Poria, A. Zadeh, E. Cambria, L.-P. Morency, R. Zimmermann, Conversational memory network for emotion recognition in dyadic dialogue videos, in: NAACL, 2122–2132, 2018.
  • Chaturvedi et al. (2018a) I. Chaturvedi, E. Cambria, R. Welsch, F. Herrera, Distinguishing between facts and opinions for sentiment analysis: Survey and challenges, Information Fusion 44 (2018a) 65–77.
  • Lo et al. (2017) S. L. Lo, E. Cambria, R. Chiong, D. Cornforth, Multilingual sentiment analysis: From formal to informal and scarce resource languages, Artificial Intelligence Review 48 (4) (2017) 499–527.
  • Dashtipour et al. (2016) K. Dashtipour, S. Poria, A. Hussain, E. Cambria, A. Y. Hawalah, A. Gelbukh, Q. Zhou, Multilingual sentiment analysis: state of the art and independent comparison of techniques, Cognitive computation 8 (4) (2016) 757–771.
  • Peng et al. (2018) H. Peng, Y. Ma, Y. Li, E. Cambria, Learning Multi-grained Aspect Target Sequence for Chinese Sentiment Analysis, in: Knowledge-Based Systems, vol. 148, 167–176, 2018.
  • Bandhakavi et al. (2017) A. Bandhakavi, N. Wiratunga, S. Massie, P. Deepak, Lexicon generation for emotion analysis of text, IEEE Intelligent Systems 32 (1) (2017) 102–108.
  • Dragoni et al. (2018) M. Dragoni, S. Poria, E. Cambria, OntoSenticNet: A Commonsense Ontology for Sentiment Analysis, IEEE Intelligent Systems 33 (3).
  • Cambria et al. (2018) E. Cambria, S. Poria, D. Hazarika, K. Kwok, SenticNet 5: Discovering conceptual primitives for sentiment analysis by means of context embeddings, in: AAAI, 1795–1802, 2018.
  • Oneto et al. (2016) L. Oneto, F. Bisio, E. Cambria, D. Anguita, Statistical learning theory and ELM for big social data analysis, IEEE Computational Intelligence Magazine 11 (3) (2016) 45–55.
  • Hussain and Cambria (2018) A. Hussain, E. Cambria, Semi-supervised learning for big social data analysis, Neurocomputing 275 (2018) 1662–1673.
  • Li et al. (2017) Y. Li, Q. Pan, T. Yang, S. Wang, J. Tang, E. Cambria, Learning word representations for sentiment analysis, Cognitive Computation 9 (6) (2017) 843–851.
  • Young et al. (2018a) T. Young, D. Hazarika, S. Poria, E. Cambria, Recent trends in deep learning based natural language processing, IEEE Computational Intelligence Magazine 13 (3).
  • Li et al. (2018) Y. Li, Q. Pan, S. Wang, T. Yang, E. Cambria, A Generative Model for category text generation, Information Sciences 450 (2018) 301–315.
  • Cambria et al. (2017) E. Cambria, S. Poria, A. Gelbukh, M. Thelwall, Sentiment Analysis is a Big Suitcase, IEEE Intelligent Systems 32 (6) (2017) 74–80.
  • Xia et al. (2015) Y. Xia, E. Cambria, A. Hussain, H. Zhao, Word Polarity Disambiguation Using Bayesian Model and Opinion-Level Features, Cognitive Computation 7 (3) (2015) 369–380.
  • Chaturvedi et al. (2018b) I. Chaturvedi, E. Ragusa, P. Gastaldo, R. Zunino, E. Cambria, Bayesian network based extreme learning machine for subjectivity detection, Journal of The Franklin Institute 355 (4) (2018b) 1780–1797.
  • Majumder et al. (2017) N. Majumder, S. Poria, A. Gelbukh, E. Cambria, Deep learning-based document modeling for personality detection from text, IEEE Intelligent Systems 32 (2) (2017) 74–79.
  • Satapathy et al. (2017) R. Satapathy, C. Guerreiro, I. Chaturvedi, E. Cambria, Phonetic-Based Microtext Normalization for Twitter Sentiment Analysis, in: ICDM, 407–413, 2017.
  • Rajagopal et al. (2013) D. Rajagopal, E. Cambria, D. Olsher, K. Kwok, A graph-based approach to commonsense concept extraction and semantic similarity detection, in: WWW, 565–570, 2013.
  • Zhong et al. (2017) X. Zhong, A. Sun, E. Cambria, Time expression analysis and recognition using syntactic token types and general heuristic rules, in: ACL, 420–429, 2017.
  • Ma et al. (2018) Y. Ma, H. Peng, E. Cambria, Targeted aspect-based sentiment analysis via embedding commonsense knowledge into an attentive LSTM, in: AAAI, 5876–5883, 2018.
  • Xing et al. (2018) F. Xing, E. Cambria, R. Welsch, Natural Language Based Financial Forecasting: A Survey, Artificial Intelligence Review 50 (1) (2018) 49–73.
  • Ebrahimi et al. (2017) M. Ebrahimi, A. Hossein, A. Sheth, Challenges of sentiment analysis for dynamic events, IEEE Intelligent Systems 32 (5) (2017) 70–75.
  • Cambria et al. (2010) E. Cambria, A. Hussain, T. Durrani, C. Havasi, C. Eckl, J. Munro, Sentic Computing for Patient Centered Application, in: IEEE ICSP, 1279–1282, 2010.
  • Valdivia et al. (2017) A. Valdivia, V. Luzon, F. Herrera, Sentiment analysis in TripAdvisor, IEEE Intelligent Systems 32 (4) (2017) 72–77.
  • Mihalcea and Garimella (2016) R. Mihalcea, A. Garimella, What Men Say, What Women Hear: Finding Gender-Specific Meaning Shades, IEEE Intelligent Systems 31 (4) (2016) 62–67.
  • Cavallari et al. (2017) S. Cavallari, V. Zheng, H. Cai, K. Chang, E. Cambria, Learning community embedding with community detection and node embedding on graphs, in: CIKM, 377–386, 2017.
  • Xu et al. (2017) C. Xu, E. Cambria, P. S. Tan, Adaptive Two-Stage Feature Selection for Sentiment Classification, in: IEEE SMC, 1238–1243, 2017.
  • Zadeh et al. (2018) A. Zadeh, P. P. Liang, S. Poria, P. Vij, E. Cambria, L.-P. Morency, Multi-attention recurrent network for human communication comprehension, in: AAAI, 5642–5649, 2018.
  • Young et al. (2018b) T. Young, E. Cambria, I. Chaturvedi, H. Zhou, S. Biswas, M. Huang, Augmenting End-to-End Dialogue Systems with Commonsense Knowledge, in: AAAI, 4970–4977, 2018b.
  • De Silva et al. (1997) L. C. De Silva, T. Miyasato, R. Nakatsu, Facial emotion recognition using multi-modal information, in: Proceedings of ICICS, vol. 1, IEEE, 397–401, 1997.
  • Chen et al. (1998) L. S. Chen, T. S. Huang, T. Miyasato, R. Nakatsu, Multimodal human emotion/expression recognition, in: Proceedings of the Third IEEE International Conference on Automatic Face and Gesture Recognition, IEEE, 366–371, 1998.
  • Kessous et al. (2010) L. Kessous, G. Castellano, G. Caridakis, Multimodal emotion recognition in speech-based interaction using facial expression, body gesture and acoustic analysis, Journal on Multimodal User Interfaces 3 (1-2) (2010) 33–48.
  • Schuller (2011) B. Schuller, Recognizing affect from linguistic information in 3D continuous space, IEEE Transactions on Affective Computing 2 (4) (2011) 192–205.
  • Wollmer et al. (2013) M. Wollmer, F. Weninger, T. Knaup, B. Schuller, C. Sun, K. Sagae, L.-P. Morency, Youtube movie reviews: Sentiment analysis in an audio-visual context, IEEE Intelligent Systems 28 (3) (2013) 46–53.
  • Rozgic et al. (2012) V. Rozgic, S. Ananthakrishnan, S. Saleem, R. Kumar, R. Prasad, Ensemble of SVM trees for multimodal emotion recognition, in: Signal & Information Processing Association Annual Summit and Conference (APSIPA ASC), 2012 Asia-Pacific, IEEE, 1–4, 2012.
  • Metallinou et al. (2008) A. Metallinou, S. Lee, S. Narayanan, Audio-visual emotion recognition using gaussian mixture models for face and voice, in: Tenth IEEE International Symposium on ISM 2008, IEEE, 250–257, 2008.
  • Eyben et al. (2010a) F. Eyben, M. Wöllmer, A. Graves, B. Schuller, E. Douglas-Cowie, R. Cowie, On-line emotion recognition in a 3-D activation-valence-time continuum using acoustic and linguistic cues, Journal on Multimodal User Interfaces 3 (1-2) (2010a) 7–19.
  • Wu and Liang (2011) C.-H. Wu, W.-B. Liang, Emotion recognition of affective speech based on multiple classifiers using acoustic-prosodic information and semantic labels, IEEE Transactions on Affective Computing 2 (1) (2011) 10–21.
  • Poria et al. (2016) S. Poria, I. Chaturvedi, E. Cambria, A. Hussain, Convolutional MKL based multimodal emotion recognition and sentiment analysis, in: ICDM, Barcelona, 439–448, 2016.
  • Zadeh et al. (2017) A. Zadeh, M. Chen, S. Poria, E. Cambria, L.-P. Morency, Tensor Fusion Network for Multimodal Sentiment Analysis, in: EMNLP, 1114–1125, 2017.
  • Poria et al. (2017c) S. Poria, H. Peng, A. Hussain, N. Howard, E. Cambria, Ensemble application of convolutional neural networks and multiple kernel learning for multimodal sentiment analysis, Neurocomputing 261 (2017c) 217–230.
  • Poria et al. (2015) S. Poria, E. Cambria, A. Gelbukh, Deep convolutional neural network textual features and multiple kernel learning for utterance-level multimodal sentiment analysis, in: EMNLP, 2539–2544, 2015.
  • Karpathy et al. (2014) A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, L. Fei-Fei, Large-scale video classification with convolutional neural networks, in: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 1725–1732, 2014.
  • Mikolov et al. (2013) T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, arXiv preprint arXiv:1301.3781.
  • Teh and Hinton (2001) V. Teh, G. E. Hinton, Rate-coded restricted Boltzmann machines for face recognition, in: T. Leen, T. Dietterich, V. Tresp (Eds.), Advances in neural information processing system, vol. 13, 908–914, 2001.
  • Eyben et al. (2010b) F. Eyben, M. Wöllmer, B. Schuller, Opensmile: the Munich versatile and fast open-source audio feature extractor, in: Proceedings of the 18th ACM international conference on Multimedia, ACM, 1459–1462, 2010b.
  • Ji et al. (2013) S. Ji, W. Xu, M. Yang, K. Yu, 3D convolutional neural networks for human action recognition, IEEE transactions on pattern analysis and machine intelligence 35 (1) (2013) 221–231.
  • Majumder (2017) N. Majumder, Multimodal Sentiment Analysis in Social Media using Deep Learning with Convolutional Neural Networks, Master’s thesis, CIC, Instituto Politécnico Nacional, 2017.
  • Kingma and Ba (2014) D. P. Kingma, J. Ba, Adam: A Method for Stochastic Optimization, CoRR abs/1412.6980, URL
  • Zadeh et al. (2016) A. Zadeh, R. Zellers, E. Pincus, L.-P. Morency, Multimodal sentiment intensity analysis in videos: Facial gestures and verbal messages, IEEE Intelligent Systems 31 (6) (2016) 82–88.
  • Busso et al. (2008) C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, S. S. Narayanan, IEMOCAP: Interactive emotional dyadic motion capture database, Language resources and evaluation 42 (4) (2008) 335–359.
  • Pérez-Rosas et al. (2013) V. Pérez-Rosas, R. Mihalcea, L.-P. Morency, Utterance-Level Multimodal Sentiment Analysis, in: ACL (1), 973–982, 2013.