Multimodal sentiment analysis has gained significant traction in recent years, owing to its ability to capture the opinions expressed in the growing number of videos available on open platforms such as YouTube, Facebook, and Vimeo. This is important because more and more enterprises base business decisions on the user sentiment toward their products expressed in these videos.
Multimodal fusion is considered a key step in multimodal sentiment analysis. Most recent work on multimodal fusion poria-EtAl:2017:Long; AAAI1817390 has focused on obtaining a multimodal representation from the independent unimodal representations. Our approach takes this strategy one step further by also requiring that the original unimodal representations be reconstructible from the unified multimodal representation. The motivation is the intuition that the different modalities are expressions of a single state of mind. Hence, if we assume that the fused representation is this mind-state (sentiment or emotion), then our approach ensures that the fused representation can be mapped back to the unimodal representations, which should improve the quality of the multimodal representation. In this paper, we argue empirically that this is the case by showing that this approach outperforms the state-of-the-art in multimodal fusion.
We employ a variational autoencoder (VAE) DBLP:journals/corr/KingmaW13, where the encoder network generates a latent representation from the unimodal representations, and the decoder network reconstructs the original unimodal representations from this latent representation. The latent representation is treated as the multimodal representation for the final classification.
2 Related Work
rozgic2012ensemble and wollmer2013youtube were the first to fuse acoustic, visual, and text modalities for sentiment and emotion detection. Later, poria2015deep employed CNN and multi-kernel learning for multimodal sentiment analysis. Further, poria-EtAl:2017:Long used long short-term memory (LSTM) to enable context-dependent multimodal fusion, where the surrounding utterances are taken into account for context.
Recently, for the context-free setting, where the surrounding utterances are not used as context, zadeh-EtAl:2017:EMNLP2017 used tensor outer products to model intra- and inter-modal interactions. AAAI1817341 used multi-view learning for utterance-level multimodal fusion. Further, AAAI1817390 employed hybrid LSTM memory components to model intra-modal and cross-modal interactions.
Humans usually express their thoughts through three perceivable modalities: textual (speech), acoustic (pitch and other properties of the voice), and visual (facial expression). Whereas most recent work on multimodal fusion treats these unimodal representations independently and employs an encoder network to obtain the fused representation vector, we go one step further by decoding the fused multimodal representation into the original unimodal representations.
First, the utterance-level unimodal features are extracted independently. Then, the modality features are fed to an encoder network to sample the fused representation. Finally, the fused representation is decoded back to the unimodal representations to ensure the fidelity of the fused representation. This setup is essentially an autoencoder. Specifically, we employ a variational autoencoder (VAE) DBLP:journals/corr/KingmaW13, as depicted in Fig. 1, where the latent representation is used as the fused representation.
3.1 Unimodal Feature Extraction
Textual ($f_t$), visual ($f_v$), and acoustic ($f_a$) features are extracted using CNN, 3D-CNN tran2015learning, and openSMILE eyben2015opensmile, respectively, following the methodology described by poria-EtAl:2017:Long.
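The encoder takes the concatenation of the unimodal features of an utterance, $f = f_t \oplus f_a \oplus f_v$, as input, where $f_t$ is the textual feature of size $D_t$, $f_a$ is the acoustic feature of size $D_a$, and $f_v$ is the visual feature of size $D_v$, and infers the latent multimodal representation $z$ of size $D_z$ from the posterior distribution $p_\theta(z \mid f)$, such that

$$p_\theta(z \mid f) = \frac{p_\theta(f \mid z)\, p(z)}{p_\theta(f)}.$$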
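Since the true posterior distribution $p_\theta(z \mid f)$ is intractable, the input $f$ is fed through two fully-connected layers to generate the mean ($\mu$) and standard deviation ($\sigma$) of the approximate posterior normal distribution $q_\phi(z \mid f)$, from which the latent representation $z$ is inferred:

$$q_\phi(z \mid f) = \mathcal{N}\big(z;\, \mu,\, \operatorname{diag}(\sigma^2)\big), \qquad \mu, \sigma \in \mathbb{R}^{D_z},$$

where the weights and biases of the two fully-connected layers constitute the encoder parameters $\phi$.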
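To make the data flow of the encoder concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the ReLU hidden activation, the softplus used to keep the standard deviation positive, the layer sizes, the random initialization, and the helper names (`dense`, `init_layer`, `encode`) are all assumptions made for illustration.

```python
import math
import random


def dense(weights, bias, x):
    """Fully-connected layer: W x + b, with W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(v):
    return [max(0.0, a) for a in v]


def softplus(v):
    # Numerically stable log(1 + exp(a)); assumed here so that sigma > 0.
    return [max(a, 0.0) + math.log1p(math.exp(-abs(a))) for a in v]


def init_layer(rng, n_out, n_in, scale=0.1):
    """Hypothetical initializer: small Gaussian weights, zero biases."""
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def encode(params, f):
    """Map the concatenated features f to (mu, sigma) of q(z | f)."""
    hidden = relu(dense(*params["hidden"], f))         # first layer (assumed ReLU)
    mu = dense(*params["mu"], hidden)                  # mean head
    sigma = softplus(dense(*params["sigma"], hidden))  # std head (assumed softplus)
    return mu, sigma


rng = random.Random(0)
D_T, D_A, D_V, D_H, D_Z = 4, 3, 2, 5, 2  # illustrative sizes only
params = {
    "hidden": init_layer(rng, D_H, D_T + D_A + D_V),
    "mu": init_layer(rng, D_Z, D_H),
    "sigma": init_layer(rng, D_Z, D_H),
}

f_t = [rng.random() for _ in range(D_T)]  # stand-in textual feature
f_a = [rng.random() for _ in range(D_A)]  # stand-in acoustic feature
f_v = [rng.random() for _ in range(D_V)]  # stand-in visual feature
f = f_t + f_a + f_v                       # list concatenation plays the role of fusion input

mu, sigma = encode(params, f)
```

Both outputs have the size of the latent representation, and the softplus head returns positive standard deviations for the moderate pre-activations produced here.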
Sampling Latent (Multimodal) Representation
The latent representation $z$ is sampled using the reparameterization trick DBLP:journals/corr/KingmaW13 to facilitate backpropagation:

$$z = \mu + \sigma \odot \epsilon,$$

where $\epsilon \sim \mathcal{N}(0, I)$, $\epsilon \in \mathbb{R}^{D_z}$, and $\odot$ represents the Hadamard product. This $z$ is taken as the fused multimodal representation.
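The reparameterization trick can be illustrated with a short sketch: the randomness is drawn from a fixed standard normal, and the sample is then a deterministic function of the mean and standard deviation, so gradients can flow through both. The function name `sample_latent` and the numeric values below are hypothetical.

```python
import random


def sample_latent(mu, sigma, rng):
    """Return z = mu + sigma * eps element-wise, with eps ~ N(0, I)."""
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]


rng = random.Random(0)
mu = [0.5, -1.0, 2.0]     # illustrative mean
sigma = [0.1, 0.2, 0.3]   # illustrative standard deviation

z = sample_latent(mu, sigma, rng)

# With zero standard deviation the sample collapses onto the mean.
z_det = sample_latent(mu, [0.0, 0.0, 0.0], rng)

# Averaging many samples of one coordinate recovers its mean.
draws = [sample_latent(mu, sigma, rng)[0] for _ in range(20000)]
empirical_mean = sum(draws) / len(draws)
```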
The decoder reconstructs the input $f$ as $\hat{f}$ from the latent representation $z$ with two fully-connected layers, whose weights and biases constitute the decoder parameters $\theta$ of the likelihood $p_\theta(f \mid z)$.
We tried two different classification networks:
Logistic Regression (LR)
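We feed the fused representation $z$ (of size $D_z$) to a fully-connected layer with softmax activation:

$$\mathcal{P} = \operatorname{softmax}(W_{cls}\, z + b_{cls}), \qquad \hat{y} = \operatorname*{argmax}_{i}\, \mathcal{P}[i],$$

where $W_{cls} \in \mathbb{R}^{C \times D_z}$, $b_{cls} \in \mathbb{R}^{C}$, $\mathcal{P} \in \mathbb{R}^{C}$ is the vector of class probabilities, $\hat{y}$ is the predicted class, and $C$ is the number of classes ($C = 2$ in our case).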
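As an illustration of this classifier, the sketch below applies a single fully-connected layer followed by a softmax to a fused representation and takes the most probable class. The weights, bias, and input are made-up values, and the names `softmax` and `classify` are hypothetical helpers rather than part of any released code.

```python
import math


def softmax(logits):
    """Softmax with the usual max-shift for numerical stability."""
    shift = max(logits)
    exps = [math.exp(a - shift) for a in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(weights, bias, z):
    """Return (class probabilities, index of the most probable class)."""
    logits = [sum(w * zi for w, zi in zip(row, z)) + b
              for row, b in zip(weights, bias)]
    probs = softmax(logits)
    predicted = max(range(len(probs)), key=probs.__getitem__)
    return probs, predicted


W_cls = [[1.0, -1.0],   # illustrative weights for class 0
         [-1.0, 1.0]]   # illustrative weights for class 1
b_cls = [0.0, 0.0]
z = [2.0, 0.5]          # illustrative fused representation

probs, y_hat = classify(W_cls, b_cls, z)
```

Here the logits are 1.5 and -1.5, so the first class receives most of the probability mass.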
Context-Dependent Classifier (bc-LSTM poria-EtAl:2017:Long)
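The sequence of fused utterance representations ($z_1, z_2, \dots, z_n$) in a video is fed to a bidirectional LSTM hochreiter1997long of size $D_l$, following poria-EtAl:2017:Long, for context propagation. The output of the LSTM is then fed to a fully-connected layer with softmax activation for classification:

$$\begin{aligned}
Z &= [z_1, z_2, \dots, z_n], \\
H &= \operatorname{bi\text{-}LSTM}(Z), \\
\mathcal{P}_t &= \operatorname{softmax}(W_{cls}\, h_t + b_{cls}), \\
\hat{y}_t &= \operatorname*{argmax}_{i}\, \mathcal{P}_t[i],
\end{aligned}$$

where $Z$ is the sequence of fused utterance representations in a video with $n$ utterances, $H = [h_1, h_2, \dots, h_n]$ is the sequence of context-dependent fused representations of the utterances ($h_t \in \mathbb{R}^{2D_l}$), $W_{cls} \in \mathbb{R}^{C \times 2D_l}$, $b_{cls} \in \mathbb{R}^{C}$, $\mathcal{P}_t$ is the vector of class probabilities for utterance $t$, $\hat{y}_t$ is the predicted class for utterance $t$, and $C$ is the number of classes (e.g., $C = 2$ for the MOSI dataset (Section 4.1)).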
Latent Representation Inference
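Following DBLP:journals/corr/KingmaW13, the approximate posterior distribution $q_\phi(z \mid f)$ is tuned close to the true posterior $p_\theta(z \mid f)$ by maximizing the evidence lower bound (ELBO), where

$$\log p_\theta(f) \geq \mathrm{ELBO} = \mathbb{E}_{q_\phi(z \mid f)}\big[\log p_\theta(f \mid z)\big] - \mathrm{KL}\big[q_\phi(z \mid f) \,\|\, p(z)\big].$$

The first term of the ELBO, $\mathbb{E}_{q_\phi(z \mid f)}[\log p_\theta(f \mid z)]$, corresponds to the reconstruction loss of the input $f$. The second term, $\mathrm{KL}[q_\phi(z \mid f) \,\|\, p(z)]$, pushes the approximate posterior $q_\phi(z \mid f)$ close to the prior $p(z)$ by minimizing the KL-divergence between them.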
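For intuition, both ELBO terms take simple closed forms under two common modelling choices, which are assumptions here rather than details stated in the text: a standard normal prior $p(z) = \mathcal{N}(0, I)$ with a diagonal Gaussian approximate posterior, and a Gaussian likelihood with fixed isotropic variance $s^2$ (a symbol introduced only for this illustration). Under these assumptions the KL term is analytic and the reconstruction term reduces to a squared error between the input $f$ and its reconstruction $\hat{f}$:

```latex
\begin{aligned}
\mathrm{KL}\big[\mathcal{N}(\mu, \operatorname{diag}(\sigma^2)) \,\|\, \mathcal{N}(0, I)\big]
  &= \frac{1}{2} \sum_{j=1}^{D_z} \big(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\big), \\
\log p_\theta(f \mid z)
  &= -\frac{1}{2 s^2}\, \lVert f - \hat{f} \rVert_2^2 + \text{const}.
\end{aligned}
```

The KL term vanishes exactly when $\mu = 0$ and $\sigma = 1$ in every latent dimension, that is, when the approximate posterior coincides with the prior.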
To train the sentiment classifier (Section 3.4), we minimize the categorical cross-entropy ($E$), defined as

$$E = -\frac{1}{N} \sum_{i=1}^{N} \log \mathcal{P}_i[y_i],$$

where $N$ is the number of samples, $\mathcal{P}_i$ is the probability distribution for sample $i$ over the classes (in our experiments we use two classes: positive and negative), and $y_i$ is the target class for sample $i$.
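The categorical cross-entropy can be computed directly from the predicted class probabilities and the target classes, as in the following small sketch. The probabilities and labels are invented for illustration, and `categorical_cross_entropy` is a hypothetical helper name.

```python
import math


def categorical_cross_entropy(probabilities, targets):
    """Mean negative log-probability assigned to each sample's target class."""
    total = -sum(math.log(p[y]) for p, y in zip(probabilities, targets))
    return total / len(targets)


# Three samples, two classes (index 0 = negative, 1 = positive; illustrative).
P = [[0.9, 0.1],
     [0.2, 0.8],
     [0.6, 0.4]]
y = [0, 1, 1]

loss = categorical_cross_entropy(P, y)
```

The third sample, whose target class receives a probability of only 0.4, contributes the largest share of the loss.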
4 Experimental Settings
We evaluate the quality of the multimodal features extracted by the VAE (implementation available at https://github.com/xxxx/xxxx/; to be released upon acceptance) using two classifiers (Section 3.4). Accordingly, the two variants are named VAE+LR and VAE+bc-LSTM in Table 2.
We evaluate our approach on three different datasets.
The MOSI dataset contains videos of 89 people reviewing various topics in English. The videos are segmented into utterances, and each utterance is annotated with a sentiment tag (positive/negative). Our train/test splits of the dataset are completely disjoint with respect to speakers. In particular, 1447 and 752 utterances are used for training and testing, respectively.
The MOSEI dataset contains 22676 utterances from 3229 videos crawled from YouTube, with 1000 unique speakers. The videos mostly consist of product and movie reviews. We used 16188, 1874, and 4614 utterances as the training, validation, and test folds, respectively. Each utterance is labeled with one of the positive, negative, and neutral sentiment categories.
IEMOCAP contains two-way conversations among ten speakers, segmented into utterances. Each utterance is tagged with one of six emotion labels: anger, happy, sad, neutral, excited, and frustrated. The first eight speakers, from sessions one to four, belong to the training set and the rest to the test set.
4.2 Baseline Methods
Logistic Regression (LR)
The concatenation of the utterance-level unimodal representations is fed to the LR classifier described in Section 3.4. It does not use the neighbouring utterances as context.
bc-LSTM poria-EtAl:2017:Long
The concatenation of the utterance-level unimodal representations is sequentially fed to the bc-LSTM classifier described in Section 3.4. This is the state-of-the-art method.
TFN zadeh-EtAl:2017:EMNLP2017
This network models both intra-modal and inter-modal interactions through outer product. It does not use the neighbouring utterances as context.
MFN AAAI1817341
This network exploits multi-view learning to fuse modalities. It also does not use neighbouring utterances as context.
MARN AAAI1817390
In this model, the intra-modal and cross-modal interactions are modeled with a hybrid LSTM memory component.
5 Results and Discussion
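[Table 2: results of the trimodal fusion methods; the caption notes a statistically significant improvement (paired t-test) over bc-LSTM.]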
Table 2 shows that our VAE-based methods, namely VAE+LR and VAE+bc-LSTM, consistently outperform their concatenation-fusion counterparts, LR and bc-LSTM, on all three datasets. Specifically, our context-dependent model, VAE+bc-LSTM, outperforms the context-dependent state-of-the-art method bc-LSTM on all the datasets, by 3.1% on average. Moreover, our context-free model, VAE+LR, outperforms the other context-free models, namely MFN, MARN, TFN, and LR, on all datasets, by 1.5% on average. Also, owing to the contextual information, VAE+bc-LSTM outperforms VAE+LR by 3.1% on average.
We attribute these gains to the superior multimodal representation learned by the VAE, which retains enough information from the unimodal representations to allow their reconstruction and therefore gives the classifier a more informative input. (The supplementary material compares visualizations of the fused representations.)
5.1 Case Study
Comparing the predictions of our model to those of the baselines reveals that our model is better equipped to catch instances where non-verbal cues are essential for classification. For instance, the IEMOCAP utterance “I still can’t live on in six seven and five. It’s not possible in Los Angeles. Housing is too expensive.” is misclassified as excited by bc-LSTM, whereas VAE+bc-LSTM correctly classifies it as angry. We posit that in this case bc-LSTM is confused by the emotionally ambiguous textual modality, whereas VAE+bc-LSTM taps into the visual modality and picks up the frown on the speaker’s face to make the correct classification. We observed several similar cases where VAE+bc-LSTM or VAE+LR classifies correctly based on non-verbal cues, whereas their non-VAE counterparts do not.
“No. I am just making myself fascinating for you.” is a response to the question “you going out somewhere, dear?”. This is a sarcastic response. VAE+bc-LSTM incorrectly predicted the emotion as excited, whereas the ground truth is angry. We suspect that our model’s failure to identify sarcasm with the aid of multimodality led to this misclassification.
In this paper, we presented a comprehensive fusion strategy based on a VAE, which outperforms previous methods by a significant margin. The encoder and decoder networks in the VAE are simple fully-connected layers. We plan to improve the performance of our method by employing more sophisticated networks, such as the fusion networks MFN and TFN, as the encoder.