
Hybrid Fusion Based Interpretable Multimodal Emotion Recognition with Insufficient Labelled Data

by   Puneet Kumar, et al.
IIT Roorkee

This paper proposes VIsual Spoken Textual Additive Net (VISTA Net), a multimodal emotion recognition system that classifies the emotion conveyed by a multimodal input containing image, speech, and text into discrete classes. A new interpretability technique, K-Average Additive exPlanation (KAAP), has also been developed to identify the important visual, spoken, and textual features that lead to predicting a particular emotion class. VISTA Net fuses information from the image, speech, and text modalities using a hybrid of early and late fusion: it automatically adjusts the weights of the modalities' intermediate outputs while computing their weighted average, without human intervention. The KAAP technique computes the contribution of each modality, and of the corresponding features, toward predicting a particular emotion class. To mitigate the scarcity of multimodal emotion datasets labeled with discrete emotion classes, we have constructed the large-scale IIT-R MMEmoRec dataset, consisting of real-life images, corresponding speech and text, and emotion labels ('angry,' 'happy,' 'hate,' and 'sad'). VISTA Net achieves 95.99% accuracy when considering the image, speech, and text modalities together, which is better than its performance on inputs from any one or two modalities.
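The late-fusion step described above — a weighted average of per-modality intermediate outputs whose weights the model adjusts on its own — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the logits, the `fusion_params` values, and the function names are hypothetical, and in practice the unconstrained fusion parameters would be learned by backpropagation alongside the rest of the network.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def weighted_fusion(logits_by_modality, fusion_params):
    """Fuse per-modality class logits with trainable weights.

    logits_by_modality: (M, C) array, one row of class logits per modality.
    fusion_params: (M,) unconstrained scalars; softmax maps them to weights
    that sum to 1, so the network can adjust the modality weights during
    training without human intervention.
    """
    weights = softmax(fusion_params)      # (M,), non-negative, sums to 1
    fused = weights @ logits_by_modality  # (C,) weighted average of logits
    return fused, weights

# Hypothetical logits for image, speech, and text over the four classes
# ('angry', 'happy', 'hate', 'sad').
logits = np.array([
    [2.0, 0.5, 0.1, 0.2],   # image branch
    [1.5, 0.7, 0.3, 0.1],   # speech branch
    [0.9, 2.2, 0.4, 0.3],   # text branch
])
params = np.array([0.2, -0.1, 0.5])  # would be learned, not hand-set
fused_logits, modality_weights = weighted_fusion(logits, params)
predicted_class = int(np.argmax(fused_logits))
```

Normalizing the fusion parameters with a softmax keeps the combination a proper weighted average, so the learned `modality_weights` can also be read directly as each modality's relative influence on the prediction.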
