Explainable Multimodal Emotion Reasoning

by Zheng Lian, et al.

Multimodal emotion recognition is an active research topic in artificial intelligence. Its primary objective is to integrate multiple modalities (such as acoustic, visual, and lexical cues) to identify human emotional states. Current works generally assume accurate emotion labels for benchmark datasets and focus on developing more effective architectures. However, due to the inherent subjectivity of emotions, existing datasets often lack high annotation consistency, resulting in potentially inaccurate labels. Consequently, models built on these datasets may struggle to meet the demands of practical applications. To address this issue, it is crucial to enhance the reliability of emotion annotations. In this paper, we propose a novel task called “Explainable Multimodal Emotion Reasoning (EMER)”. In contrast to previous works that primarily focus on predicting emotions, EMER takes a step further by providing explanations for these predictions. A prediction is considered correct as long as the reasoning process behind the predicted emotion is plausible. This paper presents our initial efforts on EMER: we introduce a benchmark dataset, establish baseline models, and define evaluation metrics. We also observe that EMER requires integrating multi-faceted capabilities. Therefore, we propose the first multimodal large language model (LLM) in affective computing, called AffectGPT. We aim to tackle the long-standing challenge of label ambiguity and chart a path toward more reliable techniques. Furthermore, EMER offers an opportunity to evaluate the audio-video-text understanding capabilities of recent multimodal LLMs. To facilitate further research, we make the code and data available at: https://github.com/zeroQiaoba/AffectGPT.
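Since EMER scores a prediction by the plausibility of its reasoning rather than an exact label match, evaluation naturally allows partial credit. As a minimal illustrative sketch (not the paper's official metric), one plausible scoring scheme compares the set of emotion labels extracted from a model's explanation against the annotated labels; the function name `emotion_overlap` and the example labels below are hypothetical:

```python
# Illustrative sketch, not the paper's official EMER metric: score a
# prediction by the overlap between emotion labels mentioned in the
# model's explanation and the annotated labels (Jaccard similarity),
# giving partial credit instead of a hard right/wrong decision.

def emotion_overlap(predicted: set, annotated: set) -> float:
    """Return |intersection| / |union|; 0.0 when both sets are empty."""
    if not predicted and not annotated:
        return 0.0
    return len(predicted & annotated) / len(predicted | annotated)

# Hypothetical example: the explanation mentions "sad" and "frustrated",
# while annotators marked "sad" and "angry" -> one shared label out of
# three distinct labels.
score = emotion_overlap({"sad", "frustrated"}, {"sad", "angry"})
print(round(score, 3))  # prints 0.333
```

Set-based overlap is one simple choice here; a fuller treatment would also need to judge whether the explanation itself is plausible, which is where multimodal LLMs come in.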




