DeepAI AI Chat
Log In Sign Up

A cross-modal fusion network based on self-attention and residual structure for multimodal emotion recognition

by   Ziwang Fu, et al.

The audio-video based multimodal emotion recognition has attracted a lot of attention due to its robust performance. Most of the existing methods focus on proposing different cross-modal fusion strategies. However, these strategies introduce redundancy in the features of different modalities without fully considering the complementary properties between modal information, and these approaches do not guarantee the non-loss of original semantic information during intra- and inter-modal interactions. In this paper, we propose a novel cross-modal fusion network based on self-attention and residual structure (CFN-SR) for multimodal emotion recognition. Firstly, we perform representation learning for audio and video modalities to obtain the semantic features of the two modalities by efficient ResNeXt and 1D CNN, respectively. Secondly, we feed the features of the two modalities into the cross-modal blocks separately to ensure efficient complementarity and completeness of information through the self-attention mechanism and residual structure. Finally, we obtain the output of emotions by splicing the obtained fused representation with the original representation. To verify the effectiveness of the proposed method, we conduct experiments on the RAVDESS dataset. The experimental results show that the proposed CFN-SR achieves the state-of-the-art and obtains 75.76 26.30M parameters. Our code is available at


page 1

page 2

page 3

page 4


Is Cross-Attention Preferable to Self-Attention for Multi-Modal Emotion Recognition?

Humans express their emotions via facial expressions, voice intonation a...

A Unified Multimodal De- and Re-coupling Framework for RGB-D Motion Recognition

Motion recognition is a promising direction in computer vision, but the ...

A Multimodal Approach for Dementia Detection from Spontaneous Speech with Tensor Fusion Layer

Alzheimer's disease (AD) is a progressive neurological disorder, meaning...

Multimodal Learning using Optimal Transport for Sarcasm and Humor Detection

Multimodal learning is an emerging yet challenging research area. In thi...

AttendAffectNet: Self-Attention based Networks for Predicting Affective Responses from Movies

In this work, we propose different variants of the self-attention based ...