LMR-CBT: Learning Modality-fused Representations with CB-Transformer for Multimodal Emotion Recognition from Unaligned Multimodal Sequences

12/03/2021
by Ziwang Fu, et al.

Learning modality-fused representations and processing unaligned multimodal sequences are meaningful and challenging problems in multimodal emotion recognition. Existing approaches use directional pairwise attention or a message hub to fuse the language, visual, and audio modalities. However, these approaches introduce information redundancy when fusing features and are inefficient because they ignore the complementarity of the modalities. In this paper, we propose an efficient neural network that learns modality-fused representations with a CB-Transformer (LMR-CBT) for multimodal emotion recognition from unaligned multimodal sequences. Specifically, we first perform feature extraction on each of the three modalities to capture the local structure of the sequences. We then design a novel Transformer with cross-modal blocks (CB-Transformer) that enables complementary learning across modalities and consists of three parts: local temporal learning, cross-modal feature fusion, and global self-attention representation. In addition, we concatenate the fused features with the original features to classify the emotion of each sequence. Finally, we conduct word-aligned and unaligned experiments on three challenging datasets: IEMOCAP, CMU-MOSI, and CMU-MOSEI. The experimental results demonstrate the superiority and efficiency of our method in both settings. Compared with mainstream methods, our approach reaches the state of the art with the fewest parameters.
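For concreteness, a minimal PyTorch sketch of the pipeline the abstract describes is given below: per-modality 1D convolutions for local structure, shallow per-modality Transformers for local temporal learning, cross-modal attention blocks for fusion, global self-attention over the fused stream, and a classifier over the fused features concatenated with the originals. Every dimension, the choice of language as the fusion target, the specific cross-attention formulation, and the mean pooling are illustrative assumptions; the abstract does not specify these details.

import torch
import torch.nn as nn


class CrossModalBlock(nn.Module):
    # One plausible reading of the paper's "cross-modal blocks": the target
    # modality attends to a source modality, with a residual connection.
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, target, source):
        fused, _ = self.attn(query=target, key=source, value=source)
        return self.norm(target + fused)


class LMRCBTSketch(nn.Module):
    # dims: hypothetical input feature sizes for (language, visual, audio).
    def __init__(self, dims=(300, 35, 74), model_dim=40, num_classes=4):
        super().__init__()
        # 1) Per-modality 1D convolutions capture local sequence structure.
        self.proj = nn.ModuleList(
            [nn.Conv1d(d, model_dim, kernel_size=3, padding=1) for d in dims]
        )
        # 2) Local temporal learning: a shallow Transformer per modality
        #    (nn.TransformerEncoder deep-copies the layer, so no weight sharing).
        layer = nn.TransformerEncoderLayer(model_dim, nhead=4, batch_first=True)
        self.local = nn.ModuleList(
            [nn.TransformerEncoder(layer, num_layers=1) for _ in dims]
        )
        # 3) Cross-modal feature fusion: language attends to visual, then audio
        #    (the pairing is an assumption; the abstract does not specify it).
        self.fuse_v = CrossModalBlock(model_dim)
        self.fuse_a = CrossModalBlock(model_dim)
        # 4) Global self-attention over the fused representation.
        self.global_attn = nn.TransformerEncoder(layer, num_layers=1)
        # 5) Classify fused features concatenated with the original features.
        self.head = nn.Linear(model_dim * 4, num_classes)

    def forward(self, lang, visual, audio):
        # Each input: (batch, seq_len, feat_dim); unaligned sequences may have
        # different lengths per modality.
        l, v, a = [
            enc(conv(x.transpose(1, 2)).transpose(1, 2))
            for conv, enc, x in zip(self.proj, self.local, (lang, visual, audio))
        ]
        fused = self.global_attn(self.fuse_a(self.fuse_v(l, v), a))
        # Concatenate the fused features with the originals after mean pooling
        # (the pooling choice is an assumption).
        pooled = torch.cat([t.mean(dim=1) for t in (fused, l, v, a)], dim=-1)
        return self.head(pooled)


model = LMRCBTSketch()
logits = model(torch.randn(2, 50, 300),   # language
               torch.randn(2, 60, 35),    # visual
               torch.randn(2, 70, 74))    # audio
print(logits.shape)  # torch.Size([2, 4])

As a usage note, the sequence lengths in the example above deliberately differ across modalities, since cross-attention handles unaligned inputs without a word-level alignment step.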


