An Efficient End-to-End Transformer with Progressive Tri-modal Attention for Multi-modal Emotion Recognition

09/20/2022
by Yang Wu, et al.

Recent works on multi-modal emotion recognition have moved towards end-to-end models, which, unlike two-phase pipelines, can extract task-specific features under the supervision of the target task. However, previous methods only model the feature interactions between the textual modality and either the acoustic or the visual modality, failing to capture the interactions between the acoustic and visual modalities. In this paper, we propose the multi-modal end-to-end transformer (ME2ET), which effectively models the tri-modal feature interactions among the textual, acoustic, and visual modalities at both the low level and the high level. At the low level, we propose progressive tri-modal attention, which models the tri-modal feature interactions through a two-pass strategy and further leverages these interactions to significantly reduce computation and memory cost by shortening the input token sequences. At the high level, we introduce a tri-modal feature fusion layer to explicitly aggregate the semantic representations of the three modalities. Experimental results on the CMU-MOSEI and IEMOCAP datasets show that ME2ET achieves state-of-the-art performance. Further in-depth analysis demonstrates the effectiveness, efficiency, and interpretability of the proposed progressive tri-modal attention, which helps our model achieve better performance while significantly reducing computation and memory cost. Our code will be publicly available.
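The abstract does not spell out the attention mechanics, so the following is a minimal PyTorch sketch of the general idea it describes: a two-pass cross-modal attention that compresses a long token sequence into a few summary tokens (cutting later attention cost), plus a simple high-level layer that fuses the three modalities. All names here (CrossModalTokenReduction, TriModalFusion, num_queries, and so on) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class CrossModalTokenReduction(nn.Module):
    """Hypothetical sketch: pool a long token sequence into a few summary
    tokens, guided by another modality, so that downstream attention runs
    over K tokens instead of N (K << N)."""

    def __init__(self, dim: int, num_queries: int = 8, num_heads: int = 4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # tokens:  (B, N, D) long sequence (e.g. audio frames or visual patches)
        # context: (B, M, D) tokens from another modality guiding the pooling
        b = tokens.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)  # (B, K, D)
        # Pass 1: learned queries attend to the cross-modal context.
        q, _ = self.attn(q, context, context)
        # Pass 2: context-aware queries attend to the long sequence,
        # compressing N tokens into K summary tokens.
        reduced, _ = self.attn(q, tokens, tokens)
        return reduced  # (B, K, D)


class TriModalFusion(nn.Module):
    """Hypothetical high-level fusion: mean-pool each modality,
    concatenate the pooled vectors, and project to class logits."""

    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.proj = nn.Linear(3 * dim, num_classes)

    def forward(self, t: torch.Tensor, a: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # t, a, v: (B, *, D) token sets for text, audio, and vision
        pooled = torch.cat([x.mean(dim=1) for x in (t, a, v)], dim=-1)
        return self.proj(pooled)  # (B, num_classes)


# Toy usage: 400 audio tokens are reduced to 8 text-guided summary tokens
# before fusion, which is where the memory/computation savings come from.
t = torch.randn(2, 50, 256)
a = torch.randn(2, 400, 256)
v = torch.randn(2, 200, 256)
a_small = CrossModalTokenReduction(dim=256)(a, context=t)      # (2, 8, 256)
logits = TriModalFusion(dim=256, num_classes=6)(t, a_small, v)  # (2, 6)
```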

Related research

09/09/2020 · Multi-modal Attention for Speech Emotion Recognition
Emotion represents an essential aspect of human speech that is manifeste...

06/05/2022 · M2FNet: Multi-modal Fusion Network for Emotion Recognition in Conversation
Emotion Recognition in Conversations (ERC) is crucial in developing symp...

06/23/2023 · TACOformer: Token-channel compounded Cross Attention for Multimodal Emotion Recognition
Recently, emotion recognition based on physiological signals has emerged...

05/17/2020 · Multi-modal Automated Speech Scoring using Attention Fusion
In this study, we propose a novel multi-modal end-to-end neural approach...

04/06/2021 · Efficient emotion recognition using hyperdimensional computing with combinatorial channel encoding and cellular automata
In this paper, a hardware-optimized approach to emotion recognition base...

11/30/2021 · Multi-modal Text Recognition Networks: Interactive Enhancements between Visual and Semantic Features
Linguistic knowledge has brought great benefits to scene text recognitio...

02/26/2023 · Multi-Modality in Music: Predicting Emotion in Music from High-Level Audio Features and Lyrics
This paper aims to test whether a multi-modal approach for music emotion...
