Attentive Modality Hopping Mechanism for Speech Emotion Recognition

11/29/2019
by Seunghyun Yoon, et al.

In this work, we explore the impact of the visual modality, in addition to speech and text, on the accuracy of an emotion detection system. Traditional approaches tackle this task by fusing the knowledge from the various modalities independently before performing emotion classification. In contrast, we introduce an attention mechanism to combine the information. We first apply a neural network to obtain hidden representations of each modality. The attention mechanism then selects and aggregates the important parts of the video data by conditioning on the audio and text data. The attention mechanism is applied again to attend to the important parts of the speech and text data, each conditioned on the other modalities. Experiments are performed on the standard IEMOCAP dataset using all three modalities (audio, text, and video). The results show a significant improvement of 3.65% in accuracy over the baseline system.
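
The hopping procedure described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation; it only mirrors the order of the steps: encode each modality, attend over video conditioned on audio and text, then attend over audio and text conditioned on the other modalities. Several details are assumptions made for illustration: the hidden states are random arrays standing in for encoder outputs, the initial audio and text summaries are the last hidden states, the conditioning vector is an element-wise sum of the other two modality summaries, the scoring function is a plain dot product, and the names `attend` and `modality_hop` are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(hidden, context):
    """Dot-product attention (an assumed scoring function).

    hidden:  (T, d) sequence of hidden states for one modality
    context: (d,)   conditioning vector built from the other modalities
    Returns the weighted sum of hidden states (d,) and the weights (T,).
    """
    weights = softmax(hidden @ context)
    return weights @ hidden, weights

def modality_hop(h_audio, h_text, h_video):
    """One pass of the hops described in the abstract (illustrative only)."""
    # Assumption: start from the last hidden state of audio and text.
    a, t = h_audio[-1], h_text[-1]
    # Hop 1: attend over video, conditioned on audio and text.
    v, w_video = attend(h_video, a + t)
    # Hop 2: attend over audio, conditioned on text and the attended video.
    a, w_audio = attend(h_audio, t + v)
    # Hop 3: attend over text, conditioned on the attended audio and video.
    t, w_text = attend(h_text, a + v)
    # Assumption: the classifier input is the concatenation of the summaries.
    fused = np.concatenate([a, t, v])
    return fused, (w_audio, w_text, w_video)

# Random stand-ins for encoder outputs: (time steps, hidden size).
rng = np.random.default_rng(0)
d = 8
h_audio = rng.standard_normal((20, d))
h_text = rng.standard_normal((12, d))
h_video = rng.standard_normal((16, d))

fused, (w_audio, w_text, w_video) = modality_hop(h_audio, h_text, h_video)
print(fused.shape)
```

In a full system, `fused` would be passed to a classification layer, and the encoders and attention parameters would be trained jointly; those parts are omitted here because the abstract does not specify them.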
