Multimodal Speech Emotion Recognition using Cross Attention with Aligned Audio and Text

07/26/2022
by   Yoonhyung Lee, et al.

In this paper, we propose a novel speech emotion recognition model called Cross Attention Network (CAN) that uses aligned audio and text signals as inputs. It is inspired by the fact that humans recognize speech as a combination of simultaneously produced acoustic and textual signals. First, our method segments the audio and the underlying text signals into an equal number of steps in an aligned way, so that the same time step in each sequence covers the same time span of the utterance. Together with this technique, we apply cross attention to aggregate the sequential information from the aligned signals. In the cross attention, each modality is first aggregated independently by applying a global attention mechanism to it. Then, the attention weights of each modality are applied directly to the other modality in a crossed way, so that the CAN gathers the audio and text information from the same time steps based on each modality. In the experiments conducted on the standard IEMOCAP dataset, our model outperforms the state-of-the-art systems by 2.66
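The crossed aggregation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the dot-product scoring against the hypothetical query vectors `q_audio` and `q_text`, the function names, and the concatenation order of the output are all assumptions made for illustration. The only parts taken from the abstract are that both sequences have the same number of aligned steps, that each modality gets its own global attention weights, and that those weights are also applied to the other modality.

```python
import math


def softmax(scores):
    """Turn a list of scores into attention weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def weighted_sum(weights, steps):
    """Aggregate a sequence of feature vectors with one weight per step."""
    dim = len(steps[0])
    return [sum(w * step[i] for w, step in zip(weights, steps)) for i in range(dim)]


def cross_attention(audio, text, q_audio, q_text):
    """Aggregate aligned audio/text sequences with their own and each other's weights.

    audio, text: lists of per-step feature vectors with the same number of steps.
    q_audio, q_text: hypothetical query vectors used to score each step (assumed).
    """
    if len(audio) != len(text):
        raise ValueError("aligned inputs must have the same number of steps")

    # Global attention computed independently on each modality.
    alpha_audio = softmax([dot(step, q_audio) for step in audio])
    alpha_text = softmax([dot(step, q_text) for step in text])

    # Each set of weights is applied to its own modality and, crossed,
    # to the other modality at the same time steps.
    return (
        weighted_sum(alpha_audio, audio)    # audio, audio-based weights
        + weighted_sum(alpha_audio, text)   # text, audio-based weights
        + weighted_sum(alpha_text, text)    # text, text-based weights
        + weighted_sum(alpha_text, audio)   # audio, text-based weights
    )


# Two aligned steps, 2-dim audio features and 1-dim text features.
audio = [[1.0, 0.0], [3.0, 2.0]]
text = [[2.0], [4.0]]
fused = cross_attention(audio, text, q_audio=[0.0, 0.0], q_text=[0.0])
```

With all-zero query vectors every step receives the same weight, so each aggregate reduces to a plain average over time. The crossing only makes sense because the two sequences are aligned step by step, which is why the sketch rejects inputs of different lengths.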


