Learning Local to Global Feature Aggregation for Speech Emotion Recognition

06/02/2023
by Cheng Lu, et al.

Transformers have recently been adopted for speech emotion recognition (SER). However, their equal-size patch division not only damages frequency information but also ignores local emotion correlations across frames, which are key cues for representing emotion. To address this issue, we propose Local to Global Feature Aggregation learning (LGFA) for SER, which aggregates long-term emotion correlations at different scales, both inside frames and inside segments, while retaining the entire frequency information, to enhance the emotion discrimination of utterance-level speech features. To this end, we nest a Frame Transformer inside a Segment Transformer. First, the Frame Transformer is designed to capture local emotion correlations between frames and produce frame embeddings. Then, the frame embeddings and their corresponding segment features are aggregated as complementary information at different levels and fed into the Segment Transformer to learn utterance-level global emotion features. Experimental results show that LGFA outperforms state-of-the-art methods.
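The nested data flow described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes single-head self-attention with identity projections (no learned weights, positional encodings, feed-forward layers, or normalization), a mean over raw frames as the "segment feature", and element-wise addition as the aggregation step. The function names `self_attention`, `mean_pool`, and `lgfa_sketch` and the parameter `seg_len` are hypothetical. The one point taken directly from the abstract is that each frame keeps its full frequency axis rather than being cut into patches.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys, and values are the tokens themselves
    (identity projections), so there are no learned parameters.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out


def mean_pool(tokens):
    """Average a list of equal-length vectors into one vector."""
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]


def lgfa_sketch(spectrogram, seg_len):
    """Local-to-global aggregation over a (frames x frequency bins) spectrogram.

    Each frame keeps all of its frequency bins; frames are grouped into
    segments of `seg_len` frames (the last segment may be shorter).
    """
    segments = [spectrogram[i:i + seg_len] for i in range(0, len(spectrogram), seg_len)]
    segment_tokens = []
    for seg in segments:
        # Inner ("frame") stage: attention across frames inside one segment.
        frame_embeddings = self_attention(seg)
        local = mean_pool(frame_embeddings)
        # Assumed segment feature: mean of the raw frames in the segment.
        segment_feature = mean_pool(seg)
        # Assumed aggregation: element-wise sum of the two levels.
        segment_tokens.append([a + b for a, b in zip(local, segment_feature)])
    # Outer ("segment") stage: attention across segment tokens.
    global_tokens = self_attention(segment_tokens)
    # Utterance-level feature: mean over segment tokens.
    return mean_pool(global_tokens)


# Toy input: 8 frames with 4 frequency bins each, split into 2 segments.
toy_spectrogram = [[0.1 * (t + f) for f in range(4)] for t in range(8)]
utterance_feature = lgfa_sketch(toy_spectrogram, seg_len=4)
```

In a trained model, the identity projections and pooling choices above would be replaced by learned layers, and the utterance-level feature would feed an emotion classifier. The sketch only shows how frame-level and segment-level attention nest.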

research
08/28/2023

Time-Frequency Transformer: A Novel Time Frequency Joint Learning Method for Speech Emotion Recognition

In this paper, we propose a novel time-frequency joint learning method f...
research
11/09/2018

Integrating Recurrence Dynamics for Speech Emotion Recognition

We investigate the performance of features that can capture nonlinear re...
research
08/15/2020

Advancing Multiple Instance Learning with Attention Modeling for Categorical Speech Emotion Recognition

Categorical speech emotion recognition is typically performed as a seque...
research
03/08/2022

SpeechFormer: A Hierarchical Efficient Framework Incorporating the Characteristics of Speech

Transformer has obtained promising results on cognitive speech signal pr...
research
10/25/2018

Multi-Channel Auto-Encoder for Speech Emotion Recognition

Inferring emotion status from users' queries plays an important role to ...
research
03/03/2023

DWFormer: Dynamic Window transFormer for Speech Emotion Recognition

Speech emotion recognition is crucial to human-computer interaction. The...
research
10/26/2021

TNTC: two-stream network with transformer-based complementarity for gait-based emotion recognition

Recognizing the human emotion automatically from visual characteristics ...
