Mingling or Misalignment? Temporal Shift for Speech Emotion Recognition with Pre-trained Representations

02/26/2023
by Siyuan Shen, et al.

Fueled by recent advances in self-supervised models, pre-trained speech representations have proved effective for the downstream speech emotion recognition (SER) task. Most prior work focuses on exploiting the pre-trained representations themselves and simply adopts a linear head on top of the pre-trained model, neglecting the design of the downstream network. In this paper, we propose a temporal shift module that mingles channel-wise information across time without introducing any additional parameters or FLOPs. With the temporal shift module, three designed baseline building blocks evolve into their corresponding shift variants, i.e., ShiftCNN, ShiftLSTM, and Shiftformer. Moreover, to balance the trade-off between mingling and misalignment, we propose two technical strategies: the placement of the shift and the proportion of channels shifted. The family of temporal shift models outperforms state-of-the-art methods on the benchmark IEMOCAP dataset under both fine-tuning and feature-extraction settings. Our code is available at https://github.com/ECNU-Cross-Innovation-Lab/ShiftSER.
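To make the shift operation concrete, the following is a minimal sketch of a temporal shift over a (batch, time, channels) feature tensor, written in PyTorch. The shift_ratio argument, the tensor layout, and the zero-padding of vacated time steps are illustrative assumptions rather than the paper's exact design; the sketch only shows how a proportion of channels can be shifted forward and backward in time while adding no parameters or FLOPs.

```python
import torch

def temporal_shift(x: torch.Tensor, shift_ratio: float = 0.25) -> torch.Tensor:
    """Shift a proportion of channels along the time axis.

    x: tensor of shape (batch, time, channels).
    A fraction of channels is shifted one step forward in time, an equal
    fraction one step backward, and the remaining channels are left
    untouched; vacated positions are zero-filled.
    """
    b, t, c = x.size()
    fold = int(c * shift_ratio)  # channels shifted in each direction (assumed setting)
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                   # shift forward in time
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]   # shift backward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # remaining channels unchanged
    return out
```

In this view, the proportion of shift corresponds to shift_ratio, and the placement of shift corresponds to where such an operation is inserted inside a CNN, LSTM, or Transformer block.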

Related research

01/30/2021
LSSED: a large-scale dataset and benchmark for speech emotion recognition
Speech emotion recognition is a vital contributor to the next generation...

06/08/2023
PEFT-SER: On the Use of Parameter Efficient Transfer Learning Approaches For Speech Emotion Recognition Using Pre-trained Speech Models
Many recent studies have focused on fine-tuning pre-trained models for s...

05/18/2023
TrustSER: On the Trustworthiness of Fine-tuning Pre-trained Speech Embeddings For Speech Emotion Recognition
Recent studies have explored the use of pre-trained embeddings for speec...

03/31/2022
CTA-RNN: Channel and Temporal-wise Attention RNN Leveraging Pre-trained ASR Embeddings for Speech Emotion Recognition
Previous research has looked into ways to improve speech emotion recogni...

11/18/2020
On the use of Self-supervised Pre-trained Acoustic and Linguistic Features for Continuous Speech Emotion Recognition
Pre-training for feature extraction is an increasingly studied approach ...

11/03/2022
Speech-based emotion recognition with self-supervised models using attentive channel-wise correlations and label smoothing
When recognizing emotions from speech, we encounter two common problems:...

03/18/2023
Multimodal Continuous Emotion Recognition: A Technical Report for ABAW5
We used two multimodal models for continuous valence-arousal recognition...
