
Speech Emotion Recognition using Self-Supervised Features

by Edmilson Morais, et al.

Self-supervised pre-trained features have consistently delivered state-of-the-art results in the field of natural language processing (NLP); however, their merits in the field of speech emotion recognition (SER) still need further investigation. In this paper we introduce a modular End-to-End (E2E) SER system based on an Upstream + Downstream architecture paradigm, which allows easy use and integration of a large variety of self-supervised features. Several SER experiments for predicting categorical emotion classes from the IEMOCAP dataset are performed. These experiments investigate interactions among fine-tuning of self-supervised feature models, aggregation of frame-level features into utterance-level features, and back-end classification networks. The proposed monomodal, speech-only system not only achieves SOTA results, but also suggests that powerful, well fine-tuned self-supervised acoustic features can reach results comparable to those achieved by SOTA multimodal systems using both speech and text modalities.
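The Upstream + Downstream pipeline described in the abstract can be sketched in a few lines. The sketch below is purely illustrative and is not the authors' implementation: the upstream stage (which in the paper would be a pre-trained self-supervised model producing frame-level features) is replaced by a random projection, the aggregation step is simple mean pooling, and the downstream back-end is a toy linear classifier over hypothetical emotion labels.

```python
import numpy as np

rng = np.random.default_rng(0)

def upstream_features(waveform, dim=768, hop=320):
    """Stand-in upstream model: maps raw samples to frame-level features.

    A real system would use a pre-trained self-supervised encoder here;
    this random projection only illustrates the (T, dim) output shape.
    """
    n_frames = max(1, len(waveform) // hop)
    proj = rng.standard_normal((hop, dim)) / np.sqrt(hop)
    frames = waveform[: n_frames * hop].reshape(n_frames, hop)
    return frames @ proj

def aggregate(frame_feats):
    """Aggregate frame-level features into one utterance-level vector (mean pooling)."""
    return frame_feats.mean(axis=0)

def downstream_classify(utt_feat, weights,
                        labels=("neutral", "happy", "sad", "angry")):
    """Toy downstream back-end: linear layer + argmax over emotion classes."""
    logits = utt_feat @ weights
    return labels[int(np.argmax(logits))]

waveform = rng.standard_normal(16000)   # 1 s of synthetic audio at 16 kHz
feats = upstream_features(waveform)     # (50, 768) frame-level features
utt = aggregate(feats)                  # (768,) utterance-level feature
W = rng.standard_normal((768, 4))       # untrained classifier weights
emotion = downstream_classify(utt, W)
```

Because the stages are separate functions, the upstream encoder, the pooling strategy, and the back-end classifier can each be swapped independently, which is the modularity the paper's architecture paradigm is aiming for.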



