A vector quantized masked autoencoder for speech emotion recognition

04/21/2023
by Samir Sadok, et al.

Recent years have seen remarkable progress in speech emotion recognition (SER), thanks to advances in deep learning techniques. However, the limited availability of labeled data remains a significant challenge in the field. Self-supervised learning has recently emerged as a promising solution to address this challenge. In this paper, we propose the vector quantized masked autoencoder for speech (VQ-MAE-S), a self-supervised model that is fine-tuned to recognize emotions from speech signals. The VQ-MAE-S model is based on a masked autoencoder (MAE) that operates in the discrete latent space of a vector-quantized variational autoencoder. Experimental results show that the proposed VQ-MAE-S model, pre-trained on the VoxCeleb2 dataset and fine-tuned on emotional speech data, outperforms an MAE working on the raw spectrogram representation and other state-of-the-art methods in SER.
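The core idea of masking in a discrete latent space can be illustrated with a toy sketch. This is not the authors' implementation: the function names `quantize` and `mask_tokens`, the codebook size, the mask ratio, and the use of a reserved mask index are all assumptions chosen for illustration, and random vectors stand in for the output of a real VQ-VAE encoder. It only shows the two steps the abstract describes: mapping continuous latents to codebook indices, then hiding a subset of those indices so a model could be trained to predict them.

```python
import random

def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry."""
    tokens = []
    for z in latents:
        # Squared Euclidean distance from this latent to every code vector.
        dists = [sum((a - b) ** 2 for a, b in zip(z, code)) for code in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens

def mask_tokens(tokens, mask_ratio, mask_id, rng):
    """Replace a random subset of token indices with a reserved mask id."""
    n_masked = round(mask_ratio * len(tokens))
    masked_positions = set(rng.sample(range(len(tokens)), n_masked))
    corrupted = [mask_id if i in masked_positions else t
                 for i, t in enumerate(tokens)]
    return corrupted, masked_positions

rng = random.Random(0)
K, D, T = 8, 4, 10  # toy sizes (assumed): 8 codes, dimension 4, 10 time steps
codebook = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(K)]
# Random vectors stand in for VQ-VAE encoder outputs on a spectrogram.
latents = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(T)]

tokens = quantize(latents, codebook)
# Index K (one past the last code) is used as a hypothetical mask id.
corrupted, masked_positions = mask_tokens(tokens, mask_ratio=0.5, mask_id=K, rng=rng)
```

In a masked-autoencoder setup, a model would receive `corrupted` and be trained to recover the original entries of `tokens` at `masked_positions`. Working with indices rather than raw spectrogram values turns that reconstruction into a classification problem over the codebook.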

research
05/05/2023

A vector quantized masked autoencoder for audiovisual speech emotion recognition

While fully-supervised models have been shown to be effective for audiov...
research
04/08/2022

Transformer-Based Self-Supervised Learning for Emotion Recognition

In order to exploit representations of time-series signals, such as phys...
research
05/05/2023

A Multimodal Dynamical Variational Autoencoder for Audiovisual Speech Representation Learning

In this paper, we present a multimodal and dynamical VAE (MDVAE) applied...
research
07/23/2023

Self-Supervised Learning for Audio-Based Emotion Recognition

Emotion recognition models using audio input data can enable the develop...
research
10/09/2021

Arabic Speech Emotion Recognition Employing Wav2vec2.0 and HuBERT Based on BAVED Dataset

Recently, there have been tremendous research outcomes in the fields of ...
research
03/01/2022

Towards a Common Speech Analysis Engine

Recent innovations in self-supervised representation learning have led t...
research
03/28/2022

Continuous Metric Learning For Transferable Speech Emotion Recognition and Embedding Across Low-resource Languages

Speech emotion recognition (SER) refers to the technique of inferring th...
