A vector quantized masked autoencoder for audiovisual speech emotion recognition

05/05/2023
by   Samir Sadok, et al.
0

While fully-supervised models have been shown to be effective for audiovisual speech emotion recognition (SER), the limited availability of labeled data remains a major challenge in the field. To address this issue, self-supervised learning approaches, such as masked autoencoders (MAEs), have gained popularity as potential solutions. In this paper, we propose the VQ-MAE-AV model, a vector quantized MAE specifically designed for audiovisual speech self-supervised representation learning. Unlike existing multimodal MAEs that rely on the processing of the raw audiovisual speech data, the proposed method employs a self-supervised paradigm based on discrete audio and visual speech representations learned by two pre-trained vector quantized variational autoencoders. Experimental results show that the proposed approach, which is pre-trained on the VoxCeleb2 database and fine-tuned on standard emotional audiovisual speech datasets, outperforms the state-of-the-art audiovisual SER methods.

READ FULL TEXT

page 3

page 6

research
04/21/2023

A vector quantized masked autoencoder for speech emotion recognition

Recent years have seen remarkable progress in speech emotion recognition...
research
08/31/2023

RAMP: Retrieval-Augmented MOS Prediction via Confidence-based Dynamic Weighting

Automatic Mean Opinion Score (MOS) prediction is crucial to evaluate the...
research
03/29/2023

Transformer-based Self-supervised Multimodal Representation Learning for Wearable Emotion Recognition

Recently, wearable emotion recognition based on peripheral physiological...
research
11/18/2020

On the use of Self-supervised Pre-trained Acoustic and Linguistic Features for Continuous Speech Emotion Recognition

Pre-training for feature extraction is an increasingly studied approach ...
research
03/01/2022

Towards a Common Speech Analysis Engine

Recent innovations in self-supervised representation learning have led t...
research
06/03/2020

A Convolutional Deep Markov Model for Unsupervised Speech Representation Learning

Probabilistic Latent Variable Models (LVMs) provide an alternative to se...
research
01/02/2023

ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders

Driven by improved architectures and better representation learning fram...

Please sign up or login with your details

Forgot password? Click here to reset