Speech Emotion: Investigating Model Representations, Multi-Task Learning and Knowledge Distillation

07/02/2022
by   Vikramjit Mitra, et al.

Estimating dimensional emotions, such as activation, valence and dominance, from acoustic speech signals has been widely explored over the past few years. While accurate estimation of activation and dominance from speech seems to be possible, estimating valence remains challenging. Previous research has shown that the use of lexical information can improve valence estimation performance. Lexical information can be obtained from pre-trained acoustic models, where the learned representations can improve valence estimation from speech. We investigate the use of pre-trained model representations to improve valence estimation from the acoustic speech signal. We also explore fusion of representations to improve emotion estimation across all three emotion dimensions: activation, valence and dominance. Additionally, we investigate whether representations from pre-trained models can be distilled into models trained with low-level features, resulting in models with fewer parameters. We show that fusion of pre-trained model embeddings results in a 79% improvement in concordance correlation coefficient (CCC) on valence estimation compared to a standard acoustic feature baseline (mel-filterbank energies), while distillation from pre-trained model embeddings to lower-dimensional representations yielded a relative 12% improvement. Both gains were observed over two evaluation sets, indicating that our proposed architecture generalizes across those evaluation sets. We report new state-of-the-art "text-free" acoustic-only dimensional emotion estimation CCC values on two MSP-Podcast evaluation sets.
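The abstract evaluates models with the concordance correlation coefficient (CCC), which penalizes both low correlation and systematic bias between predicted and reference emotion scores. A minimal NumPy sketch of the standard CCC definition (the function name and array inputs here are illustrative, not from the paper):

```python
import numpy as np

def concordance_cc(pred, target):
    """Concordance correlation coefficient between predictions and targets.

    CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2)
    Ranges in [-1, 1]; 1 means perfect agreement in both scale and location.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    mean_p, mean_t = pred.mean(), target.mean()
    var_p, var_t = pred.var(), target.var()
    # Population covariance between predictions and targets.
    cov = np.mean((pred - mean_p) * (target - mean_t))
    return 2.0 * cov / (var_p + var_t + (mean_p - mean_t) ** 2)
```

Unlike plain Pearson correlation, a constant offset in the predictions lowers CCC, which is why it is the usual metric for dimensional emotion regression on MSP-Podcast.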


