Pre-Finetuning for Few-Shot Emotional Speech Recognition

02/24/2023
by Maximillian Chen, et al.

Speech models have long been known to overfit to individual speakers in many classification tasks. This leads to poor generalization when speakers are out-of-domain or out-of-distribution, as is common in production environments. We view speaker adaptation as a few-shot learning problem and propose investigating transfer learning approaches inspired by the recent success of pre-trained models in natural language tasks. We propose pre-finetuning speech models on difficult tasks to distill knowledge into few-shot downstream classification objectives. We pre-finetune Wav2Vec2.0 on every permutation of four multiclass emotional speech recognition corpora and evaluate the pre-finetuned models through 33,600 few-shot fine-tuning trials on the Emotional Speech Dataset.
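To make "every permutation of four corpora" concrete, the following is a minimal sketch of how such pre-finetuning orders could be enumerated. It is an illustration only, not the authors' code. The corpus names are placeholders, since the abstract does not name the four corpora, and the helper `pre_finetuning_orders` is hypothetical. The abstract also does not say whether "permutation" covers only orderings of all four corpora or ordered subsets as well, so the sketch counts both readings.

```python
from itertools import permutations

# Placeholder names: the abstract does not name the four corpora.
CORPORA = ["corpus_a", "corpus_b", "corpus_c", "corpus_d"]


def pre_finetuning_orders(corpora, min_len=1):
    """Return every ordered sequence of distinct corpora with at least min_len entries."""
    orders = []
    for k in range(min_len, len(corpora) + 1):
        # permutations(corpora, k) yields ordered k-tuples without repetition
        orders.extend(permutations(corpora, k))
    return orders


# Reading 1: only orderings that use all four corpora.
full_orders = pre_finetuning_orders(CORPORA, min_len=len(CORPORA))

# Reading 2: ordered subsets of any size from one to four corpora.
all_orders = pre_finetuning_orders(CORPORA)

print(len(full_orders))  # 24
print(len(all_orders))   # 64
```

Under either reading, each order would correspond to one pre-finetuned model, which is then fine-tuned on few-shot samples from the downstream dataset. How the 33,600 trials are divided across those models is not stated in the abstract.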


