A Fine-tuned Wav2vec 2.0/HuBERT Benchmark For Speech Emotion Recognition, Speaker Verification and Spoken Language Understanding

11/04/2021
by Yingzhi Wang, et al.

Self-supervised speech representations such as wav2vec 2.0 and HuBERT are making revolutionary progress in Automatic Speech Recognition (ASR). However, it has not been fully established that self-supervised models yield better performance on tasks other than ASR. In this work, we explore partial fine-tuning and entire fine-tuning of wav2vec 2.0 and HuBERT pre-trained models on three non-ASR speech tasks: Speech Emotion Recognition, Speaker Verification and Spoken Language Understanding. We also compare pre-trained models with and without ASR fine-tuning. With simple downstream frameworks, the best scores reach 79.58% weighted accuracy for Speech Emotion Recognition on IEMOCAP, a 2.36% equal error rate for Speaker Verification, and 75.32% F1 for Slot Filling in Spoken Language Understanding, setting a new state of the art on these three benchmarks and showing that fine-tuned wav2vec 2.0 and HuBERT models can better learn prosodic, voice-print and semantic representations.
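
To make the distinction between partial and entire fine-tuning concrete, here is a minimal sketch of a wav2vec 2.0 backbone with a simple downstream classification head, written with the Hugging Face transformers library. The checkpoint name, the mean-pooling head and the four-class emotion setup are illustrative assumptions, not the authors' exact configuration.

import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class EmotionClassifier(nn.Module):
    """Sketch: pre-trained wav2vec 2.0 backbone + mean pooling + linear head."""
    def __init__(self, num_emotions: int = 4, partial_finetune: bool = True):
        super().__init__()
        # Pre-trained self-supervised backbone (illustrative checkpoint name).
        self.backbone = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        if partial_finetune:
            # Partial fine-tuning: freeze the CNN feature encoder so that only
            # the transformer layers and the task head are updated.
            for p in self.backbone.feature_extractor.parameters():
                p.requires_grad = False
        # Simple downstream framework: average pooling over time + linear layer.
        self.head = nn.Linear(self.backbone.config.hidden_size, num_emotions)

    def forward(self, input_values: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(input_values).last_hidden_state  # (batch, frames, dim)
        pooled = hidden.mean(dim=1)                             # utterance-level embedding
        return self.head(pooled)                                # emotion logits

# Usage: a batch of two 2-second utterances sampled at 16 kHz.
model = EmotionClassifier(num_emotions=4, partial_finetune=True)
logits = model(torch.randn(2, 32000))
print(logits.shape)  # torch.Size([2, 4])

Setting partial_finetune=False leaves every backbone parameter trainable (entire fine-tuning), and swapping Wav2Vec2Model for HubertModel gives the corresponding HuBERT variant; the same pattern extends to speaker verification and spoken language understanding heads.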

Related research

05/20/2023 - Self-supervised representations in speech-based depression detection
This paper proposes handling training data sparsity in speech-based auto...

03/18/2023 - A Deep Learning System for Domain-specific Speech Recognition
As human-machine voice interfaces provide easy access to increasingly in...

06/10/2023 - What Can an Accent Identifier Learn? Probing Phonetic and Prosodic Information in a Wav2vec2-based Accent Identification Model
This study is focused on understanding and quantifying the change in pho...

12/16/2022 - Context-aware Fine-tuning of Self-supervised Speech Models
Self-supervised pre-trained transformers have improved the state of the ...

08/24/2022 - IndicSUPERB: A Speech Processing Universal Performance Benchmark for Indian languages
A cornerstone in AI research has been the creation and adoption of stand...

06/08/2023 - PEFT-SER: On the Use of Parameter Efficient Transfer Learning Approaches For Speech Emotion Recognition Using Pre-trained Speech Models
Many recent studies have focused on fine-tuning pre-trained models for s...

07/08/2022 - Tandem Multitask Training of Speaker Diarisation and Speech Recognition for Meeting Transcription
Self-supervised-learning-based pre-trained models for speech data, such ...
