Evaluation of Speech Representations for MOS prediction

06/16/2023
by   Frederico S. Oliveira, et al.

In this paper, we evaluate feature extraction models for predicting speech quality. We also propose a model architecture that compares embeddings from supervised and self-supervised learning models with embeddings from speaker verification models for predicting the Mean Opinion Score (MOS). Our experiments were performed on the VCC2018 dataset and on BRSpeechMOS, a Brazilian Portuguese dataset created for this work. The results show that the Whisper model is appropriate in all scenarios, on both the VCC2018 and BRSpeechMOS datasets. On BRSpeechMOS, Whisper-Small achieved the best linear correlation among the supervised and self-supervised learning models, 0.6980, and the speaker verification model SpeakerNet achieved a linear correlation of 0.6963. On VCC2018, the best supervised or self-supervised learning model, Whisper-Large, achieved a linear correlation of 0.7274, and the best speaker verification model, TitaNet, achieved a linear correlation of 0.6933. Although the results of the speaker verification models are slightly lower, the SpeakerNet model has only 5M parameters, making it suitable for real-time applications, and the TitaNet model produces an embedding of size 192, the smallest among all the evaluated models. The experimental results are reproducible with publicly available source code.
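For readers unfamiliar with the metric, the linear correlation figures above can be read as agreement between predicted and human-rated MOS values. The following is a minimal sketch, assuming that "linear correlation" refers to the Pearson coefficient, as is common in MOS prediction work; the abstract does not state this explicitly. The function name `linear_correlation` and the sample scores are hypothetical and are not taken from the paper or its source code.

```python
import math


def linear_correlation(predicted, reference):
    """Pearson linear correlation between two equal-length score sequences."""
    if len(predicted) != len(reference) or len(predicted) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_r = sum(reference) / n
    # Covariance and variances, left unnormalised because the 1/n factors cancel.
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(predicted, reference))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    if var_p == 0 or var_r == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_p * var_r)


# Hypothetical scores for illustration only; these are not results from the paper.
reference_mos = [1.5, 2.0, 3.25, 4.0, 4.5]
predicted_mos = [1.8, 2.4, 3.0, 3.6, 4.4]
print(round(linear_correlation(predicted_mos, reference_mos), 4))
```

A value close to 1 means the predicted scores rise and fall with the human ratings, so a coefficient near 0.70 indicates substantial but imperfect agreement.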


