Using Rater and System Metadata to Explain Variance in the VoiceMOS Challenge 2022 Dataset

09/14/2022
by Michael Chinen, et al.

Non-reference speech quality models are important for a growing number of applications. The VoiceMOS 2022 challenge provided a dataset of synthetic voice conversion and text-to-speech samples with subjective labels. This study examines how much of the variance in subjective speech quality ratings can be explained by metadata and by the distribution imbalances of the dataset. Speech quality models were constructed using wav2vec 2.0 with additional metadata features, including rater groups and system identifiers. They obtained competitive metrics: a Spearman rank correlation coefficient (SRCC) of 0.934 and a mean squared error (MSE) of 0.088 at the system level, and 0.877 and 0.198 at the utterance level. Using data and metadata that the test restricted or blinded further improved the metrics. A metadata analysis showed that the system-level metrics do not represent the model's system-level prediction, because the number of utterances used for each system varies widely in the validation and test datasets. We conclude that, in general, conditions should have enough utterances in the test set to bound the sample mean error and should be relatively balanced in utterance count between systems; otherwise, the utterance-level metrics may be more reliable and interpretable.
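
To make the distinction between utterance-level and system-level metrics concrete, here is a minimal sketch in plain Python. It is not the authors' evaluation code. The toy records, the system names "A", "B", and "C", and all function names are hypothetical, chosen only to show how a system with a single utterance carries as much weight in the system-level SRCC as a system with many.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = tied_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def srcc(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mse(x, y):
    return mean([(a - b) ** 2 for a, b in zip(x, y)])


def standard_error_of_mean(std_dev, n):
    """Standard error of a sample mean over n utterances."""
    return std_dev / sqrt(n)


def evaluate(records):
    """records: list of (system_id, true_mos, predicted_mos) tuples."""
    true = [t for _, t, _ in records]
    pred = [p for _, _, p in records]
    by_system = defaultdict(list)
    for system_id, t, p in records:
        by_system[system_id].append((t, p))
    # System-level scores are per-system means, one point per system,
    # regardless of how many utterances each system contributed.
    sys_true = [mean([t for t, _ in v]) for v in by_system.values()]
    sys_pred = [mean([p for _, p in v]) for v in by_system.values()]
    return {
        "utterance_srcc": srcc(true, pred),
        "utterance_mse": mse(true, pred),
        "system_srcc": srcc(sys_true, sys_pred),
        "system_mse": mse(sys_true, sys_pred),
        "utterances_per_system": {k: len(v) for k, v in by_system.items()},
    }


# Hypothetical toy data: system "B" has only one utterance.
records = [
    ("A", 4.0, 3.8), ("A", 4.5, 4.4), ("A", 3.5, 3.9), ("A", 4.0, 4.1),
    ("B", 2.0, 3.0),
    ("C", 3.0, 2.8), ("C", 2.5, 2.9), ("C", 3.5, 3.1),
]
result = evaluate(records)
print(result)
```

In this toy data, the single poorly predicted utterance for system "B" swaps the predicted order of "B" and "C", so it changes the system-level ranking on its own. The `standard_error_of_mean` helper reflects the usual relationship that the error of a sample mean shrinks with the square root of the utterance count, which is one way to read the abstract's recommendation that each condition have enough utterances.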
