SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis

04/06/2022
by Georgia Maniati, et al.

In this work, we present the SOMOS dataset, the first large-scale mean opinion score (MOS) dataset consisting solely of neural text-to-speech (TTS) samples. It can be employed to train automatic MOS prediction systems focused on the assessment of modern synthesizers, and can stimulate advancements in acoustic model evaluation. It comprises 20K synthetic utterances of the LJ Speech voice, a public-domain speech dataset that is a common benchmark for building neural acoustic models and vocoders. Utterances are generated from 200 TTS systems, including vanilla neural acoustic models as well as models that allow prosodic variation. An LPCNet vocoder is used for all systems, so that variation across samples depends only on the acoustic models. The synthesized utterances provide balanced and adequate coverage of domain and utterance length. We collect MOS naturalness evaluations in three English Amazon Mechanical Turk locales and share practices that lead to reliable crowdsourced annotations for this task. We present baseline results of state-of-the-art MOS prediction models on the SOMOS dataset and highlight the challenges such models face when assigned to evaluate synthetic utterances.
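Since the dataset is intended for training and benchmarking automatic MOS prediction models, a short sketch of how such a model is typically scored against listener MOS labels may be useful. The Python snippet below computes the metrics usually reported for this task: mean squared error and Pearson/Spearman correlations, both at the utterance level and after averaging per TTS system. The file name (somos_predictions.csv) and column names (system_id, mos, predicted_mos) are illustrative assumptions, not the official SOMOS schema or evaluation code.

# Sketch: scoring a MOS prediction model against crowdsourced MOS labels.
# File and column names are illustrative assumptions, not the official
# SOMOS schema.
import pandas as pd
from scipy.stats import pearsonr, spearmanr

def score(df: pd.DataFrame) -> dict:
    """Return the metrics commonly reported for MOS prediction:
    MSE and Pearson/Spearman correlations at the utterance level,
    plus correlations after averaging scores per TTS system."""
    mse = ((df["mos"] - df["predicted_mos"]) ** 2).mean()
    utt_pcc, _ = pearsonr(df["mos"], df["predicted_mos"])
    utt_srcc, _ = spearmanr(df["mos"], df["predicted_mos"])

    # System-level view: average true and predicted MOS per TTS system,
    # then correlate the per-system averages (one pair per system).
    per_system = df.groupby("system_id")[["mos", "predicted_mos"]].mean()
    sys_pcc, _ = pearsonr(per_system["mos"], per_system["predicted_mos"])
    sys_srcc, _ = spearmanr(per_system["mos"], per_system["predicted_mos"])

    return {
        "utterance_MSE": mse,
        "utterance_PCC": utt_pcc,
        "utterance_SRCC": utt_srcc,
        "system_PCC": sys_pcc,
        "system_SRCC": sys_srcc,
    }

if __name__ == "__main__":
    predictions = pd.read_csv("somos_predictions.csv")  # hypothetical export
    print(score(predictions))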

Related research

09/14/2023 · SingFake: Singing Voice Deepfake Detection
The rise of singing voice synthesis presents critical challenges to arti...

10/18/2021 · FMFCC-A: A Challenging Mandarin Dataset for Synthetic Speech Detection
As increasing development of text-to-speech (TTS) and voice conversion (...

06/21/2021 · Computational Pronunciation Analysis in Sung Utterances
Recent automatic lyrics transcription (ALT) approaches focus on building...

07/06/2021 · Location, Location: Enhancing the Evaluation of Text-to-Speech Synthesis Using the Rapid Prosody Transcription Paradigm
Text-to-Speech synthesis systems are generally evaluated using Mean Opin...

07/04/2022 · BERT, can HE predict contrastive focus? Predicting and controlling prominence in neural TTS using a language model
Several recent studies have tested the use of transformer language model...

09/14/2022 · Using Rater and System Metadata to Explain Variance in the VoiceMOS Challenge 2022 Dataset
Non-reference speech quality models are important for a growing number o...

08/11/2020 · Bunched LPCNet: Vocoder for Low-cost Neural Text-To-Speech Systems
LPCNet is an efficient vocoder that combines linear prediction and deep ...
