Location, Location: Enhancing the Evaluation of Text-to-Speech Synthesis Using the Rapid Prosody Transcription Paradigm

07/06/2021
by   Elijah Gutierrez, et al.
0

Text-to-Speech synthesis systems are generally evaluated using Mean Opinion Score (MOS) tests, where listeners score samples of synthetic speech on a Likert scale. A major drawback of MOS tests is that they only offer a general measure of overall quality-i.e., the naturalness of an utterance-and so cannot tell us where exactly synthesis errors occur. This can make evaluation of the appropriateness of prosodic variation within utterances inconclusive. To address this, we propose a novel evaluation method based on the Rapid Prosody Transcription paradigm. This allows listeners to mark the locations of errors in an utterance in real-time, providing a probabilistic representation of the perceptual errors that occur in the synthetic signal. We conduct experiments that confirm that the fine-grained evaluation can be mapped to system rankings of standard MOS tests, but the error marking gives a much more comprehensive assessment of synthesized prosody. In particular, for standard audiobook test set samples, we see that error marks consistently cluster around words at major prosodic boundaries indicated by punctuation. However, for question-answer based stimuli, where we control information structure, we see differences emerge in the ability of neural TTS systems to generate context-appropriate prosodic prominence.

READ FULL TEXT
research
08/01/2021

End to End Bangla Speech Synthesis

Text-to-Speech (TTS) system is a system where speech is synthesized from...
research
04/06/2022

SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis

In this work, we present the SOMOS dataset, the first large-scale mean o...
research
10/27/2022

FCTalker: Fine and Coarse Grained Context Modeling for Expressive Conversational Speech Synthesis

Conversational Text-to-Speech (TTS) aims to synthesis an utterance with ...
research
03/06/2022

Variational Auto-Encoder based Mandarin Speech Cloning

Speech cloning technology is becoming more sophisticated thanks to the a...
research
10/07/2015

Hierarchical Representation of Prosody for Statistical Speech Synthesis

Prominences and boundaries are the essential constituents of prosodic st...
research
08/09/2020

Deep MOS Predictor for Synthetic Speech Using Cluster-Based Modeling

While deep learning has made impressive progress in speech synthesis and...
research
02/22/2022

Wavebender GAN: An architecture for phonetically meaningful speech manipulation

Deep learning has revolutionised synthetic speech quality. However, it h...

Please sign up or login with your details

Forgot password? Click here to reset