Evaluating Long-form Text-to-Speech: Comparing the Ratings of Sentences and Paragraphs

09/09/2019
by Rob Clark, et al.

Text-to-speech systems are typically evaluated on single sentences. When long-form content, such as full paragraphs or dialogues, is considered, evaluating sentences in isolation is not always appropriate, because the context in which the sentences are synthesized is missing. In this paper, we investigate three different ways of evaluating the naturalness of long-form text-to-speech synthesis. We compare the results obtained from evaluating sentences in isolation, evaluating whole paragraphs of speech, and presenting a selection of speech or text as context and evaluating the subsequent speech. We find that, even though these three evaluations are based on the same material, the outcomes differ per setting, and moreover that these outcomes do not necessarily correlate with each other. We show that our findings are consistent between a single-speaker setting of read paragraphs and a two-speaker dialogue scenario. We conclude that, to evaluate the quality of long-form speech, the traditional way of evaluating sentences in isolation does not suffice, and that multiple evaluations are required.
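To make the comparison between evaluation settings concrete, the following is a minimal sketch of how one might check whether two settings agree. It is not the authors' analysis: the abstract does not say which statistic was used. The sketch assumes each setting yields listener ratings on a 1-5 scale per system, summarizes them as a mean opinion score, and compares the settings with a Spearman rank correlation. The system names, setting names, and rating values are invented purely for illustration.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks of values, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences with non-zero variance."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical ratings (made-up numbers, NOT results from the paper).
ratings = {
    "isolated_sentences": {"sys_a": [4, 4, 5, 3], "sys_b": [3, 4, 3, 3], "sys_c": [2, 3, 3, 2]},
    "whole_paragraphs": {"sys_a": [3, 3, 4, 3], "sys_b": [4, 4, 4, 5], "sys_c": [3, 2, 3, 3]},
}

systems = ["sys_a", "sys_b", "sys_c"]

# Mean opinion score per system, per setting, in a fixed system order.
mos = {
    setting: [mean(per_system[s]) for s in systems]
    for setting, per_system in ratings.items()
}

agreement = spearman(mos["isolated_sentences"], mos["whole_paragraphs"])
print(mos)
print(agreement)
```

A rank correlation well below 1 in such a comparison would indicate that the two settings order the systems differently, which is the kind of disagreement the abstract reports. With only a handful of systems the statistic is very coarse, so it should be read as a rough indicator rather than a test.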
