Spontaneous speech synthesis with linguistic-speech consistency training using pseudo-filled pauses

10/18/2022
by Yuta Matsunaga, et al.

We propose a training method for spontaneous speech synthesis models that guarantees the consistency of the linguistic parts of synthesized speech. Personalized spontaneous speech synthesis aims to reproduce a speaker's individual disfluencies, such as filled pauses. Our prior model includes a filled-pause prediction model and synthesizes speech containing filled pauses from text that has none. However, inserting filled pauses degrades the quality of the linguistic parts of the synthesized speech. This might be because filled-pause insertion tendencies differ between training and inference, so that at inference the synthesis model cannot represent the connections between filled pauses and their surrounding phonemes. We therefore developed linguistic-speech consistency training, which guarantees that the linguistic parts of synthetic speech are consistent with and without filled pauses. The proposed consistency training uses not only ground-truth filled pauses but also pseudo ones. Our experiments demonstrate that this method improves the naturalness of both the synthetic linguistic speech and the entire synthetic speech that includes the predicted filled pauses.
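The abstract does not give the form of the consistency objective, so the following is only a toy sketch of the general idea, not the authors' implementation. It assumes that pseudo filled pauses are inserted at random before words, that the objective is a mean absolute difference over the non-filled-pause positions, and that a stand-in function replaces the real synthesis model. Every name here (`FILLED_PAUSES`, `insert_pseudo_filled_pauses`, `toy_synthesize`, `consistency_loss`) is hypothetical.

```python
import random

# Hypothetical filled-pause inventory; the real one depends on language and speaker.
FILLED_PAUSES = ["uh", "um"]


def insert_pseudo_filled_pauses(words, rate, rng):
    """Randomly insert a pseudo filled pause before each word.

    Returns the new token list and a parallel mask that is True
    at positions holding a filled pause.
    """
    tokens, is_fp = [], []
    for word in words:
        if rng.random() < rate:
            tokens.append(rng.choice(FILLED_PAUSES))
            is_fp.append(True)
        tokens.append(word)
        is_fp.append(False)
    return tokens, is_fp


def toy_synthesize(tokens):
    """Stand-in for a synthesis model (assumption, not a real TTS system).

    Emits one scalar "feature" per token that depends on the previous
    token, so inserting a filled pause changes the neighbouring output.
    """
    feats, prev = [], ""
    for tok in tokens:
        feats.append(len(tok) + 0.1 * len(prev))
        prev = tok
    return feats


def consistency_loss(words, rate=0.5, seed=0):
    """Mean absolute difference between the linguistic parts of the
    outputs synthesized with and without pseudo filled pauses."""
    rng = random.Random(seed)
    tokens, is_fp = insert_pseudo_filled_pauses(words, rate, rng)
    with_fp = toy_synthesize(tokens)
    without_fp = toy_synthesize(words)
    # Keep only the positions that correspond to the original words.
    linguistic = [f for f, fp in zip(with_fp, is_fp) if not fp]
    return sum(abs(a - b) for a, b in zip(linguistic, without_fp)) / len(without_fp)
```

In this sketch the loss is zero when no pseudo filled pause is inserted and grows when inserted pauses perturb the neighbouring words' outputs, which is the kind of mismatch a consistency term would penalize during training.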

research
10/14/2022

Empirical Study Incorporating Linguistic Knowledge on Filled Pauses for Personalized Spontaneous Speech Synthesis

We present a comprehensive empirical study for personalized spontaneous ...
research
02/24/2023

PITS: Variational Pitch Inference without Fundamental Frequency for End-to-End Pitch-controllable TTS

Previous pitch-controllable text-to-speech (TTS) models rely on directly...
research
10/22/2020

How Similar or Different Is Rakugo Speech Synthesizer to Professional Performers?

We have been working on speech synthesis for rakugo (a traditional Japan...
research
03/29/2022

Applying Syntax–Prosody Mapping Hypothesis and Prosodic Well-Formedness Constraints to Neural Sequence-to-Sequence Speech Synthesis

End-to-end text-to-speech synthesis (TTS), which generates speech sounds...
research
08/02/2018

Investigating accuracy of pitch-accent annotations in neural network-based speech synthesis and denoising effects

We investigated the impact of noisy linguistic features on the performan...
research
07/05/2021

Speech Synthesis from Text and Ultrasound Tongue Image-based Articulatory Input

Articulatory information has been shown to be effective in improving the...
research
09/13/2022

Deep Speech Synthesis from Articulatory Representations

In the articulatory synthesis task, speech is synthesized from input fea...
