ParaTTS: Learning Linguistic and Prosodic Cross-sentence Information in Paragraph-based TTS

09/14/2022
by   Liumeng Xue, et al.

Recent advances in neural end-to-end TTS models have yielded high-quality, natural synthesized speech for conventional sentence-based TTS. However, it remains challenging to reproduce similar quality when a whole paragraph is considered, since a large amount of contextual information must be taken into account when building a paragraph-based TTS model. To alleviate the training difficulty, we propose to model linguistic and prosodic information by considering the cross-sentence structure embedded in a paragraph. Three sub-modules, a linguistics-aware network, a prosody-aware network, and a sentence-position network, are trained together with a modified Tacotron2. Specifically, the linguistics-aware and prosody-aware networks learn the information embedded in a paragraph and the relations among its component sentences: paragraph-level information is captured by encoders, and inter-sentence information is learned with multi-head attention mechanisms. The relative position of each sentence in the paragraph is explicitly exploited by the sentence-position network. Trained on a storytelling audio-book corpus (4.08 hours) recorded by a female Mandarin Chinese speaker, the proposed model produces rather natural, good-quality speech paragraph-wise. Cross-sentence contextual information, such as breaks and prosodic variation between consecutive sentences, is predicted and rendered better than by a sentence-based model. Tested on paragraph texts whose lengths are similar to, longer than, or much longer than the typical paragraph length of the training data, the speech produced by the new model is consistently preferred over that of the sentence-based model in subjective tests, and this preference is confirmed by objective measures.
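The abstract describes inter-sentence information being learned with multi-head attention over per-sentence representations, combined with an explicit sentence-position signal. The following is a minimal NumPy sketch of that idea, not the paper's actual implementation: all dimensions, weight shapes, and the simple normalized-position embedding are illustrative assumptions, and the per-sentence embeddings stand in for the outputs of the linguistics-aware or prosody-aware encoders.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, n_heads, Wq, Wk, Wv, Wo):
    """Self-attention across the sentence axis of one paragraph.

    X: (n_sentences, d_model), one embedding per component sentence.
    """
    n, d = X.shape
    d_head = d // n_heads
    # Project and split into heads: (n_heads, n_sentences, d_head).
    Q = (X @ Wq).reshape(n, n_heads, d_head).transpose(1, 0, 2)
    K = (X @ Wk).reshape(n, n_heads, d_head).transpose(1, 0, 2)
    V = (X @ Wv).reshape(n, n_heads, d_head).transpose(1, 0, 2)
    # Scaled dot-product attention between every pair of sentences.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, n, n)
    ctx = softmax(scores) @ V                             # (heads, n, d_head)
    # Merge heads and apply the output projection.
    return ctx.transpose(1, 0, 2).reshape(n, d) @ Wo

d_model, n_heads, n_sentences = 16, 4, 5
W = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4)]

# Stand-ins for per-sentence encoder outputs (linguistic or prosodic).
sent_emb = rng.standard_normal((n_sentences, d_model))

# Relative sentence position in the paragraph (0.0 .. 1.0), mapped through
# a toy linear embedding; the paper's sentence-position network is learned.
pos = np.arange(n_sentences) / (n_sentences - 1)
pos_emb = pos[:, None] * rng.standard_normal((1, d_model)) * 0.1

# Cross-sentence context that would condition the Tacotron2 decoder.
context = multi_head_attention(sent_emb + pos_emb, n_heads, *W)
print(context.shape)  # (5, 16)
```

Each row of `context` now mixes information from every sentence in the paragraph, weighted by attention, which is the mechanism the model uses to render inter-sentence breaks and prosodic variation.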

Related research

11/06/2020
Improving Prosody Modelling with Cross-Utterance BERT Embeddings for End-to-end Speech Synthesis
Despite prosody is related to the linguistic information up to the disco...

03/23/2022
Towards Expressive Speaking Style Modelling with Hierarchical Context Information for Mandarin Speech Synthesis
Previous works on expressive speech synthesis mainly focus on current se...

11/15/2016
End-to-End Neural Sentence Ordering Using Pointer Network
Sentence ordering is one of important tasks in NLP. Previous works mainl...

02/27/2023
Duration-aware pause insertion using pre-trained language model for multi-speaker text-to-speech
Pause insertion, also known as phrase break prediction and phrasing, is ...

11/23/2020
Sarcasm detection from user-generated noisy short text
Sentiment analysis of social media comments is very important for review...

08/31/2023
Improving Mandarin Prosodic Structure Prediction with Multi-level Contextual Information
For text-to-speech (TTS) synthesis, prosodic structure prediction (PSP) ...

04/09/2019
Exploiting Syntactic Features in a Parsed Tree to Improve End-to-End TTS
The end-to-end TTS, which can predict speech directly from a given seque...
