Improving Prosody Modelling with Cross-Utterance BERT Embeddings for End-to-end Speech Synthesis

11/06/2020
by   Guanghui Xu, et al.
0

Despite prosody is related to the linguistic information up to the discourse structure, most text-to-speech (TTS) systems only take into account that within each sentence, which makes it challenging when converting a paragraph of texts into natural and expressive speech. In this paper, we propose to use the text embeddings of the neighboring sentences to improve the prosody generation for each utterance of a paragraph in an end-to-end fashion without using any explicit prosody features. More specifically, cross-utterance (CU) context vectors, which are produced by an additional CU encoder based on the sentence embeddings extracted by a pre-trained BERT model, are used to augment the input of the Tacotron2 decoder. Two types of BERT embeddings are investigated, which leads to the use of different CU encoder structures. Experimental results on a Mandarin audiobook dataset and the LJ-Speech English audiobook dataset demonstrate the use of CU information can improve the naturalness and expressiveness of the synthesized speech. Subjective listening testing shows most of the participants prefer the voice generated using the CU encoder over that generated using standard Tacotron2. It is also found that the prosody can be controlled indirectly by changing the neighbouring sentences.

READ FULL TEXT
research
08/31/2023

Towards Improving the Expressiveness of Singing Voice Synthesis with BERT Derived Semantic Information

This paper presents an end-to-end high-quality singing voice synthesis (...
research
09/14/2022

ParaTTS: Learning Linguistic and Prosodic Cross-sentence Information in Paragraph-based TTS

Recent advancements in neural end-to-end TTS models have shown high-qual...
research
11/01/2022

Investigating Content-Aware Neural Text-To-Speech MOS Prediction Using Prosodic and Linguistic Features

Current state-of-the-art methods for automatic synthetic speech evaluati...
research
11/10/2022

Assistive Completion of Agrammatic Aphasic Sentences: A Transfer Learning Approach using Neurolinguistics-based Synthetic Dataset

Damage to the inferior frontal gyrus (Broca's area) can cause agrammatic...
research
11/11/2022

MaskedSpeech: Context-aware Speech Synthesis with Masking Strategy

Humans often speak in a continuous manner which leads to coherent and co...
research
04/14/2021

Dependency Parsing based Semantic Representation Learning with Graph Neural Network for Enhancing Expressiveness of Text-to-Speech

Semantic information of a sentence is crucial for improving the expressi...
research
08/03/2020

Exploiting Deep Sentential Context for Expressive End-to-End Speech Synthesis

Attention-based seq2seq text-to-speech systems, especially those use sel...

Please sign up or login with your details

Forgot password? Click here to reset