Duration-aware pause insertion using pre-trained language model for multi-speaker text-to-speech

02/27/2023
by Dong Yang, et al.

Pause insertion, also known as phrase break prediction or phrasing, is an essential part of TTS systems because properly placed pauses with natural durations significantly enhance the rhythm and intelligibility of synthetic speech. However, conventional phrasing models ignore the fact that speakers differ in how they insert silent pauses, which can degrade the performance of a model trained on a multi-speaker speech corpus. To address this, we propose more powerful pause insertion frameworks based on a pre-trained language model. Our approach uses bidirectional encoder representations from transformers (BERT) pre-trained on a large-scale text corpus and injects speaker embeddings to capture speaker characteristics. We also leverage duration-aware pause insertion for more natural multi-speaker TTS. We develop and evaluate two types of models. The first improves on conventional phrasing models in predicting the positions of respiratory pauses (RPs), i.e., silent pauses at word transitions without punctuation. It performs speaker-conditioned RP prediction that takes contextual information into account, and we use it to demonstrate the effect of speaker information on the prediction. The second model is designed for phoneme-based TTS models and performs duration-aware pause insertion, predicting both RPs and punctuation-indicated pauses (PIPs), which are categorized by duration. The evaluation results show that our models improve the precision and recall of pause insertion and the rhythm of synthetic speech.
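To make the two pause types concrete, the following is a minimal sketch of how training labels for duration-aware pause insertion might be derived from aligned speech. It is not the authors' pipeline: the function names, the minimum-silence floor, and the duration bins are all hypothetical, since the abstract does not state them. The sketch also assumes that both RPs and PIPs receive a duration class.

```python
import string

# Hypothetical duration bins in seconds; the abstract does not state the
# thresholds or the number of categories the authors use.
SHORT_MAX = 0.3
MEDIUM_MAX = 0.7
MIN_PAUSE = 0.05  # assumed floor below which a silence is not counted as a pause


def duration_class(seconds):
    """Map a pause duration to a coarse category (assumed bins)."""
    if seconds <= SHORT_MAX:
        return "short"
    if seconds <= MEDIUM_MAX:
        return "medium"
    return "long"


def label_pauses(tokens, pause_after):
    """Label the transition after each token.

    tokens: list of words with punctuation still attached (e.g. "Well,").
    pause_after: silence in seconds following each token.
    Returns one label per token: "none", "RP-<class>", or "PIP-<class>".
    """
    if len(tokens) != len(pause_after):
        raise ValueError("tokens and pause_after must have equal length")
    labels = []
    for token, silence in zip(tokens, pause_after):
        if silence < MIN_PAUSE:
            labels.append("none")
            continue
        # A pause after punctuation is a PIP; any other pause is an RP.
        kind = "PIP" if token and token[-1] in string.punctuation else "RP"
        labels.append(f"{kind}-{duration_class(silence)}")
    return labels


tokens = ["Well,", "I", "think", "we", "should", "go."]
pauses = [0.25, 0.0, 0.45, 0.0, 0.0, 0.9]
print(label_pauses(tokens, pauses))
```

In a speaker-conditioned setup, label sequences like these would be paired with a speaker identity for each utterance, so that a classifier can learn that different speakers pause at different positions and for different lengths of time.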

