Context-aware Coherent Speaking Style Prediction with Hierarchical Transformers for Audiobook Speech Synthesis

04/13/2023
by Shun Lei, et al.

Recent advances in text-to-speech have significantly improved the expressiveness of synthesized speech. However, it is still challenging to generate speech with a contextually appropriate and coherent speaking style for multi-sentence text in audiobooks. In this paper, we propose a context-aware coherent speaking style prediction method for audiobook speech synthesis. To predict the style embedding of the current utterance, we design a hierarchical transformer-based context-aware style predictor with a mixture attention mask, which considers both text-side context information and speech-side style information from previous utterances. Based on this, we can generate long-form speech with coherent style and prosody sentence by sentence. Objective and subjective evaluations on a Mandarin audiobook dataset demonstrate that our proposed model generates speech with a more expressive and coherent speaking style than the baselines in both single-sentence and multi-sentence tests.
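The sentence-by-sentence procedure described in the abstract can be illustrated with a minimal Python sketch. It only shows the control flow: the style embedding of each utterance is predicted from a window of text context and from the style embeddings already produced for earlier utterances. Everything named here is hypothetical and not taken from the paper: the function `predict_styles_sequentially`, the window sizes `context_size` and `history_size`, the use of following sentences as text context, and the `toy_predictor` stand-in, which replaces the hierarchical transformer and does not model the mixture attention mask.

```python
from typing import Callable, List, Sequence

StyleVec = List[float]
# Hypothetical predictor signature: (text window, index of the current
# sentence inside that window, previous style embeddings) -> style embedding.
Predictor = Callable[[Sequence[str], int, Sequence[StyleVec]], StyleVec]


def predict_styles_sequentially(
    sentences: Sequence[str],
    predict_style: Predictor,
    context_size: int = 2,   # assumed text window on each side
    history_size: int = 2,   # assumed number of previous styles to reuse
) -> List[StyleVec]:
    """Predict one style embedding per sentence, in reading order."""
    styles: List[StyleVec] = []
    for i in range(len(sentences)):
        lo = max(0, i - context_size)
        hi = min(len(sentences), i + context_size + 1)
        text_context = sentences[lo:hi]
        # Only styles of already-processed utterances are visible.
        prev_styles = styles[max(0, i - history_size):]
        styles.append(predict_style(text_context, i - lo, prev_styles))
    return styles


def toy_predictor(
    text_context: Sequence[str],
    current_index: int,
    prev_styles: Sequence[StyleVec],
) -> StyleVec:
    """Stand-in for a learned model: blends a trivial text feature
    (sentence length) with the mean of the previous style values."""
    base = float(len(text_context[current_index]))
    if not prev_styles:
        return [base]
    mean_prev = sum(s[0] for s in prev_styles) / len(prev_styles)
    return [0.5 * base + 0.5 * mean_prev]
```

In this sketch, feeding earlier predictions back into later ones is what ties neighbouring utterances together; a real system would pass each predicted embedding to the acoustic model before moving on to the next sentence.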

research
04/06/2022

Towards Multi-Scale Speaking Style Modelling with Hierarchical Context Information for Mandarin Speech Synthesis

Previous works on expressive speech synthesis focus on modelling the mon...
research
06/25/2022

Self-supervised Context-aware Style Representation for Expressive Speech Synthesis

Expressive speech synthesis, like audiobook synthesis, is still challeng...
research
06/29/2022

Simple and Effective Multi-sentence TTS with Expressive and Coherent Prosody

Generating expressive and contextually appropriate prosody remains a cha...
research
11/11/2022

MaskedSpeech: Context-aware Speech Synthesis with Masking Strategy

Humans often speak in a continuous manner which leads to coherent and co...
research
08/07/2020

Controllable Neural Prosody Synthesis

Speech synthesis has recently seen significant improvements in fidelity,...
research
10/20/2017

Detecting Online Hate Speech Using Context Aware Models

In the wake of a polarizing election, the cyber world is laden with hate...
research
07/29/2023

MSStyleTTS: Multi-Scale Style Modeling with Hierarchical Context Information for Expressive Speech Synthesis

Expressive speech synthesis is crucial for many human-computer interacti...
