Prosody Learning Mechanism for Speech Synthesis System Without Text Length Limit

08/13/2020
by   Zhen Zeng, et al.

Recent neural speech synthesis systems have increasingly focused on prosody control to improve the quality of synthesized speech, but they rarely consider the variability of prosody and the correlation between prosody and semantics together. In this paper, a prosody learning mechanism is proposed to model the prosody of speech on top of a TTS system: the prosody information of speech is extracted from the mel-spectrum by a prosody learner and combined with the phoneme sequence to reconstruct the mel-spectrum. Meanwhile, semantic features of the text from a pre-trained language model are introduced to improve the prosody prediction results. In addition, a novel self-attention structure, named local attention, is proposed to remove the restriction on input text length: the relative position information of the sequence is modeled by relative position matrices, so that position encodings are no longer needed. Experiments on English and Mandarin show that speech with more satisfactory prosody is obtained with our model. In Mandarin synthesis in particular, the proposed model outperforms the baseline by a MOS gap of 0.08, and the overall naturalness of the synthesized speech is significantly improved.
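The core idea behind attention with relative position matrices, as opposed to absolute position encodings, can be illustrated with a minimal sketch. This is not the paper's exact local attention module; it is a generic single-head self-attention in which a learned bias indexed by clipped relative distance (a hypothetical `rel_bias` table) replaces absolute position encodings, which is what lets the same parameters serve any input length:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def relative_attention(queries, keys, values, rel_bias, max_dist=2):
    """Single-head self-attention with relative position biases.

    Position is injected by adding rel_bias[clip(j - i)] to each
    attention logit, so no absolute position encoding is needed and
    sequences of any length are accepted.  rel_bias must have
    2 * max_dist + 1 entries, one per clipped relative distance in
    [-max_dist, max_dist].  All inputs are lists of equal-length
    float vectors.
    """
    d = len(queries[0])
    out = []
    for i, q in enumerate(queries):
        logits = []
        for j, k in enumerate(keys):
            # scaled dot-product score
            score = sum(qc * kc for qc, kc in zip(q, k)) / math.sqrt(d)
            # clip the relative distance so far-away tokens share a bias
            dist = max(-max_dist, min(max_dist, j - i))
            logits.append(score + rel_bias[dist + max_dist])
        w = softmax(logits)
        # weighted average of the value vectors
        out.append([sum(wj * v[c] for wj, v in zip(w, values))
                    for c in range(d)])
    return out
```

Because the bias depends only on `j - i`, the same table applies to a 3-token and a 300-token input alike, which is the sense in which such a model has no text length limit.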


Related research

06/13/2023 · PauseSpeech: Natural Speech Synthesis via Pre-trained Language Model and Pause-based Prosody Modeling
Although text-to-speech (TTS) systems have significantly improved, most ...

06/05/2023 · Rhythm-controllable Attention with High Robustness for Long Sentence Speech Synthesis
Regressive Text-to-Speech (TTS) system utilizes attention mechanism to g...

10/29/2018 · Investigation of enhanced Tacotron text-to-speech synthesis systems with self-attention for pitch accent language
End-to-end speech synthesis is a promising approach that directly conver...

01/03/2019 · Feature reinforcement with word embedding and parsing information in neural TTS
In this paper, we propose a feature reinforcement method under the seque...

05/18/2020 · Attentron: Few-Shot Text-to-Speech Utilizing Attention-Based Variable-Length Embedding
On account of growing demands for personalization, the need for a so-cal...

06/29/2020 · Prosodic Prominence and Boundaries in Sequence-to-Sequence Speech Synthesis
Recent advances in deep learning methods have elevated synthetic speech ...

08/07/2020 · Controllable Neural Prosody Synthesis
Speech synthesis has recently seen significant improvements in fidelity,...
