PauseSpeech: Natural Speech Synthesis via Pre-trained Language Model and Pause-based Prosody Modeling

06/13/2023
by Ji-Sang Hwang, et al.

Although text-to-speech (TTS) systems have improved significantly, most still struggle to synthesize speech with appropriate phrasing. Natural speech synthesis requires a phrasing structure that groups words into phrases based on semantic information. In this paper, we propose PauseSpeech, a speech synthesis system that combines a pre-trained language model with pause-based prosody modeling. First, we introduce a phrasing structure encoder that uses a context representation from the pre-trained language model. Within this encoder, we extract a speaker-dependent syntactic representation from the context representation and then predict a pause sequence that separates the input text into phrases. Second, we introduce a pause-based word encoder that models word-level prosody based on the pause sequence. Experimental results show that PauseSpeech outperforms previous models in terms of naturalness. In objective evaluations, the proposed methods also reduce the distance between ground-truth and synthesized speech. Audio samples are available at https://jisang93.github.io/pausespeech-demo/.
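To make the idea of a pause sequence separating the input text into phrases concrete, here is a minimal sketch in Python. It is not the PauseSpeech implementation. It assumes one integer pause label per word, with any nonzero label marking a pause after that word. The function name `split_into_phrases` and this label encoding are illustrative assumptions, not details from the paper.

```python
from typing import List


def split_into_phrases(words: List[str], pauses: List[int]) -> List[List[str]]:
    """Group words into phrases using a per-word pause sequence.

    Assumption (hypothetical encoding): pauses[i] != 0 means a pause
    follows words[i], which closes the current phrase.
    """
    if len(words) != len(pauses):
        raise ValueError("words and pauses must have the same length")

    phrases: List[List[str]] = []
    current: List[str] = []
    for word, pause in zip(words, pauses):
        current.append(word)
        if pause != 0:
            # A pause after this word ends the phrase.
            phrases.append(current)
            current = []
    if current:
        # Keep trailing words that are not followed by a pause.
        phrases.append(current)
    return phrases


if __name__ == "__main__":
    words = ["the", "cat", "sat", "on", "the", "mat"]
    pauses = [0, 1, 0, 0, 0, 1]
    print(split_into_phrases(words, pauses))
    # → [['the', 'cat'], ['sat', 'on', 'the', 'mat']]
```

In a real system the pause labels would come from a trained predictor rather than being written by hand; the grouping step itself would look much the same.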


