Referee: Towards reference-free cross-speaker style transfer with low-quality data for expressive speech synthesis

09/08/2021
by   Songxiang Liu, et al.
0

Cross-speaker style transfer (CSST) in text-to-speech (TTS) synthesis aims at transferring a speaking style to the synthesised speech in a target speaker's voice. Most previous CSST approaches rely on expensive high-quality data carrying desired speaking style during training and require a reference utterance to obtain speaking style descriptors as conditioning on the generation of a new sentence. This work presents Referee, a robust reference-free CSST approach for expressive TTS, which fully leverages low-quality data to learn speaking styles from text. Referee is built by cascading a text-to-style (T2S) model with a style-to-wave (S2W) model. Phonetic PosteriorGram (PPG), phoneme-level pitch and energy contours are adopted as fine-grained speaking style descriptors, which are predicted from text using the T2S model. A novel pretrain-refinement method is adopted to learn a robust T2S model by only using readily accessible low-quality data. The S2W model is trained with high-quality target data, which is adopted to effectively aggregate style descriptors and generate high-fidelity speech in the target speaker's voice. Experimental results are presented, showing that Referee outperforms a global-style-token (GST)-based baseline approach in CSST.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
01/30/2021

Expressive Neural Voice Cloning

Voice cloning is the task of learning to synthesize the voice of an unse...
research
07/27/2021

Cross-speaker Style Transfer with Prosody Bottleneck in Neural Speech Synthesis

Cross-speaker style transfer is crucial to the applications of multi-sty...
research
08/10/2023

EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis

Recent work has shown that it is possible to resynthesize high-quality s...
research
08/04/2020

Expressive TTS Training with Frame and Style Reconstruction Loss

We propose a novel training strategy for Tacotron-based text-to-speech (...
research
01/19/2022

MHTTS: Fast multi-head text-to-speech for spontaneous speech with imperfect transcription

Neural network based end-to-end Text-to-Speech (TTS) has greatly improve...
research
08/28/2022

Training Text-To-Speech Systems From Synthetic Data: A Practical Approach For Accent Transfer Tasks

Transfer tasks in text-to-speech (TTS) synthesis - where one or more asp...
research
12/13/2022

Style-Label-Free: Cross-Speaker Style Transfer by Quantized VAE and Speaker-wise Normalization in Speech Synthesis

Cross-speaker style transfer in speech synthesis aims at transferring a ...

Please sign up or login with your details

Forgot password? Click here to reset