Improving Prosody for Cross-Speaker Style Transfer by Semi-Supervised Style Extractor and Hierarchical Modeling in Speech Synthesis

03/14/2023
by   Chunyu Qiang, et al.

Cross-speaker style transfer in speech synthesis aims to transfer a style from a source speaker to synthesized speech in a target speaker's timbre. In most previous methods, the synthesized fine-grained prosody features often represent the source speaker's average style, which is similar to the one-to-many problem (i.e., multiple prosody variations correspond to the same text). In response to this problem, a strength-controlled semi-supervised style extractor is proposed to disentangle the style from content and timbre, improving the representation and interpretability of the global style embedding, which can alleviate the one-to-many mapping and data imbalance problems in prosody prediction. A hierarchical prosody predictor is proposed to improve prosody modeling. We find that better style transfer can be achieved by using the source speaker's prosody features that are easily predicted. Additionally, a speaker-transfer-wise cycle consistency loss is proposed to assist the model in learning unseen style-timbre combinations during the training phase. Experimental results show that the method outperforms the baseline. We provide a website with audio samples.
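
The abstract does not spell out how the speaker-transfer-wise cycle consistency loss is computed. As a rough illustration only, the generic cycle-consistency idea could be sketched as follows: extract a style embedding from the source speech, synthesize with a different (target) speaker, re-extract the style from the result, and penalize the difference. Every name below (`cycle_consistency_loss`, `toy_extract_style`, `toy_synthesize_clean`, `toy_synthesize_leaky`) is hypothetical, and the toy extractor and synthesizers are stand-ins for neural networks, not the authors' models.

```python
from typing import Callable, List, Sequence

Frames = List[List[float]]  # acoustic frames, one feature vector per frame
Style = List[float]         # a single global style embedding


def mean_abs_error(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference between two embeddings of equal size."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cycle_consistency_loss(
    extract_style: Callable[[Frames], Style],
    synthesize: Callable[[str, Style, int], Frames],
    source_frames: Frames,
    text: str,
    target_speaker: int,
) -> float:
    """Assumed form of the loss: style -> transferred speech -> style again."""
    style = extract_style(source_frames)                   # style of source speech
    transferred = synthesize(text, style, target_speaker)  # target speaker's timbre
    recovered = extract_style(transferred)                 # style after the transfer
    return mean_abs_error(style, recovered)


# --- Toy stand-ins (hypothetical, for illustration only) ---
def toy_extract_style(frames: Frames) -> Style:
    """Per-dimension mean over frames, standing in for a style extractor."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def toy_synthesize_clean(text: str, style: Style, speaker_id: int) -> Frames:
    """One frame per character; the speaker identity does not touch the style."""
    return [list(style) for _ in text]


def toy_synthesize_leaky(text: str, style: Style, speaker_id: int) -> Frames:
    """Same, but the speaker identity leaks into the style dimensions."""
    return [[s + 0.1 * speaker_id for s in style] for _ in text]


source = [[0.2, 0.8], [0.4, 0.6]]
loss_clean = cycle_consistency_loss(
    toy_extract_style, toy_synthesize_clean, source, "hello", 2
)
loss_leaky = cycle_consistency_loss(
    toy_extract_style, toy_synthesize_leaky, source, "hello", 2
)
```

In this toy setup `loss_clean` is zero up to floating-point rounding, because the style survives the speaker change, while `loss_leaky` is about 0.2, because the speaker offset contaminates the recovered style. A loss of this shape can be computed for style-timbre pairs that never co-occur in the training data, which is the motivation the abstract gives for it.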

research
12/13/2022

Style-Label-Free: Cross-Speaker Style Transfer by Quantized VAE and Speaker-wise Normalization in Speech Synthesis

Cross-speaker style transfer in speech synthesis aims at transferring a ...
research
07/27/2021

Cross-speaker Style Transfer with Prosody Bottleneck in Neural Speech Synthesis

Cross-speaker style transfer is crucial to the applications of multi-sty...
research
11/08/2020

Fine-grained style modelling and transfer in text-to-speech synthesis via content-style disentanglement

This paper presents a novel neural model for fine-grained style modeling...
research
11/12/2020

Hierarchical Prosody Modeling for Non-Autoregressive Speech Synthesis

Prosody modeling is an essential component in modern text-to-speech (TTS...
research
01/24/2022

Disentangling Style and Speaker Attributes for TTS Style Transfer

End-to-end neural TTS has shown improved performance in speech style tra...
research
06/18/2021

Improving Performance of Seen and Unseen Speech Style Transfer in End-to-end Neural TTS

End-to-end neural TTS training has shown improved performance in speech ...
research
08/31/2023

Audio-Driven Dubbing for User Generated Contents via Style-Aware Semi-Parametric Synthesis

Existing automated dubbing methods are usually designed for Professional...
