Style-Label-Free: Cross-Speaker Style Transfer by Quantized VAE and Speaker-wise Normalization in Speech Synthesis

12/13/2022
by   Chunyu Qiang, et al.
0

Cross-speaker style transfer in speech synthesis aims at transferring a style from source speaker to synthesised speech of a target speaker's timbre. Most previous approaches rely on data with style labels, but manually-annotated labels are expensive and not always reliable. In response to this problem, we propose Style-Label-Free, a cross-speaker style transfer method, which can realize the style transfer from source speaker to target speaker without style labels. Firstly, a reference encoder structure based on quantized variational autoencoder (Q-VAE) and style bottleneck is designed to extract discrete style representations. Secondly, a speaker-wise batch normalization layer is proposed to reduce the source speaker leakage. In order to improve the style extraction ability of the reference encoder, a style invariant and contrastive data augmentation method is proposed. Experimental results show that the method outperforms the baseline. We provide a website with audio samples.

READ FULL TEXT

page 2

page 4

research
03/14/2023

Improving Prosody for Cross-Speaker Style Transfer by Semi-Supervised Style Extractor and Hierarchical Modeling in Speech Synthesis

Cross-speaker style transfer in speech synthesis aims at transferring a ...
research
07/27/2021

Cross-speaker Style Transfer with Prosody Bottleneck in Neural Speech Synthesis

Cross-speaker style transfer is crucial to the applications of multi-sty...
research
10/29/2019

Disentangling the Spatial Structure and Style in Conditional VAE

This paper aims to disentangle the latent space in cVAE into the spatial...
research
06/07/2023

Interpretable Style Transfer for Text-to-Speech with ControlVAE and Diffusion Bridge

With the demand for autonomous control and personalized speech generatio...
research
09/08/2021

Referee: Towards reference-free cross-speaker style transfer with low-quality data for expressive speech synthesis

Cross-speaker style transfer (CSST) in text-to-speech (TTS) synthesis ai...
research
11/14/2021

Meta-Voice: Fast few-shot style transfer for expressive voice cloning using meta learning

The task of few-shot style transfer for voice cloning in text-to-speech ...
research
09/14/2023

Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer

Direct speech-to-speech translation (S2ST) with discrete self-supervised...

Please sign up or login with your details

Forgot password? Click here to reset