Interpretable Style Transfer for Text-to-Speech with ControlVAE and Diffusion Bridge

06/07/2023
by Wenhao Guan, et al.

With the growing demand for autonomous control and personalized speech generation, style control and transfer in text-to-speech (TTS) are becoming increasingly important. In this paper, we propose a new TTS system that performs style transfer with interpretability and high fidelity. First, we design a TTS system that combines a variational autoencoder (VAE) with a diffusion refiner to produce refined mel-spectrograms; specifically, we design a two-stage system and a one-stage system to improve audio quality and style-transfer performance, respectively. Second, we design a diffusion bridge over the quantized VAE to efficiently learn complex discrete style representations and further improve style transfer. Finally, to strengthen style transfer, we introduce ControlVAE, which improves reconstruction quality while providing good interpretability. Experiments on the LibriTTS dataset demonstrate that our method is more effective than the baseline models.
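The ControlVAE component mentioned above replaces the fixed KL weight of a standard VAE objective with a feedback controller that keeps the observed KL divergence near a chosen set point, which is what gives the latent space its tunable, interpretable behavior. The sketch below illustrates that idea with a simple PI controller over the KL weight; it is a minimal illustration of the general ControlVAE mechanism, and the class name and all hyperparameter values are assumptions, not the paper's settings.

```python
import math


class PIControllerBeta:
    """PI controller that adapts the KL weight (beta) so the observed KL
    divergence tracks a desired set point, in the spirit of ControlVAE.
    All hyperparameter values here are illustrative, not the paper's."""

    def __init__(self, kl_target=2.0, kp=0.01, ki=0.0001,
                 beta_min=0.0, beta_max=1.0):
        self.kl_target = kl_target
        self.kp = kp
        self.ki = ki
        self.beta_min = beta_min
        self.beta_max = beta_max
        self.integral = 0.0  # accumulated integral (I) term

    def update(self, kl_observed):
        # Positive error => KL is below target => lower beta so KL can grow;
        # negative error => KL is above target => raise beta to penalize it.
        error = self.kl_target - kl_observed
        # Proportional term squashed into (0, kp) so beta changes smoothly.
        p_term = self.kp / (1.0 + math.exp(error))
        # Integral term with simple clamping as anti-windup (an assumption
        # made here to keep the sketch stable).
        self.integral = min(max(self.integral + self.ki * error, -1.0), 1.0)
        beta = p_term - self.integral + self.beta_min
        return min(max(beta, self.beta_min), self.beta_max)


if __name__ == "__main__":
    # Mock per-batch KL values standing in for a real training loop, where
    # the returned beta would weight the KL term: loss = recon + beta * kl.
    controller = PIControllerBeta(kl_target=2.0)
    for kl in [5.0, 3.0, 2.2, 1.8]:
        print(f"KL={kl:.1f} -> beta={controller.update(kl):.4f}")
```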


Related research

12/06/2021 · VAE based Text Style Transfer with Pivot Words Enhancement Learning
Text Style Transfer (TST) aims to alter the underlying style of the sour...

08/15/2023 · StyleDiffusion: Controllable Disentangled Style Transfer via Diffusion Models
Content and style (C-S) disentanglement is a fundamental problem and cri...

12/11/2018 · Learning latent representations for style control and transfer in end-to-end speech synthesis
In this paper, we introduce the Variational Autoencoder (VAE) to an end-...

11/04/2022 · NoreSpeech: Knowledge Distillation based Conditional Diffusion Model for Noise-robust Expressive TTS
Expressive text-to-speech (TTS) can synthesize a new speaking style by i...

12/13/2022 · Style-Label-Free: Cross-Speaker Style Transfer by Quantized VAE and Speaker-wise Normalization in Speech Synthesis
Cross-speaker style transfer in speech synthesis aims at transferring a ...

02/10/2021 · Self-Supervised VQ-VAE For One-Shot Music Style Transfer
Neural style transfer, allowing to apply the artistic style of one image...

05/21/2018 · Invariant Representations from Adversarially Censored Autoencoders
We combine conditional variational autoencoders (VAE) with adversarial c...
