Towards Improved Zero-shot Voice Conversion with Conditional DSVAE

05/11/2022
by   Jiachen Lian, et al.

Disentangling content and speaking style information is essential for zero-shot non-parallel voice conversion (VC). Our previous study investigated a novel framework with a disentangled sequential variational autoencoder (DSVAE) as the backbone for information decomposition, and demonstrated that simultaneously disentangling the content embedding and the speaker embedding from a single utterance is feasible for zero-shot VC. In this study, we continue in this direction by raising a concern about the prior distribution of the content branch in the DSVAE baseline. We find that the randomly initialized prior distribution forces the content embedding to lose phonetic-structure information during learning, which is not a desired property. Here, we seek a better content embedding that preserves more phonetic information. We propose the conditional DSVAE, a new model that introduces a content bias as a condition on the prior modeling and thereby reshapes the content embedding sampled from the posterior distribution. In experiments on the VCTK dataset, we demonstrate that content embeddings derived from the conditional DSVAE overcome the randomness issue and achieve much better phoneme classification accuracy, more stabilized vocalization, and better zero-shot VC performance than the competitive DSVAE baseline.
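To make the conditioning idea concrete, the snippet below is a minimal PyTorch-style sketch contrasting a randomly initialized (input-independent) content prior with a content prior conditioned on a frame-level content bias. The module and variable names (UnconditionalContentPrior, ConditionalContentPrior, content_bias, etc.) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class UnconditionalContentPrior(nn.Module):
    """Baseline DSVAE-style prior: an autoregressive prior over content
    latents that sees no information about the utterance."""
    def __init__(self, z_dim: int, hidden: int = 256):
        super().__init__()
        self.rnn = nn.LSTM(z_dim, hidden, batch_first=True)
        self.to_mu = nn.Linear(hidden, z_dim)
        self.to_logvar = nn.Linear(hidden, z_dim)

    def forward(self, z_prev: torch.Tensor):
        h, _ = self.rnn(z_prev)                 # (B, T, hidden)
        return self.to_mu(h), self.to_logvar(h)

class ConditionalContentPrior(nn.Module):
    """Conditional-DSVAE-style prior: a frame-level content bias is fed to
    the prior network, so the KL term no longer pushes phonetic structure
    out of the posterior content embedding."""
    def __init__(self, z_dim: int, bias_dim: int, hidden: int = 256):
        super().__init__()
        self.rnn = nn.LSTM(z_dim + bias_dim, hidden, batch_first=True)
        self.to_mu = nn.Linear(hidden, z_dim)
        self.to_logvar = nn.Linear(hidden, z_dim)

    def forward(self, z_prev: torch.Tensor, content_bias: torch.Tensor):
        h, _ = self.rnn(torch.cat([z_prev, content_bias], dim=-1))
        return self.to_mu(h), self.to_logvar(h)

def kl_to_prior(mu_q, logvar_q, mu_p, logvar_p):
    """Gaussian KL(q || p), averaged over batch and time, used as the
    content-branch regularizer in the ELBO."""
    var_q, var_p = logvar_q.exp(), logvar_p.exp()
    kl = 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return kl.mean()

# Usage sketch: content_bias stands in for any frame-aligned feature carrying
# phonetic structure; its exact source here is a placeholder assumption.
B, T, z_dim, bias_dim = 4, 100, 64, 128
z_prev = torch.zeros(B, T, z_dim)               # shifted posterior samples
content_bias = torch.randn(B, T, bias_dim)      # hypothetical frame-level bias
prior = ConditionalContentPrior(z_dim, bias_dim)
mu_p, logvar_p = prior(z_prev, content_bias)
```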


