Robust Disentangled Variational Speech Representation Learning for Zero-shot Voice Conversion

03/30/2022
by Jiachen Lian, et al.

Traditional studies on voice conversion (VC) have made progress with parallel training data and known speakers, obtaining good conversion quality by exploring better alignment modules or more expressive mapping functions. In this study, we investigate zero-shot VC from a novel perspective of self-supervised disentangled speech representation learning. Specifically, we achieve the disentanglement by balancing the information flow between a global speaker representation and a time-varying content representation in a sequential variational autoencoder (VAE). Zero-shot voice conversion is performed by feeding an arbitrary speaker embedding and content embeddings to the VAE decoder. In addition, an on-the-fly data augmentation training strategy is applied to make the learned representation noise-invariant. On the TIMIT and VCTK datasets, we achieve state-of-the-art performance in both objective evaluation (speaker verification (SV) on the speaker embedding and content embedding) and subjective evaluation (voice naturalness and similarity), and the method remains robust even with noisy source/target utterances.
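As a rough illustration of the conversion step described above, the following sketch pairs each time-varying content embedding from a source utterance with a single global speaker embedding from a target speaker and passes the result through a decoder. It is a toy under stated assumptions, not the authors' model: the decoder is a fixed random linear map standing in for the trained VAE decoder, the embeddings are combined by simple concatenation, and all names and dimensions (`make_toy_decoder`, `convert`, the 3-dimensional content and 2-dimensional speaker vectors) are hypothetical.

```python
import random
from typing import Callable, List, Sequence

Vector = List[float]


def make_toy_decoder(in_dim: int, out_dim: int, seed: int = 0) -> Callable[[Vector], Vector]:
    """Hypothetical stand-in for a trained decoder: a fixed random linear map."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]

    def decode(x: Vector) -> Vector:
        # One output value per weight row (plain matrix-vector product).
        return [sum(w * v for w, v in zip(row, x)) for row in weights]

    return decode


def convert(content: Sequence[Vector], speaker: Vector,
            decode: Callable[[Vector], Vector]) -> List[Vector]:
    """Decode every content frame together with the same global speaker vector.

    Concatenation is an assumption made for this sketch; the source only says
    that both embeddings are fed to the decoder.
    """
    return [decode(list(frame) + list(speaker)) for frame in content]


# Toy inputs: 4 content frames (dim 3) from a source utterance and two
# utterance-level speaker embeddings (dim 2). Values are made up.
source_content = [[0.1, 0.4, -0.2], [0.0, 0.3, 0.5], [-0.6, 0.2, 0.1], [0.7, -0.1, 0.0]]
target_speaker = [0.5, -0.2]
other_speaker = [-0.3, 0.9]

decode = make_toy_decoder(in_dim=3 + 2, out_dim=5)
converted = convert(source_content, target_speaker, decode)
converted_other = convert(source_content, other_speaker, decode)
```

Because the speaker vector is shared across all frames while the content vectors change per frame, swapping `target_speaker` for `other_speaker` changes the decoded frames without touching the content sequence, which is the property zero-shot conversion relies on.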

research
03/18/2022

DGC-vector: A new speaker embedding for zero-shot voice conversion

Recently, more and more zero-shot voice conversion algorithms have been ...
research
03/17/2021

Improving Zero-shot Voice Style Transfer via Disentangled Representation Learning

Voice style transfer, also called voice conversion, seeks to modify one ...
research
10/27/2021

Zero-shot Voice Conversion via Self-supervised Prosody Representation Learning

Voice Conversion (VC) for unseen speakers, also known as zero-shot VC, i...
research
05/11/2022

Towards Improved Zero-shot Voice Conversion with Conditional DSVAE

Disentangling content and speaking style information is essential for ze...
research
03/30/2022

Enhancing Zero-Shot Many to Many Voice Conversion with Self-Attention VAE

Variational auto-encoder(VAE) is an effective neural network architectur...
research
09/09/2022

DeID-VC: Speaker De-identification via Zero-shot Pseudo Voice Conversion

The widespread adoption of speech-based online services raises security ...
research
05/09/2023

Zero-shot personalized lip-to-speech synthesis with face image based voice control

Lip-to-Speech (Lip2Speech) synthesis, which predicts corresponding speec...
