
Speech Representation Disentanglement with Adversarial Mutual Information Learning for One-shot Voice Conversion

by Sicheng Yang et al., Tsinghua University

One-shot voice conversion (VC), which uses only a single utterance of the target speaker as reference, has become a hot research topic. Existing works generally disentangle timbre, while information about pitch, rhythm and content remains mixed together. To perform one-shot VC effectively while further disentangling these speech components, we apply random resampling to the inputs of the pitch and content encoders, and use the variational contrastive log-ratio upper bound (vCLUB) of mutual information together with gradient-reversal-layer-based adversarial mutual information learning to ensure that each part of the latent space contains only the desired disentangled representation during training. Experiments on the VCTK dataset show that the model achieves state-of-the-art performance for one-shot VC in terms of naturalness and intelligibility. In addition, speech representation disentanglement lets us transfer the timbre, pitch and rhythm characteristics of one-shot VC separately. Our code, pre-trained models and demo are available at
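To illustrate the random-resampling idea mentioned in the abstract, below is a minimal, hedged sketch (not the authors' implementation): a frame sequence is split into random-length segments and each segment is stretched or squeezed by a random factor with nearest-neighbour interpolation, perturbing rhythm (and, applied to F0 contours, pitch dynamics) while leaving content recoverable. The function name `random_resample` and all parameter values are illustrative assumptions.

```python
import random

def random_resample(frames, min_seg=8, max_seg=32,
                    min_ratio=0.5, max_ratio=1.5, rng=None):
    """Segment-wise random resampling of a frame sequence (illustrative sketch).

    Splits `frames` into segments of random length and resamples each
    segment by a random ratio using nearest-neighbour index mapping.
    """
    rng = rng or random.Random(0)
    out = []
    i = 0
    while i < len(frames):
        # Take a segment of random length (the tail may be shorter).
        seg = frames[i:i + rng.randint(min_seg, max_seg)]
        i += len(seg)
        # Stretch or squeeze the segment by a random factor.
        ratio = rng.uniform(min_ratio, max_ratio)
        new_len = max(1, round(len(seg) * ratio))
        # Nearest-neighbour mapping from new positions back into the segment.
        out.extend(seg[min(len(seg) - 1, j * len(seg) // new_len)]
                   for j in range(new_len))
    return out

# Example: perturb a 100-frame sequence; frame order is preserved,
# but local durations change randomly.
perturbed = random_resample(list(range(100)))
```

In the paper's setting such a transform would be applied to encoder inputs so that the pitch and content branches cannot rely on the original rhythm, which is what forces the disentanglement the adversarial mutual-information objectives then enforce.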




Related Papers

- StyleTTS-VC: One-Shot Voice Conversion by Knowledge Transfer from Style-Based TTS Models
- AVQVC: One-shot Voice Conversion by Vector Quantization with Applying Contrastive Learning
- Disentangled Speech Representation Learning for One-Shot Cross-lingual Voice Conversion Using β-VAE
- Disentangled Speaker Representation Learning via Mutual Information Minimization
- Unsupervised Speech Decomposition via Triple Information Bottleneck
- Mutual Information Maximization for Effective Lip Reading

Code Repositories


Speech Representation Disentanglement with Adversarial Mutual Information Learning for One-shot Voice Conversion (Interspeech 2022)
