DDDM-VC: Decoupled Denoising Diffusion Models with Disentangled Representation and Prior Mixup for Verified Robust Voice Conversion

05/25/2023
by   Ha-Yeong Choi, et al.
0

Diffusion-based generative models have exhibited powerful generative performance in recent years. However, as many attributes exist in the data distribution and owing to several limitations of sharing the model parameters across all levels of the generation process, it remains challenging to control specific styles for each attribute. To address the above problem, this paper presents decoupled denoising diffusion models (DDDMs) with disentangled representations, which can control the style for each attribute in generative models. We apply DDDMs to voice conversion (VC) tasks to address the challenges of disentangling and controlling each speech attribute (e.g., linguistic information, intonation, and timbre). First, we use a self-supervised representation to disentangle the speech representation. Subsequently, the DDDMs are applied to resynthesize the speech from the disentangled representations for denoising with respect to each attribute. Moreover, we also propose the prior mixup for robust voice style transfer, which uses the converted representation of the mixed style as a prior distribution for the diffusion models. The experimental results reveal that our method outperforms publicly available VC models. Furthermore, we show that our method provides robust generative performance regardless of the model size. Audio samples are available https://hayeong0.github.io/DDDM-VC-demo/.

READ FULL TEXT
research
01/30/2021

Adversarially learning disentangled speech representations for robust multi-factor voice conversion

Factorizing speech as disentangled speech representations is vital to ac...
research
08/28/2023

Voice Conversion with Denoising Diffusion Probabilistic GAN Models

Voice conversion is a method that allows for the transformation of speak...
research
08/15/2023

StyleDiffusion: Controllable Disentangled Style Transfer via Diffusion Models

Content and style (C-S) disentanglement is a fundamental problem and cri...
research
02/22/2022

DRVC: A Framework of Any-to-Any Voice Conversion with Self-Supervised Learning

Any-to-any voice conversion problem aims to convert voices for source an...
research
01/18/2021

Hierarchical disentangled representation learning for singing voice conversion

Conventional singing voice conversion (SVC) methods often suffer from op...
research
05/28/2021

DiffSVC: A Diffusion Probabilistic Model for Singing Voice Conversion

Singing voice conversion (SVC) is one promising technique which can enri...
research
03/24/2022

Learning Disentangled Representation for One-shot Progressive Face Swapping

Although face swapping has attracted much attention in recent years, it ...

Please sign up or login with your details

Forgot password? Click here to reset