VQVC+: One-Shot Voice Conversion by Vector Quantization and U-Net architecture

06/07/2020
by   Da-Yi Wu, et al.
0

Voice conversion (VC) is a task that transforms the source speaker's timbre, accent, and tones in audio into another one's while preserving the linguistic content. It is still a challenging work, especially in a one-shot setting. Auto-encoder-based VC methods disentangle the speaker and the content in input speech without given the speaker's identity, so these methods can further generalize to unseen speakers. The disentangle capability is achieved by vector quantization (VQ), adversarial training, or instance normalization (IN). However, the imperfect disentanglement may harm the quality of output speech. In this work, to further improve audio quality, we use the U-Net architecture within an auto-encoder-based VC system. We find that to leverage the U-Net architecture, a strong information bottleneck is necessary. The VQ-based method, which quantizes the latent vectors, can serve the purpose. The objective and the subjective evaluations show that the proposed method performs well in both audio naturalness and speaker similarity.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
10/31/2020

AGAIN-VC: A One-shot Voice Conversion using Activation Guidance and Adaptive Instance Normalization

Recently, voice conversion (VC) has been widely studied. Many VC systems...
research
06/02/2021

NVC-Net: End-to-End Adversarial Voice Conversion

Voice conversion has gained increasing popularity in many applications o...
research
05/28/2019

Unsupervised End-to-End Learning of Discrete Linguistic Units for Voice Conversion

We present an unsupervised end-to-end training scheme where we discover ...
research
06/28/2022

A Hierarchical Speaker Representation Framework for One-shot Singing Voice Conversion

Typically, singing voice conversion (SVC) depends on an embedding vector...
research
03/28/2019

Adversarial Approximate Inference for Speech to Electroglottograph Conversion

Speech produced by human vocal apparatus conveys substantial non-semanti...
research
06/25/2021

Preliminary study on using vector quantization latent spaces for TTS/VC systems with consistent performance

Generally speaking, the main objective when training a neural speech syn...
research
04/08/2022

Analysis and transformations of intensity in singing voice

In this paper we introduce a neural auto-encoder that transforms the voi...

Please sign up or login with your details

Forgot password? Click here to reset