Multi-target Voice Conversion without Parallel Data by Adversarially Learning Disentangled Audio Representations

04/09/2018 · by Ju-chieh Chou, et al.

Recently, the cycle-consistent adversarial network (Cycle-GAN) has been successfully applied to voice conversion without parallel data, although those approaches require an individual model for each target speaker. In this paper, we propose an adversarial learning framework for voice conversion with which a single model can be trained to convert the voice to many different speakers, all without parallel data, by separating the speaker characteristics from the linguistic content in speech signals. An autoencoder is first trained to extract a speaker-independent latent representation and a speaker embedding separately, with an auxiliary speaker classifier regularizing the latent representation. The decoder then takes the speaker-independent latent representation and the target speaker embedding as input to generate the voice of the target speaker carrying the linguistic content of the source utterance. The quality of the decoder output is further improved by patching it with a residual signal produced by an additional generator-discriminator pair. Preliminary experiments with a set of 20 target speakers yielded very good voice quality, and conventional voice conversion metrics are reported. We also show that the speaker information has been properly removed from the latent representations.
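To make the described architecture concrete, below is a minimal PyTorch sketch of the disentangling autoencoder stage. It assumes mel-spectrogram inputs and uses a gradient-reversal layer as one common way to realize the adversarial speaker classifier; all layer sizes and module names are illustrative assumptions rather than the authors' exact design, and the second-stage residual generator and discriminator described in the abstract are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips the gradient sign in the
    backward pass, so the encoder learns to fool the speaker classifier."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -grad

class DisentangledVC(nn.Module):
    # Hypothetical sizes: 80 mel bins, 128-dim latent, 20 target speakers.
    def __init__(self, n_mels=80, latent_dim=128, n_speakers=20, emb_dim=64):
        super().__init__()
        # Content encoder: frames -> (ideally) speaker-independent latents.
        self.encoder = nn.Sequential(
            nn.Linear(n_mels, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # Auxiliary speaker classifier regularizing the latent space.
        self.speaker_clf = nn.Linear(latent_dim, n_speakers)
        # One learned embedding per target speaker.
        self.spk_emb = nn.Embedding(n_speakers, emb_dim)
        # Decoder: content latent + target speaker embedding -> spectrogram.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + emb_dim, 256), nn.ReLU(),
            nn.Linear(256, n_mels),
        )

    def forward(self, mel, target_spk):
        z = self.encoder(mel)                                # (B, T, D)
        spk_logits = self.speaker_clf(GradReverse.apply(z))  # adversarial
        emb = self.spk_emb(target_spk)                       # (B, E)
        emb = emb.unsqueeze(1).expand(-1, mel.size(1), -1)   # (B, T, E)
        out = self.decoder(torch.cat([z, emb], dim=-1))      # (B, T, mels)
        return out, spk_logits

# Training-style usage: reconstruct the source speaker's utterance while
# penalizing any speaker identity still readable from the latent code.
model = DisentangledVC()
mel = torch.randn(4, 100, 80)      # (batch, frames, mel bins)
spk = torch.randint(0, 20, (4,))   # source speaker ids
recon, spk_logits = model(mel, spk)
loss = F.mse_loss(recon, mel) + F.cross_entropy(spk_logits.mean(dim=1), spk)
loss.backward()
# At conversion time, a different target_spk id is fed to the decoder.
```

The gradient-reversal trick stands in for whatever adversarial training schedule the paper actually uses (for example, alternating classifier and encoder updates); the key point illustrated is that the classifier's loss pushes speaker identity out of the latent code while the decoder re-injects it via the embedding.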

Related research

Recognition-Synthesis Based Non-Parallel Voice Conversion with Adversarial Learning (08/05/2020)
This paper presents an adversarial learning method for recognition-synth...

Preliminary study on using vector quantization latent spaces for TTS/VC systems with consistent performance (06/25/2021)
Generally speaking, the main objective when training a neural speech syn...

Rhythm-Flexible Voice Conversion without Parallel Data Using Cycle-GAN over Phoneme Posteriorgram Sequences (08/09/2018)
Speaking rate refers to the average number of phonemes within some unit ...

Enhanced exemplar autoencoder with cycle consistency loss in any-to-one voice conversion (04/08/2022)
Recent research showed that an autoencoder trained with speech of a sing...

Exploring Disentanglement with Multilingual and Monolingual VQ-VAE (05/04/2021)
This work examines the content and usefulness of disentangled phone and ...

FragmentVC: Any-to-Any Voice Conversion by End-to-End Extracting and Fusing Fine-Grained Voice Fragments With Attention (10/27/2020)
Any-to-any voice conversion aims to convert the voice from and to any sp...

Any-to-One Sequence-to-Sequence Voice Conversion using Self-Supervised Discrete Speech Representations (10/23/2020)
We present a novel approach to any-to-one (A2O) voice conversion (VC) in...
