F0-consistent many-to-many non-parallel voice conversion via conditional autoencoder

04/15/2020
by Kaizhi Qian, et al.

Non-parallel many-to-many voice conversion remains an interesting but challenging speech processing task. Many style-transfer-inspired methods, such as generative adversarial networks (GANs) and variational autoencoders (VAEs), have been proposed. Recently, AutoVC, a method based on conditional autoencoders (CAEs), achieved state-of-the-art results by disentangling speaker identity from speech content using information-constraining bottlenecks, and it achieves zero-shot conversion by swapping in a different speaker's identity embedding to synthesize a new voice. However, we found that while speaker identity is disentangled from speech content, a significant amount of prosodic information, such as the source F0, leaks through the bottleneck, causing the target F0 to fluctuate unnaturally. Furthermore, AutoVC has no control over the converted F0 and is thus unsuitable for many applications. In this paper, we modify and improve autoencoder-based voice conversion to disentangle content, F0, and speaker identity at the same time. As a result, we can control the F0 contour, generate speech with F0 consistent with the target speaker, and significantly improve quality and similarity. We support our improvement through quantitative and qualitative analysis.
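To make the idea of an F0 contour that is "consistent with the target speaker" concrete, the sketch below maps a source contour into a target speaker's range using log-domain mean and variance normalization, a common baseline in voice conversion. This is an illustrative assumption, not the method of the paper: the abstract does not specify how F0 is represented or conditioned on, and the function names `log_f0_stats` and `convert_f0`, as well as the convention that unvoiced frames are encoded as 0, are hypothetical.

```python
import math


def log_f0_stats(f0):
    """Return (mean, std) of log-F0 over voiced frames (f0 > 0).

    Assumption: unvoiced frames are encoded as 0 and are ignored.
    """
    logs = [math.log(v) for v in f0 if v > 0]
    if not logs:
        raise ValueError("contour has no voiced frames")
    mean = sum(logs) / len(logs)
    var = sum((x - mean) ** 2 for x in logs) / len(logs)
    return mean, math.sqrt(var)


def convert_f0(f0_src, src_stats, tgt_stats):
    """Map a source F0 contour (Hz) into the target speaker's log-F0 range.

    Each voiced frame is z-normalized with the source statistics and then
    rescaled with the target statistics; unvoiced frames stay at 0.
    """
    mu_s, sd_s = src_stats
    mu_t, sd_t = tgt_stats
    out = []
    for v in f0_src:
        if v <= 0:
            out.append(0.0)  # keep unvoiced frames unvoiced
        else:
            # Guard against a flat source contour (zero variance).
            z = (math.log(v) - mu_s) / sd_s if sd_s > 0 else 0.0
            out.append(math.exp(z * sd_t + mu_t))
    return out


# Hypothetical contours in Hz; 0 marks an unvoiced frame.
source = [100.0, 0.0, 120.0, 110.0]
target_reference = [200.0, 220.0, 240.0]
converted = convert_f0(source, log_f0_stats(source), log_f0_stats(target_reference))
```

The converted contour keeps the shape of the source in the normalized log domain while matching the target speaker's log-F0 mean and spread. A model that leaks source F0 through its bottleneck would instead tend to carry the source range into the output.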

