Subband-based Generative Adversarial Network for Non-parallel Many-to-many Voice Conversion

07/13/2022
by Jian Ma, et al.

Voice conversion aims to generate new speech that combines the content of a source utterance with the voice style of a target speaker. In this paper, we focus on one general setting, i.e., non-parallel many-to-many voice conversion, which is close to the real-world scenario. As the name implies, non-parallel many-to-many voice conversion does not require paired source and reference speech and can be applied to arbitrary voice transfer. In recent years, Generative Adversarial Networks (GANs) and other techniques such as Conditional Variational Autoencoders (CVAEs) have made considerable progress in this field. However, due to the sophistication of voice conversion, the style similarity of the converted speech is still unsatisfactory. Inspired by the inherent structure of the mel-spectrogram, we propose a new voice conversion framework, the Subband-based Generative Adversarial Network for Voice Conversion (SGAN-VC). SGAN-VC converts each subband of the source speech separately by explicitly exploiting the spatial characteristics between different subbands. SGAN-VC consists of one style encoder, one content encoder, and one decoder. In particular, the style encoder learns style codes for the different subbands of the target speaker, the content encoder captures the content information of the source speech, and the decoder generates the content of each subband. In addition, we propose a pitch-shift module to fine-tune the pitch of the source speaker, making the converted tone more accurate and explainable. Extensive experiments demonstrate that the proposed approach achieves state-of-the-art performance on the VCTK Corpus and AISHELL3 datasets, both qualitatively and quantitatively, on seen as well as unseen data. Furthermore, the content intelligibility of SGAN-VC on unseen data even exceeds that of StarGANv2-VC with ASR network assistance.
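To make the subband idea concrete, the sketch below shows how a mel-spectrogram can be split along the frequency axis, encoded for content, assigned a per-subband style code from a reference utterance, and decoded band by band. This is a minimal illustrative sketch only: the number of subbands (4), the 80-bin mel resolution, the layer sizes, and the module names (ContentEncoder, StyleEncoder, SubbandDecoder) are assumptions for demonstration, not the paper's actual architecture, and the adversarial training and pitch-shift module are omitted.

```python
# Minimal sketch of subband-wise conversion, assuming 4 equal-width subbands
# of an 80-bin mel-spectrogram. Layer sizes and module names are illustrative.
import torch
import torch.nn as nn


class ContentEncoder(nn.Module):
    """Extracts a content feature map from the source mel-spectrogram."""
    def __init__(self, channels=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        )

    def forward(self, mel):                  # mel: (B, 1, 80, T)
        return self.net(mel)                 # content map: (B, C, 80, T)


class StyleEncoder(nn.Module):
    """Produces one style code per subband of the reference mel."""
    def __init__(self, n_subbands=4, style_dim=64):
        super().__init__()
        self.n_subbands = n_subbands
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, style_dim),
        )

    def forward(self, ref_mel):              # ref_mel: (B, 1, 80, T_ref)
        bands = torch.chunk(ref_mel, self.n_subbands, dim=2)
        return [self.net(b) for b in bands]  # list of (B, style_dim)


class SubbandDecoder(nn.Module):
    """Generates each mel subband from the matching content slice and the
    target speaker's style code for that subband, then reassembles them."""
    def __init__(self, n_subbands=4, channels=64, style_dim=64):
        super().__init__()
        self.n_subbands = n_subbands
        self.decoders = nn.ModuleList([
            nn.Conv2d(channels + style_dim, 1, 3, padding=1)
            for _ in range(n_subbands)
        ])

    def forward(self, content, styles):      # content: (B, C, 80, T)
        bands = torch.chunk(content, self.n_subbands, dim=2)
        outs = []
        for band, style, dec in zip(bands, styles, self.decoders):
            s = style[:, :, None, None].expand(-1, -1, band.size(2), band.size(3))
            outs.append(dec(torch.cat([band, s], dim=1)))
        return torch.cat(outs, dim=2)         # converted mel: (B, 1, 80, T)


# Toy forward pass: convert source content to the target speaker's style.
src = torch.randn(2, 1, 80, 128)   # source mel-spectrogram
ref = torch.randn(2, 1, 80, 96)    # reference mel from the target speaker
content = ContentEncoder()(src)
styles = StyleEncoder()(ref)
converted = SubbandDecoder()(content, styles)
print(converted.shape)             # torch.Size([2, 1, 80, 128])
```

Splitting along the frequency axis is what lets each subband receive its own style code, which is the property the abstract highlights; a real system would also need a vocoder and the GAN losses described in the paper.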

Related research

- F0-consistent many-to-many non-parallel voice conversion via conditional autoencoder (04/15/2020)
- VAW-GAN for Singing Voice Conversion with Non-parallel Training Data (08/10/2020)
- Stylebook: Content-Dependent Speaking Style Modeling for Any-to-Any Voice Conversion using Only Speech Data (09/06/2023)
- TriAAN-VC: Triple Adaptive Attention Normalization for Any-to-Any Voice Conversion (03/16/2023)
- Dictionary Update for NMF-based Voice Conversion Using an Encoder-Decoder Network (10/13/2016)
- Speech-to-Singing Conversion based on Boundary Equilibrium GAN (05/28/2020)
- AI based Presentation Creator With Customized Audio Content Delivery (06/27/2021)