MaskCycleGAN-VC: Learning Non-parallel Voice Conversion with Filling in Frames

02/25/2021
by   Takuhiro Kaneko, et al.
0

Non-parallel voice conversion (VC) is a technique for training voice converters without a parallel corpus. Cycle-consistent adversarial network-based VCs (CycleGAN-VC and CycleGAN-VC2) are widely accepted as benchmark methods. However, owing to their insufficient ability to grasp time-frequency structures, their application is limited to mel-cepstrum conversion and not mel-spectrogram conversion despite recent advances in mel-spectrogram vocoders. To overcome this, CycleGAN-VC3, an improved variant of CycleGAN-VC2 that incorporates an additional module called time-frequency adaptive normalization (TFAN), has been proposed. However, an increase in the number of learned parameters is imposed. As an alternative, we propose MaskCycleGAN-VC, which is another extension of CycleGAN-VC2 and is trained using a novel auxiliary task called filling in frames (FIF). With FIF, we apply a temporal mask to the input mel-spectrogram and encourage the converter to fill in missing frames based on surrounding frames. This task allows the converter to learn time-frequency structures in a self-supervised manner and eliminates the need for an additional module such as TFAN. A subjective evaluation of the naturalness and speaker similarity showed that MaskCycleGAN-VC outperformed both CycleGAN-VC2 and CycleGAN-VC3 with a model size similar to that of CycleGAN-VC2. Audio samples are available at http://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/maskcyclegan-vc/index.html.

READ FULL TEXT

page 1

page 2

page 3

page 4

page 5

research
10/22/2020

CycleGAN-VC3: Examining and Improving CycleGAN-VCs for Mel-spectrogram Conversion

Non-parallel voice conversion (VC) is a technique for learning mappings ...
research
06/10/2023

Vocoder-Free Non-Parallel Conversion of Whispered Speech With Masked Cycle-Consistent Generative Adversarial Networks

Cycle-consistent generative adversarial networks have been widely used i...
research
07/29/2019

StarGAN-VC2: Rethinking Conditional Methods for StarGAN-Based Voice Conversion

Non-parallel multi-domain voice conversion (VC) is a technique for learn...
research
04/02/2018

High-quality nonparallel voice conversion based on cycle-consistent adversarial network

Although voice conversion (VC) algorithms have achieved remarkable succe...
research
10/22/2020

Towards Low-Resource StarGAN Voice Conversion using Weight Adaptive Instance Normalization

Many-to-many voice conversion with non-parallel training data has seen s...
research
04/09/2019

CycleGAN-VC2: Improved CycleGAN-based Non-parallel Voice Conversion

Non-parallel voice conversion (VC) is a technique for learning the mappi...
research
04/15/2021

Towards end-to-end F0 voice conversion based on Dual-GAN with convolutional wavelet kernels

This paper presents a end-to-end framework for the F0 transformation in ...

Please sign up or login with your details

Forgot password? Click here to reset