Towards High-fidelity Singing Voice Conversion with Acoustic Reference and Contrastive Predictive Coding

10/10/2021
by Chao Wang, et al.

Methods based on phonetic posteriorgrams (PPGs) have recently become popular in non-parallel singing voice conversion systems. However, because PPGs lack acoustic information, the style and naturalness of the converted singing voices remain limited. To address this, we use an acoustic reference encoder to implicitly model singing characteristics. We experiment with different auxiliary features as the input of the reference encoder, including mel spectrograms, HuBERT features, and the middle hidden feature (PPG-Mid) of a pretrained automatic speech recognition (ASR) model, and find that the HuBERT feature is the best choice. In addition, we use a contrastive predictive coding (CPC) module to further smooth the voices by predicting future observations in latent space. Experiments show that, compared with the baseline models, our proposed model significantly improves the naturalness of the converted singing voices and their similarity to the target singer. Moreover, our proposed model can also make speakers with only speech data sing.
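To make the CPC idea concrete, the following is a minimal sketch of the InfoNCE objective that CPC is commonly trained with. It is a generic illustration and not the authors' implementation: the function name `info_nce_loss`, the dot-product scoring, and the use of the other time steps as negatives are all assumptions made for this example.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def info_nce_loss(predictions, targets):
    """Mean InfoNCE loss over time steps (hypothetical helper).

    predictions[t] is a predicted future latent for step t and targets[t]
    is the true latent. For each t, targets[t] is the positive sample and
    every other target is treated as a negative.
    """
    total = 0.0
    for t, pred in enumerate(predictions):
        scores = [dot(pred, z) for z in targets]
        # Numerically stable log-sum-exp over all candidate latents.
        m = max(scores)
        log_denominator = m + math.log(sum(math.exp(s - m) for s in scores))
        # Negative log-probability of picking the true latent.
        total += log_denominator - scores[t]
    return total / len(predictions)


# Toy latents: four one-hot vectors standing in for encoder outputs.
latents = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
aligned = [[5.0 * v for v in row] for row in latents]  # good predictions
shifted = aligned[1:] + aligned[:1]                    # predictions off by one step

print(info_nce_loss(aligned, latents))
print(info_nce_loss(shifted, latents))
```

In this toy setting the aligned predictions score a lower loss than the shifted ones, which is the behaviour the objective rewards: a prediction should be closer to its own future latent than to latents from other time steps.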


research
09/15/2022

Non-Parallel Voice Conversion for ASR Augmentation

Automatic speech recognition (ASR) needs to be robust to speaker differe...
research
10/16/2018

Sequence-to-Sequence Acoustic Modeling for Voice Conversion

In this paper, a neural network named Sequence-to-sequence ConvErsion N...
research
03/11/2019

Singing voice conversion with non-parallel data

Singing voice conversion is a task to convert a song sung by a source si...
research
07/02/2022

Learning Noise-independent Speech Representation for High-quality Voice Conversion for Noisy Target Speakers

Building a voice conversion system for noisy target speakers, such as us...
research
03/26/2020

Non-parallel Voice Conversion System with WaveNet Vocoder and Collapsed Speech Suppression

In this paper, we integrate a simple non-parallel voice conversion (VC) ...
research
08/07/2020

Pretraining Techniques for Sequence-to-Sequence Voice Conversion

Sequence-to-sequence (seq2seq) voice conversion (VC) models are attracti...
research
05/04/2022

Zero-Episode Few-Shot Contrastive Predictive Coding: Solving intelligence tests without prior training

Video prediction models often combine three components: an encoder from ...
