Pretraining Techniques for Sequence-to-Sequence Voice Conversion

08/07/2020
by Wen-Chin Huang, et al.

Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody. Nonetheless, without sufficient data, seq2seq VC models can suffer from unstable training and mispronunciation in the converted speech, and are thus still far from practical. To tackle these shortcomings, we propose to transfer knowledge from other speech processing tasks for which large-scale corpora are readily available, namely text-to-speech (TTS) and automatic speech recognition (ASR). We argue that VC models initialized with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech. We apply these techniques to recurrent neural network (RNN)-based and Transformer-based models, and through systematic experiments we demonstrate the effectiveness of the pretraining scheme and the superiority of Transformer-based models over RNN-based models in terms of intelligibility, naturalness, and similarity.
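The pretraining scheme described above can be sketched as follows: a seq2seq model is first trained on a task with abundant data (e.g. TTS), and the VC model is then initialized by copying over every parameter whose name and shape match. This is a minimal illustrative PyTorch sketch, not the authors' implementation; the `Seq2SeqVC` module and its layer names are hypothetical.

```python
import torch
import torch.nn as nn


class Seq2SeqVC(nn.Module):
    """Toy Transformer encoder-decoder over mel-spectrogram frames (illustrative)."""

    def __init__(self, n_mels=80, d_model=256):
        super().__init__()
        self.in_proj = nn.Linear(n_mels, d_model)   # frame embedding
        self.transformer = nn.Transformer(d_model=d_model, batch_first=True)
        self.out_proj = nn.Linear(d_model, n_mels)  # back to mel frames

    def forward(self, src, tgt):
        # src: source-speaker frames, tgt: shifted target frames (teacher forcing)
        return self.out_proj(self.transformer(self.in_proj(src), self.in_proj(tgt)))


def init_from_pretrained(vc_model, pretrained_state_dict):
    """Initialize a VC model from a pretrained (e.g. TTS or ASR) checkpoint.

    Only parameters whose name and shape match are copied; everything else
    keeps its random initialization.
    """
    own = vc_model.state_dict()
    transferred = {
        name: tensor
        for name, tensor in pretrained_state_dict.items()
        if name in own and tensor.shape == own[name].shape
    }
    own.update(transferred)
    vc_model.load_state_dict(own)
    return sorted(transferred)  # names of the copied parameters


# Usage: stand in for a pretrained TTS model with an identically shaped network.
tts = Seq2SeqVC()
vc = Seq2SeqVC()
copied = init_from_pretrained(vc, tts.state_dict())
```

After the transfer, fine-tuning proceeds on the (small) parallel VC corpus; the pretrained weights act as a strong initialization rather than a frozen feature extractor.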

Related research

- Voice Transformer Network: Sequence-to-Sequence Voice Conversion Using Transformer with Text-to-Speech Pretraining (12/14/2019)
  We introduce a novel sequence-to-sequence (seq2seq) voice conversion (VC...

- Token-Level Ensemble Distillation for Grapheme-to-Phoneme Conversion (04/06/2019)
  Grapheme-to-phoneme (G2P) conversion is an important task in automatic s...

- LiteG2P: A fast, light and high accuracy model for grapheme-to-phoneme conversion (03/02/2023)
  As a key component of automated speech recognition (ASR) and the front-e...

- Sequence-to-sequence Singing Voice Synthesis with Perceptual Entropy Loss (10/22/2020)
  The neural network (NN) based singing voice synthesis (SVS) systems requ...

- Beyond Universal Transformer: block reusing with adaptor in Transformer for automatic speech recognition (03/23/2023)
  Transformer-based models have recently made significant achievements in ...

- Vocoder-free End-to-End Voice Conversion with Transformer Network (02/05/2020)
  Mel-frequency filter bank (MFB) based approaches have the advantage of l...

- Towards High-fidelity Singing Voice Conversion with Acoustic Reference and Contrastive Predictive Coding (10/10/2021)
  Recently, phonetic posteriorgrams (PPGs) based methods have been quite p...
