Vocoder-free End-to-End Voice Conversion with Transformer Network

by June-Woo Kim, et al.

Mel-frequency filter bank (MFB) based approaches have the advantage of learning speech more easily than raw-spectrum approaches, since MFB features have a smaller feature size. However, speech generators built on MFB features require an additional vocoder, which incurs a large computational expense during training. Such pre/post-processing (MFB extraction and vocoding) is not essential for converting real human speech to another voice: the raw magnitude spectrum, together with the phase, is sufficient to generate voices in a different style with clear pronunciation. In this regard, we propose a fast and effective approach that converts realistic voices from the raw spectrum in a parallel manner. Our transformer-based model architecture, which contains no CNN or RNN layers, trains quickly and avoids the sequential-computation bottleneck of conventional RNNs. In this paper, we introduce a vocoder-free end-to-end voice conversion method using a transformer network. The presented conversion model can also be used for speaker adaptation in speech recognition: our approach converts a source voice to a target voice without using an MFB representation or a vocoder, and an adapted MFB for speech recognition can be obtained by combining the converted magnitude with the phase. We evaluate our voice conversion experiments on the TIDIGITS dataset using mean opinion scores for naturalness, similarity, and clarity.
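The adaptation step described above (recombining the converted magnitude with the phase, then projecting onto mel filters) can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation; the function names, filter construction, and parameter values are assumptions chosen for clarity.

```python
import numpy as np

def hz_to_mel(f):
    # Standard HTK-style Hz -> mel mapping.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular mel filters over the positive FFT bins (n_fft // 2 + 1 of them).
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def adapted_mfb(converted_mag, source_phase, mel_fb):
    # Recombine the converted magnitude spectrum with the (unchanged) phase,
    # then project the resulting power spectrum onto the mel filters.
    spectrum = converted_mag * np.exp(1j * source_phase)  # complex spectrogram
    power = np.abs(spectrum) ** 2                         # == converted_mag ** 2
    return mel_fb @ power                                 # (n_mels, n_frames)
```

Because the phase has unit modulus, the mel projection depends only on the converted magnitude; carrying the phase through is what additionally permits waveform reconstruction via the inverse STFT without a separate vocoder.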


