Hierarchical Sequence to Sequence Voice Conversion with Limited Data

07/15/2019
by Praveen Narayanan, et al.

We present a voice conversion solution using recurrent sequence-to-sequence (seq2seq) modeling with deep neural networks (DNNs). Our solution takes advantage of recent advances in attention-based modeling in Neural Machine Translation (NMT), Text-to-Speech (TTS) and Automatic Speech Recognition (ASR). The task is to convert between voices in a parallel setting, where <source, target> audio pairs are available. Our seq2seq architecture uses a hierarchical encoder to summarize input audio frames. On the decoder side, we use an attention-based architecture from recent TTS work. Because large multispeaker voice conversion databases of the size needed to train DNNs are scarce, we first train the network as an autoencoder on a large single-speaker dataset, then adapt it to the smaller multispeaker voice conversion datasets that are available. In contrast with other voice conversion works that use F_0, duration and linguistic features, our system uses mel spectrograms as the audio representation. Output mel frames are converted back to audio using a WaveNet vocoder.
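
To illustrate what "summarizing input audio frames" with a hierarchical encoder can mean, here is a minimal sketch in plain Python. It assumes a pyramid-style reduction in which each level concatenates pairs of consecutive frames and so halves the time resolution. The abstract does not specify the encoder's internals, so this scheme, the function names `stack_pairs` and `hierarchical_summary`, and the toy frame sizes are all assumptions made for illustration. A real encoder would likely also apply recurrent layers between levels, which are omitted here.

```python
from typing import List

Frame = List[float]


def stack_pairs(frames: List[Frame]) -> List[Frame]:
    """Concatenate each pair of consecutive frames, halving the sequence length.

    Assumption: pyramid-style time reduction; not taken from the paper.
    """
    if len(frames) % 2 == 1:
        # Pad with a copy of the last frame so every frame has a partner.
        frames = frames + [frames[-1]]
    return [frames[i] + frames[i + 1] for i in range(0, len(frames), 2)]


def hierarchical_summary(frames: List[Frame], levels: int) -> List[Frame]:
    """Apply the pairwise stacking `levels` times (hypothetical helper)."""
    out = frames
    for _ in range(levels):
        out = stack_pairs(out)
    return out


# Toy input: 10 frames with 3 "mel" bins each (values are placeholders).
mel = [[float(t)] * 3 for t in range(10)]
summary = hierarchical_summary(mel, levels=2)
print(len(summary), len(summary[0]))
```

With two levels, the ten toy frames are reduced to three summary frames of twelve values each, so an attention-based decoder would have far fewer encoder positions to attend over.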


