Joint training framework for text-to-speech and voice conversion using multi-source Tacotron and WaveNet

03/29/2019
by Mingyang Zhang, et al.

We investigate training a single shared model for both text-to-speech (TTS) and voice conversion (VC). As the shared model, we propose an extended Tacotron architecture: a multi-source sequence-to-sequence model with a dual attention mechanism. The model performs one task or the other depending on the type of input. Given text as input, it performs end-to-end speech synthesis; given the speech of a source speaker as input, it performs sequence-to-sequence voice conversion. Waveforms are generated by WaveNet, conditioned on the predicted mel-spectrogram. The shared model is trained jointly as a decoder for a target speaker that supports multiple sources. Listening experiments show that the proposed multi-source encoder-decoder model can efficiently perform both the TTS and VC tasks.
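To make the multi-source idea concrete, here is a minimal sketch of one decoder step with two attention heads, one over encoded text and one over encoded source speech. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the attention type or how an absent source is handled, so plain dot-product attention and zero-filling of the missing context are assumptions, and the names `attend` and `dual_attention_context` are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, memory):
    """Dot-product attention over one source for a single decoder step.

    Assumption: the abstract does not name the attention type; dot-product
    attention is used here only to keep the sketch short.
    """
    scores = [sum(q * m for q, m in zip(query, frame)) for frame in memory]
    weights = softmax(scores)
    context = [
        sum(w * frame[d] for w, frame in zip(weights, memory))
        for d in range(len(query))
    ]
    return context, weights


def dual_attention_context(query, text_memory=None, speech_memory=None):
    """Concatenate one context vector per source (text first, then speech).

    Assumption: a source that is not provided contributes a zero vector,
    so the decoder input keeps the same size for the TTS and VC cases.
    """
    contexts = []
    for memory in (text_memory, speech_memory):
        if memory is None:
            contexts.extend([0.0] * len(query))
        else:
            contexts.extend(attend(query, memory)[0])
    return contexts


random.seed(0)
DIM = 4
query = [random.gauss(0, 1) for _ in range(DIM)]
text_memory = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(5)]
speech_memory = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(20)]

# TTS-style step: only the text source is present.
tts_context = dual_attention_context(query, text_memory=text_memory)
# VC-style step: only the source-speaker speech is present.
vc_context = dual_attention_context(query, speech_memory=speech_memory)
```

In both cases the decoder receives a context of the same length (twice the encoder dimension), which is what would let a single decoder serve both tasks in this sketch.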


research
09/30/2020

Transfer Learning from Speech Synthesis to Voice Conversion with Non-Parallel Training Data

This paper presents a novel framework to build a voice conversion (VC) s...
research
10/07/2021

Sequence-To-Sequence Voice Conversion using F0 and Time Conditioning and Adversarial Learning

This paper presents a sequence-to-sequence voice conversion (S2S-VC) alg...
research
04/10/2017

Voice Conversion Using Sequence-to-Sequence Learning of Context Posterior Probabilities

Voice conversion (VC) using sequence-to-sequence learning of context pos...
research
11/09/2018

AttS2S-VC: Sequence-to-Sequence Voice Conversion with Attention and Context Preservation Mechanisms

This paper describes a method based on a sequence-to-sequence learning (...
research
07/11/2021

A Deep-Bayesian Framework for Adaptive Speech Duration Modification

We propose the first method to adaptively modify the duration of a given...
research
09/06/2020

Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling

This paper proposes an any-to-many location-relative, sequence-to-sequen...
research
03/15/2022

Text-free non-parallel many-to-many voice conversion using normalising flows

Non-parallel voice conversion (VC) is typically achieved using lossy rep...
