Many-to-Many Voice Transformer Network

05/18/2020
by Hirokazu Kameoka, et al.

This paper proposes a voice conversion (VC) method based on a sequence-to-sequence (S2S) learning framework, which makes it possible to simultaneously convert the voice characteristics, pitch contour, and duration of input speech. We previously proposed an S2S-based VC method using a transformer network architecture, which we call the "voice transformer network (VTN)". While the original VTN is designed to learn only a mapping of speech feature sequences from one domain into another, we extend it so that it can simultaneously learn mappings among multiple domains using only a single model. This allows the model to fully utilize available training data collected from multiple domains by capturing common latent features that can be shared across different domains. On top of this model, we further propose incorporating a training loss called the "identity mapping loss", which ensures that the input feature sequence remains unchanged when it already belongs to the target domain. Using this loss for model training has been found to be extremely effective in improving the performance of the model at test time. We conducted speaker identity conversion experiments and showed that the proposed model achieved higher sound quality and speaker similarity than baseline methods.
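
To make the idea of the identity mapping loss concrete, below is a minimal sketch, not the authors' implementation and not drawn from the paper, of how such a loss could be attached to a toy many-to-many S2S converter conditioned on source and target speaker indices. The class ManyToManyVTN, the function training_losses, and all hyperparameters are hypothetical placeholders, and the sketch ignores details of the actual VTN architecture and training procedure.

```python
# Hypothetical sketch: a many-to-many S2S converter with speaker embeddings,
# trained with a conversion loss plus an "identity mapping loss" that asks the
# model to reproduce the input when the target domain equals the source domain.
import torch
import torch.nn as nn

class ManyToManyVTN(nn.Module):
    """Toy stand-in for a many-to-many voice transformer network."""
    def __init__(self, n_speakers, feat_dim=80, d_model=256):
        super().__init__()
        self.src_emb = nn.Embedding(n_speakers, d_model)
        self.tgt_emb = nn.Embedding(n_speakers, d_model)
        self.in_proj = nn.Linear(feat_dim, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True,
        )
        self.out_proj = nn.Linear(d_model, feat_dim)

    def forward(self, src_feats, tgt_feats, src_spk, tgt_spk):
        # Add source/target speaker embeddings so a single model can cover
        # all domain pairs.
        enc_in = self.in_proj(src_feats) + self.src_emb(src_spk)[:, None, :]
        dec_in = self.in_proj(tgt_feats) + self.tgt_emb(tgt_spk)[:, None, :]
        h = self.transformer(enc_in, dec_in)
        return self.out_proj(h)

def training_losses(model, x_src, x_tgt, spk_src, spk_tgt, lam=1.0):
    """Conversion loss on a parallel pair plus the identity mapping loss."""
    l1 = nn.L1Loss()
    # Conversion: map speaker spk_src's features toward speaker spk_tgt's.
    y = model(x_src, x_tgt, spk_src, spk_tgt)
    conv_loss = l1(y, x_tgt)
    # Identity mapping: with the target domain set equal to the source domain,
    # the model should leave the input essentially unchanged.
    y_id = model(x_src, x_src, spk_src, spk_src)
    id_loss = l1(y_id, x_src)
    return conv_loss + lam * id_loss

if __name__ == "__main__":
    torch.manual_seed(0)
    model = ManyToManyVTN(n_speakers=4)
    x_src = torch.randn(2, 50, 80)   # (batch, frames, mel bins)
    x_tgt = torch.randn(2, 60, 80)
    spk_src = torch.tensor([0, 1])
    spk_tgt = torch.tensor([2, 3])
    loss = training_losses(model, x_src, x_tgt, spk_src, spk_tgt)
    loss.backward()
    print(float(loss))
```

The key point is the second term: the source-domain utterance is fed back through the model with the target domain set equal to the source domain, and any deviation from the input is penalized.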


Related research

11/05/2018 - ConvS2S-VC: Fully convolutional sequence-to-sequence voice conversion
This paper proposes a voice conversion method based on fully convolution...

11/17/2020 - Optimizing voice conversion network with cycle consistency loss of speaker identity
We propose a novel training scheme to optimize voice conversion network ...

04/14/2021 - Non-autoregressive sequence-to-sequence voice conversion
This paper proposes a novel voice conversion (VC) method based on non-au...

08/27/2020 - Non-Parallel Voice Conversion with Augmented Classifier Star Generative Adversarial Networks
We have previously proposed a method that allows for non-parallel voice ...

02/05/2020 - Vocoder-free End-to-End Voice Conversion with Transformer Network
Mel-frequency filter bank (MFB) based approaches have the advantage of l...

10/22/2020 - Sequence-to-sequence Singing Voice Synthesis with Perceptual Entropy Loss
The neural network (NN) based singing voice synthesis (SVS) systems requ...

04/08/2022 - Enhanced exemplar autoencoder with cycle consistency loss in any-to-one voice conversion
Recent research showed that an autoencoder trained with speech of a sing...
