Non-Parallel Sequence-to-Sequence Voice Conversion with Disentangled Linguistic and Speaker Representations

06/25/2019
by   Jing-Xuan Zhang, et al.

In this paper, a method for non-parallel sequence-to-sequence (seq2seq) voice conversion is presented. In the proposed method, disentangled linguistic and speaker representations are learned. Our model is built under the encoder-decoder framework: encoders extract hidden representations, which the decoder uses to recover acoustic features. For learning disentangled linguistic representations, we propose two strategies. First, additional text inputs are introduced to generate embeddings jointly with the audio inputs, and the hidden embeddings of the text inputs serve as references for those of the audio inputs. Second, an adversarial training method is adopted to further remove speaker-related information from the linguistic embeddings of the audio signals. Meanwhile, speaker representations are extracted by a speaker encoder. Both the encoder that extracts linguistic representations from audio and the decoder are built with seq2seq networks, so our method is not constrained to frame-by-frame conversion. For training the model, a two-stage strategy is adopted: in the first stage, the model is pre-trained on a multi-speaker dataset; in the second, it is fine-tuned on the dataset of a specific conversion pair. Experiments were conducted to compare the proposed method with both parallel and non-parallel baselines. The results showed that our method obtained higher similarity and naturalness than the top-ranked method in Voice Conversion Challenge 2018, a non-parallel baseline, and that its performance was close to that of the state-of-the-art seq2seq-based parallel baseline. Ablation tests were also conducted to validate the proposed disentangling and pre-training strategies.
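The abstract implies a training objective with three parts: an acoustic reconstruction loss, a loss pulling the audio linguistic embeddings toward the text embeddings, and an adversarial term that penalizes the encoder when a speaker classifier can identify the speaker from the linguistic embedding. A minimal numerical sketch of how these terms might combine is below; the function name, loss weights, and the choice of mean-squared error are illustrative assumptions, not details from the paper.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def training_losses(acoustic_pred, acoustic_true,
                    audio_emb, text_emb,
                    speaker_logits, speaker_id,
                    w_embed=1.0, w_adv=0.1):
    """Sketch of the combined objectives described in the abstract.

    - l_rec:   decoder recovers acoustic features (reconstruction)
    - l_embed: audio linguistic embeddings track the text embeddings
    - l_ce:    speaker classifier's cross-entropy on the linguistic
               embedding; the encoder *subtracts* it (adversarial term),
               pushing speaker information out of the embedding.
    Weights w_embed and w_adv are hypothetical.
    """
    l_rec = np.mean((acoustic_pred - acoustic_true) ** 2)
    l_embed = np.mean((audio_emb - text_emb) ** 2)
    p = softmax(speaker_logits)
    l_ce = -np.log(p[speaker_id] + 1e-12)  # classifier's own loss
    # Encoder/decoder objective: minimize reconstruction and embedding
    # mismatch while maximizing the speaker classifier's loss.
    encoder_loss = l_rec + w_embed * l_embed - w_adv * l_ce
    return encoder_loss, l_rec, l_embed, l_ce
```

In practice the adversarial term is typically implemented with a gradient-reversal layer or alternating updates between the classifier and the encoder; the subtraction above only shows the direction of the encoder's objective.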

