DeepAI AI Chat
Log In Sign Up

Voice Conversion from Non-parallel Corpora Using Variational Auto-encoder

by   Chin-Cheng Hsu, et al.
Academia Sinica

We propose a flexible framework for spectral conversion (SC) that facilitates training with unaligned corpora. Many SC frameworks require parallel corpora, phonetic alignments, or explicit frame-wise correspondence for learning conversion functions or for synthesizing a target spectrum with the aid of alignments. However, these requirements gravely limit the scope of practical applications of SC due to scarcity or even unavailability of parallel corpora. We propose an SC framework based on variational auto-encoder which enables us to exploit non-parallel corpora. The framework comprises an encoder that learns speaker-independent phonetic representations and a decoder that learns to reconstruct the designated speaker. It removes the requirement of parallel corpora or phonetic alignments to train a spectral conversion system. We report objective and subjective evaluations to validate our proposed method and compare it to SC methods that have access to aligned corpora.


page 1

page 2

page 3

page 4


VAW-GAN for Singing Voice Conversion with Non-parallel Training Data

Singing voice conversion aims to convert singer's voice from source to t...

Dictionary Update for NMF-based Voice Conversion Using an Encoder-Decoder Network

In this paper, we propose a dictionary update method for Nonnegative Mat...

VAW-GAN for Disentanglement and Recomposition of Emotional Elements in Speech

Emotional voice conversion (EVC) aims to convert the emotion of speech f...

Non-Parallel Sequence-to-Sequence Voice Conversion with Disentangled Linguistic and Speaker Representations

In this paper, a method for non-parallel sequence-to-sequence (seq2seq) ...

Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling

This paper proposes an any-to-many location-relative, sequence-to-sequen...

Rhythm-Flexible Voice Conversion without Parallel Data Using Cycle-GAN over Phoneme Posteriorgram Sequences

Speaking rate refers to the average number of phonemes within some unit ...

Interpretable AI for relating brain structural and functional connectomes

One of the central problems in neuroscience is understanding how brain s...