Adversarial Contrastive Predictive Coding for Unsupervised Learning of Disentangled Representations

05/26/2020
by Janek Ebbers, et al.

In this work we tackle the disentanglement of speaker- and content-related variations in speech signals. We propose a fully convolutional variational autoencoder with two encoders: a content encoder and a speaker encoder. To foster disentanglement we propose adversarial contrastive predictive coding. This new disentanglement method needs neither parallel data nor any supervision, not even speaker labels. With successful disentanglement, the model can perform voice conversion by recombining content and speaker attributes. Because the speaker encoder learns to extract speaker traits from an audio signal, the proposed model not only provides meaningful speaker embeddings but can also perform zero-shot voice conversion, i.e., with previously unseen source and target speakers. Compared to state-of-the-art disentanglement approaches, we show competitive disentanglement and voice conversion performance for speakers seen during training and superior performance for unseen speakers.
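
For readers unfamiliar with contrastive predictive coding, the following is a minimal sketch of the generic InfoNCE objective that contrastive predictive coding is commonly built on. It is not the authors' adversarial variant, whose details the abstract does not give. The function name `info_nce_loss`, the dot-product scoring, the array shapes, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def info_nce_loss(predictions, targets):
    """Generic InfoNCE loss (illustrative, not the paper's exact objective).

    Row i of `predictions` is meant to match row i of `targets`;
    every other row of `targets` acts as a negative example.
    """
    # Dot-product similarity between every prediction and every candidate.
    logits = predictions @ targets.T
    # Numerically stable log-softmax over the candidates in each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Average negative log-likelihood of the positive (diagonal) pairs.
    return -np.mean(np.diag(log_probs))

# Toy data (hypothetical): 8 embeddings of dimension 16.
rng = np.random.default_rng(0)
targets = rng.normal(size=(8, 16))

# Predictions aligned with their targets should score a lower loss
# than predictions that are unrelated to the targets.
aligned_loss = info_nce_loss(5.0 * targets, targets)
random_loss = info_nce_loss(rng.normal(size=(8, 16)), targets)
```

In an adversarial setup of the kind the abstract names, one network would minimize such a loss while an encoder is trained against it. How the proposed model arranges this is described in the full paper, not here.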

research
04/10/2019

One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization

Recently, voice conversion (VC) without parallel data has been successfu...
research
07/11/2021

Many-to-Many Voice Conversion based Feature Disentanglement using Variational Autoencoder

Voice conversion is a challenging task which transforms the voice charac...
research
05/04/2021

Voice Conversion Based Speaker Normalization for Acoustic Unit Discovery

Discovering speaker independent acoustic units purely from spoken input ...
research
04/13/2021

NoiseVC: Towards High Quality Zero-Shot Voice Conversion

Voice conversion (VC) is a task that transforms voice from target audio ...
research
11/15/2022

Improved disentangled speech representations using contrastive learning in factorized hierarchical variational autoencoder

By utilizing the fact that speaker identity and content vary on differen...
research
05/17/2022

Dynamic Recognition of Speakers for Consent Management by Contrastive Embedding Replay

Voice assistants record sound and can overhear conversations. Thus, a co...
research
01/11/2022

MR-SVS: Singing Voice Synthesis with Multi-Reference Encoder

Multi-speaker singing voice synthesis is to generate the singing voice s...
