Zero-shot Voice Conversion via Self-supervised Prosody Representation Learning

10/27/2021
by Shijun Wang, et al.

Voice Conversion (VC) for unseen speakers, also known as zero-shot VC, is an attractive topic due to its usefulness in real-world scenarios. Recent work in this area has made progress with disentanglement methods that separate utterance content from speaker characteristics. Prosody is also crucial, yet extracting disentangled prosody characteristics for unseen speakers remains an open issue. In this paper, we propose a novel self-supervised approach to effectively learn prosody characteristics. We then use the learned prosodic representations to train our VC model for zero-shot conversion. Our evaluation demonstrates that we can efficiently extract disentangled prosody representations. Moreover, we show improved performance compared to state-of-the-art zero-shot VC models.

research
03/30/2022

Robust Disentangled Variational Speech Representation Learning for Zero-shot Voice Conversion

Traditional studies on voice conversion (VC) have made progress with par...
research
04/24/2023

Zero-shot text-to-speech synthesis conditioned using self-supervised speech representation model

This paper proposes a zero-shot text-to-speech (TTS) conditioned by a se...
research
05/19/2022

Voice Activity Projection: Self-supervised Learning of Turn-taking Events

The modeling of turn-taking in dialog can be viewed as the modeling of t...
research
05/11/2022

Towards Improved Zero-shot Voice Conversion with Conditional DSVAE

Disentangling content and speaking style information is essential for ze...
research
09/18/2023

Face-Driven Zero-Shot Voice Conversion with Memory-based Face-Voice Alignment

This paper presents a novel task, zero-shot voice conversion based on fa...
research
12/08/2021

Training Robust Zero-Shot Voice Conversion Models with Self-supervised Features

Unsupervised Zero-Shot Voice Conversion (VC) aims to modify the speaker ...
research
04/13/2021

NoiseVC: Towards High Quality Zero-Shot Voice Conversion

Voice conversion (VC) is a task that transforms voice from target audio ...
