Training Robust Zero-Shot Voice Conversion Models with Self-supervised Features

12/08/2021
by Trung Dang, et al.

Unsupervised Zero-Shot Voice Conversion (VC) aims to modify the speaker characteristics of an utterance to match an unseen target speaker without relying on parallel training data. Recently, self-supervised learning of speech representations has been shown to produce useful linguistic units without using transcripts, and these units can be passed directly to a VC model. In this paper, we show that high-quality audio samples can be achieved by using a length resampling decoder, which enables the VC model to work in conjunction with different linguistic feature extractors and vocoders without requiring them to operate on the same sequence length. We show that our method can outperform many baselines on the VCTK dataset. Without modifying the architecture, we further demonstrate that a) using pairs of different audio segments from the same speaker, b) adding a cycle consistency loss, and c) adding a speaker classification loss can help to learn a better speaker embedding. Our model trained on LibriTTS using these techniques achieves the best performance, producing audio samples that transfer well to the target speaker's voice while preserving linguistic content at a level comparable to actual human utterances in terms of Character Error Rate.
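To make the idea of length resampling concrete, the sketch below stretches a sequence of content-feature frames to a different number of frames, as would be needed when a feature extractor and a vocoder run at different frame rates. The abstract does not say how the decoder performs this step, so linear interpolation is an assumption here, and the function name `resample_length` and the toy feature values are hypothetical.

```python
from typing import List

Frame = List[float]


def resample_length(features: List[Frame], target_len: int) -> List[Frame]:
    """Linearly interpolate a (T_in x D) feature sequence to (target_len x D).

    Assumption: linear interpolation stands in for whatever resampling
    the paper's decoder actually uses.
    """
    if not features:
        raise ValueError("features must contain at least one frame")
    if target_len <= 0:
        raise ValueError("target_len must be positive")

    src_len = len(features)
    if src_len == 1 or target_len == 1:
        # Degenerate cases: nothing to interpolate, repeat the first frame.
        return [list(features[0]) for _ in range(target_len)]

    # Map each output index to a fractional source position so that the
    # first and last frames of input and output line up.
    scale = (src_len - 1) / (target_len - 1)
    out: List[Frame] = []
    for j in range(target_len):
        pos = j * scale
        lo = min(int(pos), src_len - 2)  # left neighbour index
        frac = pos - lo                  # weight of the right neighbour
        left, right = features[lo], features[lo + 1]
        out.append([(1.0 - frac) * x + frac * y for x, y in zip(left, right)])
    return out


# Hypothetical toy input: 3 content frames with 2 dimensions each,
# stretched to the 5 frames a downstream vocoder might expect.
content = [[0.0, 10.0], [1.0, 20.0], [2.0, 30.0]]
stretched = resample_length(content, 5)
```

Because the output length is a free parameter, the same content features can be matched to vocoders with different frame rates without retraining the feature extractor, which is the decoupling the abstract describes.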


research
02/16/2023

ACE-VC: Adaptive and Controllable Voice Conversion using Explicitly Disentangled Self-supervised Speech Representations

In this work, we propose a zero-shot voice conversion method using speec...
research
10/23/2020

Any-to-One Sequence-to-Sequence Voice Conversion using Self-Supervised Discrete Speech Representations

We present a novel approach to any-to-one (A2O) voice conversion (VC) in...
research
08/24/2023

WavMark: Watermarking for Audio Generation

Recent breakthroughs in zero-shot voice synthesis have enabled imitating...
research
04/07/2022

Self supervised learning for robust voice cloning

Voice cloning is a difficult task which requires robust and informative ...
research
03/03/2023

WESPER: Zero-shot and Realtime Whisper to Normal Voice Conversion for Whisper-based Speech Interactions

Recognizing whispered speech and converting it to normal speech creates ...
research
10/27/2021

Zero-shot Voice Conversion via Self-supervised Prosody Representation Learning

Voice Conversion (VC) for unseen speakers, also known as zero-shot VC, i...
research
06/16/2021

Voicy: Zero-Shot Non-Parallel Voice Conversion in Noisy Reverberant Environments

Voice Conversion (VC) is a technique that aims to transform the non-ling...
