Any-to-One Sequence-to-Sequence Voice Conversion using Self-Supervised Discrete Speech Representations

10/23/2020
by Wen-Chin Huang, et al.

We present a novel approach to any-to-one (A2O) voice conversion (VC) in a sequence-to-sequence (seq2seq) framework. A2O VC aims to convert any speaker, including those unseen during training, to a fixed target speaker. We utilize vq-wav2vec (VQW2V), a discretized self-supervised speech representation learned from massive unlabeled data, which is assumed to be speaker-independent and to correspond well to the underlying linguistic content. Given a training dataset of the target speaker, we extract VQW2V codes and acoustic features and estimate a seq2seq mapping function from the former to the latter. With the help of a pretraining method and a newly designed postprocessing technique, our model generalizes to as little as 5 minutes of target-speaker data, even outperforming the same model trained with parallel data.
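To make the two stages described in the abstract concrete, here is a minimal sketch, not the authors' implementation: it loads a pretrained vq-wav2vec checkpoint with fairseq (following the published example usage), extracts the discrete code indices for an utterance, and feeds them to a toy recurrent mapping network that stands in for the seq2seq conversion model. The checkpoint path, the codebook sizes, and the ToySeq2Seq module (a frame-wise LSTM rather than the attention-based seq2seq model the paper actually uses) are illustrative assumptions.

```python
# Minimal sketch, not the authors' code: extract vq-wav2vec (VQW2V) discrete
# indices with fairseq, then map them to target-speaker acoustic frames with a
# toy network. Checkpoint path, codebook sizes, and the ToySeq2Seq architecture
# are assumptions made for illustration only.
import torch
import torch.nn as nn
from fairseq.models.wav2vec import Wav2VecModel

# 1. Load a pretrained vq-wav2vec checkpoint (follows the fairseq example usage).
cp = torch.load("vq-wav2vec.pt", map_location="cpu")
vqw2v = Wav2VecModel.build_model(cp["args"], task=None)
vqw2v.load_state_dict(cp["model"])
vqw2v.eval()

# 2. Extract discrete indices for a (placeholder) 1-second, 16 kHz waveform.
wav = torch.randn(1, 16000)
with torch.no_grad():
    z = vqw2v.feature_extractor(wav)                  # continuous frame features
    _, codes = vqw2v.vector_quantizer.forward_idx(z)  # (1, T, G) integer codes

# 3. Toy mapping from code embeddings to acoustic frames (e.g. 80-dim mel).
#    The paper's model is an attention-based seq2seq network; this frame-wise
#    LSTM stands in only to show the direction of the mapping.
class ToySeq2Seq(nn.Module):
    def __init__(self, codebook_size=320, groups=2, d_model=256, n_mels=80):
        super().__init__()
        self.codebook_size = codebook_size
        self.embed = nn.Embedding(codebook_size * groups, d_model)
        self.encoder = nn.LSTM(d_model, d_model, batch_first=True, bidirectional=True)
        self.decoder = nn.LSTM(2 * d_model, d_model, batch_first=True)
        self.proj = nn.Linear(d_model, n_mels)

    def forward(self, codes):                         # codes: (B, T, G)
        # Offset each group so it indexes its own slice of the embedding table,
        # then merge the group embeddings by summation.
        offsets = torch.arange(codes.size(-1), device=codes.device) * self.codebook_size
        x = self.embed(codes + offsets).sum(dim=2)
        enc, _ = self.encoder(x)
        dec, _ = self.decoder(enc)
        return self.proj(dec)                         # (B, T, n_mels)

model = ToySeq2Seq()
mel_pred = model(codes)
print(codes.shape, mel_pred.shape)
```

In training, the acoustic features extracted from the target speaker's own recordings serve as the regression target, so that at conversion time any source speaker's VQW2V codes are decoded into the fixed target speaker's voice.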


Related research

12/08/2021 · Training Robust Zero-Shot Voice Conversion Models with Self-supervised Features
Unsupervised Zero-Shot Voice Conversion (VC) aims to modify the speaker ...

04/07/2022 · Self supervised learning for robust voice cloning
Voice cloning is a difficult task which requires robust and informative ...

04/09/2018 · Multi-target Voice Conversion without Parallel Data by Adversarially Learning Disentangled Audio Representations
Recently, cycle-consistent adversarial network (Cycle-GAN) has been succ...

04/07/2021 · S2VC: A Framework for Any-to-Any Voice Conversion with Self-Supervised Pretrained Representations
Any-to-any voice conversion (VC) aims to convert the timbre of utterance...

09/05/2023 · Evaluating Methods for Ground-Truth-Free Foreign Accent Conversion
Foreign accent conversion (FAC) is a special application of voice conver...

07/31/2021 · Voice Reconstruction from Silent Speech with a Sequence-to-Sequence Model
Silent Speech Decoding (SSD) based on Surface electromyography (sEMG) ha...

07/03/2023 · RobustL2S: Speaker-Specific Lip-to-Speech Synthesis exploiting Self-Supervised Representations
Significant progress has been made in speaker dependent Lip-to-Speech sy...
