FragmentVC: Any-to-Any Voice Conversion by End-to-End Extracting and Fusing Fine-Grained Voice Fragments With Attention

10/27/2020
by Yist Y. Lin, et al.

Any-to-any voice conversion aims to convert voices to and from arbitrary speakers, including speakers unseen during training, which is much more challenging than one-to-one or many-to-many conversion but far more attractive in real-world scenarios. In this paper we propose FragmentVC, in which the latent phonetic structure of the source speaker's utterance is obtained from Wav2Vec 2.0, while the spectral features of the target speaker's utterance(s) are obtained from log mel-spectrograms. By aligning the hidden structures of these two feature spaces with a two-stage training process, FragmentVC extracts fine-grained voice fragments from the target speaker's utterance(s) and fuses them into the desired utterance, all based on the attention mechanism of the Transformer, as verified by analysis of the attention maps, and all accomplished end-to-end. The approach is trained with a reconstruction loss only, without any disentanglement of content and speaker information, and does not require parallel data. Objective evaluation based on speaker verification and subjective evaluation with MOS both showed that this approach outperforms state-of-the-art (SOTA) approaches such as AdaIN-VC and AutoVC.
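To make the fusion idea concrete, below is a minimal sketch (not the authors' released code) of the kind of cross-attention described in the abstract: source content features (e.g. Wav2Vec 2.0 outputs) act as queries that attend over target speaker mel-spectrogram frames, retrieving and fusing matching "fragments". All module names, dimensions, and hyperparameters here are illustrative assumptions, and the sketch omits FragmentVC's full encoder/decoder stack and two-stage training schedule.

```python
# Hypothetical sketch of attention-based fragment fusion, assuming standard PyTorch.
import torch
import torch.nn as nn

class CrossAttentionFuser(nn.Module):
    def __init__(self, content_dim=768, mel_dim=80, d_model=512, n_heads=8):
        super().__init__()
        self.content_proj = nn.Linear(content_dim, d_model)  # project Wav2Vec 2.0 features
        self.mel_proj = nn.Linear(mel_dim, d_model)          # project target mel frames
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.out = nn.Linear(d_model, mel_dim)               # predict converted mel frames

    def forward(self, src_content, tgt_mels):
        # src_content: (B, T_src, content_dim) from the source utterance
        # tgt_mels:    (B, T_tgt, mel_dim) from the target speaker utterance(s)
        q = self.content_proj(src_content)        # queries: what to say
        kv = self.mel_proj(tgt_mels)              # keys/values: how it should sound
        fused, attn_map = self.attn(q, kv, kv)    # retrieve matching target fragments
        return self.out(fused), attn_map          # converted mels + attention map

# Usage: in the spirit of the paper, such a model would be trained with a
# reconstruction loss against the source speaker's own mels, with no parallel
# data and no explicit content/speaker disentanglement objective.
model = CrossAttentionFuser()
src = torch.randn(1, 200, 768)    # 200 source frames of content features
tgt = torch.randn(1, 350, 80)     # 350 target-speaker mel frames
mel_out, attn = model(src, tgt)
print(mel_out.shape, attn.shape)  # (1, 200, 80) and (1, 200, 350)
```

The returned attention map is what would be inspected to verify that each output frame attends to phonetically similar fragments of the target speaker's speech, as the abstract describes.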

Related research

06/19/2021
Improving robustness of one-shot voice conversion with deep discriminative speaker encoder
One-shot voice conversion has received significant attention since only ...

04/10/2019
One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization
Recently, voice conversion (VC) without parallel data has been successfu...

04/09/2018
Multi-target Voice Conversion without Parallel Data by Adversarially Learning Disentangled Audio Representations
Recently, cycle-consistent adversarial network (Cycle-GAN) has been succ...

04/05/2022
Design Guidelines for Inclusive Speaker Verification Evaluation Datasets
Speaker verification (SV) provides billions of voice-enabled devices wit...

07/15/2023
Single and Multi-Speaker Cloned Voice Detection: From Perceptual to Learned Features
Synthetic-voice cloning technologies have seen significant advances in r...

04/07/2021
S2VC: A Framework for Any-to-Any Voice Conversion with Self-Supervised Pretrained Representations
Any-to-any voice conversion (VC) aims to convert the timbre of utterance...

10/14/2021
Toward Degradation-Robust Voice Conversion
Any-to-any voice conversion technologies convert the vocal timbre of an ...