S2VC: A Framework for Any-to-Any Voice Conversion with Self-Supervised Pretrained Representations

04/07/2021
by   Jheng-Hao Lin, et al.
0

Any-to-any voice conversion (VC) aims to convert the timbre of utterances from and to any speakers seen or unseen during training. Various any-to-any VC approaches have been proposed like AUTOVC, AdaINVC, and FragmentVC. AUTOVC, and AdaINVC utilize source and target encoders to disentangle the content and speaker information of the features. FragmentVC utilizes two encoders to encode source and target information and adopts cross attention to align the source and target features with similar phonetic content. Moreover, pre-trained features are adopted. AUTOVC used dvector to extract speaker information, and self-supervised learning (SSL) features like wav2vec 2.0 is used in FragmentVC to extract the phonetic content information. Different from previous works, we proposed S2VC that utilizes Self-Supervised features as both source and target features for VC model. Supervised phoneme posteriororgram (PPG), which is believed to be speaker-independent and widely used in VC to extract content information, is chosen as a strong baseline for SSL features. The objective evaluation and subjective evaluation both show models taking SSL feature CPC as both source and target features outperforms that taking PPG as source feature, suggesting that SSL features have great potential in improving VC.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
02/22/2022

DRVC: A Framework of Any-to-Any Voice Conversion with Self-Supervised Learning

Any-to-any voice conversion problem aims to convert voices for source an...
research
04/10/2019

One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization

Recently, voice conversion (VC) without parallel data has been successfu...
research
02/06/2023

Autodecompose: A generative self-supervised model for semantic decomposition

We introduce Autodecompose, a novel self-supervised generative model tha...
research
08/09/2017

Weakly- and Self-Supervised Learning for Content-Aware Deep Image Retargeting

This paper proposes a weakly- and self-supervised deep convolutional neu...
research
10/23/2020

Any-to-One Sequence-to-Sequence Voice Conversion using Self-Supervised Discrete Speech Representations

We present a novel approach to any-to-one (A2O) voice conversion (VC) in...
research
10/23/2020

Toward Expressive Singing Voice Correction: On Perceptual Validity of Evaluation Metrics for Vocal Melody Extraction

Singing voice correction (SVC) is an appealing application for amateur s...
research
10/27/2020

FragmentVC: Any-to-Any Voice Conversion by End-to-End Extracting and Fusing Fine-Grained Voice Fragments With Attention

Any-to-any voice conversion aims to convert the voice from and to any sp...

Please sign up or login with your details

Forgot password? Click here to reset