FreeVC: Towards High-Quality Text-Free One-Shot Voice Conversion

10/27/2022
by   Jingyi Li, et al.
0

Voice conversion (VC) can be achieved by first extracting source content information and target speaker information, and then reconstructing waveform with these information. However, current approaches normally either extract dirty content information with speaker information leaked in, or demand a large amount of annotated data for training. Besides, the quality of reconstructed waveform can be degraded by the mismatch between conversion model and vocoder. In this paper, we adopt the end-to-end framework of VITS for high-quality waveform reconstruction, and propose strategies for clean content information extraction without text annotation. We disentangle content information by imposing an information bottleneck to WavLM features, and propose the spectrogram-resize based data augmentation to improve the purity of extracted content information. Experimental results show that the proposed method outperforms the latest VC models trained with annotated data and has greater robustness.

READ FULL TEXT
research
06/15/2022

End-to-End Voice Conversion with Information Perturbation

The ideal goal of voice conversion is to convert the source speaker's sp...
research
03/31/2022

HiFi-VC: High Quality ASR-Based Voice Conversion

The goal of voice conversion (VC) is to convert input voice to match the...
research
10/20/2022

Robust One-Shot Singing Voice Conversion

Many existing works on singing voice conversion (SVC) require clean reco...
research
03/24/2022

Disentangleing Content and Fine-grained Prosody Information via Hybrid ASR Bottleneck Features for Voice Conversion

Non-parallel data voice conversion (VC) have achieved considerable break...
research
11/21/2020

Exploring Voice Conversion based Data Augmentation in Text-Dependent Speaker Verification

In this paper, we focus on improving the performance of the text-depende...
research
10/29/2018

Audiovisual speaker conversion: jointly and simultaneously transforming facial expression and acoustic characteristics

An audiovisual speaker conversion method is presented for simultaneously...
research
03/28/2023

Instruct 3D-to-3D: Text Instruction Guided 3D-to-3D conversion

We propose a high-quality 3D-to-3D conversion method, Instruct 3D-to-3D....

Please sign up or login with your details

Forgot password? Click here to reset