StyleTTS-VC: One-Shot Voice Conversion by Knowledge Transfer from Style-Based TTS Models

12/29/2022
by Yinghao Aaron Li, et al.

One-shot voice conversion (VC) aims to convert speech from any source speaker to an arbitrary target speaker with only a few seconds of reference speech from the target speaker. This relies heavily on disentangling the speaker's identity from the speech content, a task that remains challenging. Here, we propose a novel approach to learning disentangled speech representations by transfer learning from style-based text-to-speech (TTS) models. With cycle-consistent and adversarial training, the style-based TTS models can perform transcription-guided one-shot VC with high fidelity and similarity. By learning an additional mel-spectrogram encoder through teacher-student knowledge transfer and a novel data augmentation scheme, our approach produces disentangled speech representations without needing the input text. Subjective evaluation shows that our approach significantly outperforms previous state-of-the-art one-shot voice conversion models in both naturalness and similarity.

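To make the teacher-student step in the abstract concrete, below is a minimal PyTorch sketch of how a mel-spectrogram encoder (the student) could be distilled from the text encoder of a pretrained style-based TTS model (the teacher). The module names, layer sizes, and the L1 distillation objective are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the teacher-student knowledge transfer described above,
# assuming a pretrained style-based TTS whose text encoder (the teacher)
# already yields mel-aligned linguistic latents. MelStudent, distillation_step,
# and all shapes are hypothetical, for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MelStudent(nn.Module):
    """Hypothetical mel-spectrogram encoder trained to mimic the TTS text encoder,
    so conversion no longer needs the input transcription at inference time."""

    def __init__(self, n_mels: int = 80, d_latent: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, 512, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(512, d_latent, kernel_size=5, padding=2),
        )

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, n_mels, frames) -> latent: (batch, d_latent, frames)
        return self.net(mel)


def distillation_step(student: MelStudent,
                      teacher_latent: torch.Tensor,
                      mel: torch.Tensor,
                      augment) -> torch.Tensor:
    """One knowledge-transfer step: the student encodes an augmented copy of the
    mel-spectrogram and is pulled toward the frozen teacher's text-derived latent."""
    pred = student(augment(mel))                     # data augmentation on the student input
    return F.l1_loss(pred, teacher_latent.detach())  # teacher is frozen (no gradient)
```

At inference, the student would replace the text encoder, which is what allows conversion without a transcription, as described in the abstract.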
