SLMGAN: Exploiting Speech Language Model Representations for Unsupervised Zero-Shot Voice Conversion in GANs

07/18/2023
by Yinghao Aaron Li, et al.

In recent years, large-scale pre-trained speech language models (SLMs) have demonstrated remarkable advancements in various generative speech modeling applications, such as text-to-speech synthesis, voice conversion, and speech enhancement. These applications typically involve mapping text or speech inputs to pre-trained SLM representations, from which target speech is decoded. This paper introduces a new approach, SLMGAN, to leverage SLM representations for discriminative tasks within the generative adversarial network (GAN) framework, specifically for voice conversion. Building upon StarGANv2-VC, we add our novel SLM-based WavLM discriminators on top of the mel-based discriminators along with our newly designed SLM feature matching loss function, resulting in an unsupervised zero-shot voice conversion system that does not require text labels during training. Subjective evaluation results show that SLMGAN outperforms existing state-of-the-art zero-shot voice conversion models in terms of naturalness and achieves comparable similarity, highlighting the potential of SLM-based discriminators for related applications.
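The abstract does not spell out the form of the SLM feature matching loss. As a rough illustration of the general idea of feature matching in GANs, and not the paper's formulation, the sketch below averages the absolute difference between per-layer features computed for real speech and for converted speech. The function name `feature_matching_loss` and the flat-list representation of each layer's features are assumptions made for this example; in a real system the features would be hidden activations from a discriminator or an SLM such as WavLM.

```python
from typing import Sequence


def feature_matching_loss(
    real_feats: Sequence[Sequence[float]],
    fake_feats: Sequence[Sequence[float]],
) -> float:
    """Generic feature matching loss (illustrative, hypothetical helper).

    Each argument holds one flat feature vector per layer. The loss is the
    mean absolute difference within each layer, averaged over layers.
    """
    if len(real_feats) != len(fake_feats):
        raise ValueError("real and fake inputs must have the same number of layers")
    if not real_feats:
        raise ValueError("at least one layer of features is required")

    total = 0.0
    for real, fake in zip(real_feats, fake_feats):
        if len(real) != len(fake):
            raise ValueError("feature vectors of a layer must have equal length")
        if not real:
            raise ValueError("feature vectors must not be empty")
        # Mean L1 distance between the real and generated features of one layer.
        total += sum(abs(r - f) for r, f in zip(real, fake)) / len(real)

    # Average over layers so the scale does not depend on the layer count.
    return total / len(real_feats)


if __name__ == "__main__":
    real = [[1.0, 2.0], [0.0]]
    fake = [[1.0, 4.0], [1.0]]
    print(feature_matching_loss(real, fake))
```

In adversarial training, a term of this kind is typically added to the generator objective so that generated samples produce intermediate features close to those of real samples, alongside the usual adversarial loss.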

research
12/04/2021

YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone

YourTTS brings the power of a multilingual approach to the task of zero-...
research
10/03/2022

SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model

Data-driven speech processing models usually perform well with a large a...
research
06/23/2021

Zero-Shot Joint Modeling of Multiple Spoken-Text-Style Conversion Tasks using Switching Tokens

In this paper, we propose a novel spoken-text-style conversion method th...
research
06/28/2023

Fake the Real: Backdoor Attack on Deep Speech Classification via Voice Conversion

Deep speech classification has achieved tremendous success and greatly p...
research
05/14/2019

AUTOVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss

Non-parallel many-to-many voice conversion, as well as zero-shot voice c...
research
10/20/2022

Robust One-Shot Singing Voice Conversion

Many existing works on singing voice conversion (SVC) require clean reco...
