Zero-Shot Voice Conditioning for Denoising Diffusion TTS Models

by Alon Levkovitch, et al.

We present a novel way of conditioning a pretrained denoising diffusion speech model to produce speech in the voice of a novel person unseen during training. The method requires only a short (~3 seconds) sample from the target person, and generation is steered at inference time, without any training steps. At the heart of the method lies a sampling process that combines the estimation of the denoising model with a low-pass version of the new speaker's sample. Objective and subjective evaluations show that our sampling method can generate a voice similar to that of the target speaker in terms of frequency, with an accuracy comparable to state-of-the-art methods, and without training.
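To make the sampling idea concrete, the sketch below shows one simplified reverse-diffusion step in which the low-frequency content of the denoiser's clean-signal estimate is swapped for the low-frequency content of the reference speaker's sample. This is a minimal NumPy illustration under assumed conventions (a standard DDPM noise-prediction parameterization and an FFT-based low-pass filter); the function names, the `cutoff_bins` parameter, and the exact blending rule are illustrative, not the authors' implementation.

```python
import numpy as np

def low_pass(x, cutoff_bins):
    """Keep only the lowest `cutoff_bins` frequency components of x."""
    X = np.fft.rfft(x)
    X[cutoff_bins:] = 0.0
    return np.fft.irfft(X, n=len(x))

def guided_x0_estimate(x_t, eps_hat, ref, alpha_bar_t, cutoff_bins):
    """Blend the denoiser's clean-signal estimate with a low-pass
    version of the reference speaker sample (illustrative sketch).

    x_t         : current noisy sample at diffusion step t
    eps_hat     : the denoising network's noise prediction for x_t
    ref         : clean waveform from the target speaker
    alpha_bar_t : cumulative noise-schedule product at step t
    """
    # Standard DDPM identity: recover the predicted clean signal
    # from the noise estimate.
    x0_hat = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_hat) / np.sqrt(alpha_bar_t)
    # Replace the estimate's low frequencies with the reference's,
    # leaving the high frequencies to the model.
    return x0_hat - low_pass(x0_hat, cutoff_bins) + low_pass(ref, cutoff_bins)
```

In a full sampler this guided estimate would feed the usual posterior step at every iteration, so the generated voice is pulled toward the target speaker's low-frequency (pitch/timbre) characteristics while the model fills in the rest.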

