DiffVoice: Text-to-Speech with Latent Diffusion

04/23/2023
by   Zhijun Liu, et al.
0

In this work, we present DiffVoice, a novel text-to-speech model based on latent diffusion. We propose to first encode speech signals into a phoneme-rate latent representation with a variational autoencoder enhanced by adversarial training, and then jointly model the duration and the latent representation with a diffusion model. Subjective evaluations on LJSpeech and LibriTTS datasets demonstrate that our method beats the best publicly available systems in naturalness. By adopting recent generative inverse problem solving algorithms for diffusion models, DiffVoice achieves the state-of-the-art performance in text-based speech editing, and zero-shot adaptation.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
06/13/2023

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models

In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that ...
research
05/22/2023

U-DiT TTS: U-Diffusion Vision Transformer for Text-to-Speech

Deep learning has led to considerable advances in text-to-speech synthes...
research
05/13/2021

Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech

Recently, denoising diffusion probabilistic models and generative score ...
research
07/28/2023

Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding

Recently, there has been a growing interest in text-to-speech (TTS) meth...
research
09/12/2021

Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration

Given a piece of speech and its transcript text, text-based speech editi...
research
06/13/2023

UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding

The utilization of discrete speech tokens, divided into semantic tokens ...
research
07/31/2023

Comparing normalizing flows and diffusion models for prosody and acoustic modelling in text-to-speech

Neural text-to-speech systems are often optimized on L1/L2 losses, which...

Please sign up or login with your details

Forgot password? Click here to reset