Diff-TTS: A Denoising Diffusion Model for Text-to-Speech

04/03/2021
by Myeonghun Jeong, et al.

Although neural text-to-speech (TTS) models have attracted considerable attention and succeed in generating human-like speech, there is still room to improve their naturalness and architectural efficiency. In this work, we propose a novel non-autoregressive TTS model, Diff-TTS, which achieves highly natural and efficient speech synthesis. Given the text, Diff-TTS exploits a denoising diffusion framework to transform a noise signal into a mel-spectrogram over a sequence of diffusion time steps. To learn the mel-spectrogram distribution conditioned on the text, we present a likelihood-based optimization method for TTS. Furthermore, to boost inference speed, we leverage an accelerated sampling method that allows Diff-TTS to generate raw waveforms much faster without significantly degrading perceptual quality. Through experiments, we verified that Diff-TTS generates speech 28 times faster than real time on a single NVIDIA 2080Ti GPU.
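The abstract names two mechanisms without giving detail: a denoising diffusion process that turns noise into a mel-spectrogram, and an accelerated sampler that visits fewer time steps. The sketch below is only an illustration of that general idea, not the authors' implementation. It assumes a linear noise schedule with made-up values, swaps in a DDIM-style deterministic update over a strided subset of time steps (the abstract does not say which sampler is used), and replaces the trained text-conditioned network with a hypothetical oracle noise predictor so the example runs on its own. All function names are invented for illustration.

```python
import math
import random


def make_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.05):
    """Cumulative product of (1 - beta_t) for a linear beta schedule.

    The schedule endpoints are illustrative assumptions, not values from the source.
    """
    alpha_bar, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def diffuse(x0, eps, a_bar_t):
    """Closed-form forward diffusion: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps."""
    return [math.sqrt(a_bar_t) * x + math.sqrt(1.0 - a_bar_t) * e
            for x, e in zip(x0, eps)]


def strided_steps(num_steps, stride):
    """Time steps visited during sampling, from the noisiest step downwards."""
    return list(range(num_steps - 1, -1, -stride))


def make_oracle(x0, alpha_bar):
    """Hypothetical stand-in for the trained noise-prediction network.

    It recovers the noise exactly because it is given the clean signal x0;
    a real model would have to estimate it from x_t, t, and the text.
    """
    def predict_noise(x_t, t):
        a = alpha_bar[t]
        return [(xt - math.sqrt(a) * x) / math.sqrt(1.0 - a)
                for xt, x in zip(x_t, x0)]
    return predict_noise


def sample(x_T, predict_noise, alpha_bar, steps):
    """Deterministic DDIM-style reverse process over the given subset of steps."""
    x = x_T
    for i, t in enumerate(steps):
        eps_hat = predict_noise(x, t)
        a_t = alpha_bar[t]
        # Estimate the clean signal implied by the current noise prediction.
        x0_hat = [(xi - math.sqrt(1.0 - a_t) * e) / math.sqrt(a_t)
                  for xi, e in zip(x, eps_hat)]
        # Jump to the next visited step; after the last one, return the clean estimate.
        a_prev = alpha_bar[steps[i + 1]] if i + 1 < len(steps) else 1.0
        x = [math.sqrt(a_prev) * c + math.sqrt(1.0 - a_prev) * e
             for c, e in zip(x0_hat, eps_hat)]
    return x
```

With a 400-step schedule and a stride of 57, `strided_steps(400, 57)` visits only 8 steps. In this toy setting the oracle makes the shortened trajectory land back on the clean vector, which is why skipping steps costs nothing here; with a learned predictor, fewer steps generally trade some quality for speed, which is the trade-off the abstract refers to.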

research
05/11/2023

CoMoSpeech: One-Step Speech and Singing Voice Synthesis via Consistency Model

Denoising diffusion probabilistic models (DDPMs) have shown promising pe...
research
01/28/2022

DiffGAN-TTS: High-Fidelity and Efficient Text-to-Speech with Denoising Diffusion GANs

Denoising diffusion probabilistic models (DDPMs) are expressive generati...
research
05/10/2023

Diffusion-based Signal Refiner for Speech Separation

We have developed a diffusion-based speech refiner that improves the ref...
research
08/31/2023

LightGrad: Lightweight Diffusion Probabilistic Model for Text-to-Speech

Recent advances in neural text-to-speech (TTS) models bring thousands of...
research
05/13/2021

Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech

Recently, denoising diffusion probabilistic models and generative score ...
research
07/31/2023

DiffProsody: Diffusion-based Latent Prosody Generation for Expressive Speech Synthesis with Prosody Conditional Adversarial Training

Expressive text-to-speech systems have undergone significant advancement...
research
03/15/2023

Speech Signal Improvement Using Causal Generative Diffusion Models

In this paper, we present a causal speech signal improvement system that...
