ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech

07/13/2022
by Rongjie Huang, et al.

Denoising diffusion probabilistic models (DDPMs) have recently achieved leading performance in many generative tasks. However, the cost of their inherently iterative sampling process hinders their deployment for text-to-speech. Through a preliminary study on diffusion model parameterization, we find that previous gradient-based TTS models require hundreds or thousands of iterations to guarantee high sample quality, which poses a challenge for accelerating sampling. In this work, we propose ProDiff, a progressive fast diffusion model for high-quality text-to-speech. Unlike previous work that estimates the gradient of the data density, ProDiff parameterizes the denoising model to directly predict clean data, avoiding the distinct quality degradation that accompanies accelerated sampling. To tackle the convergence challenge posed by fewer diffusion iterations, ProDiff reduces the data variance on the target side via knowledge distillation. Specifically, the denoising model uses the mel-spectrogram generated by an N-step DDIM teacher as its training target and distills that behavior into a new model with N/2 steps. This allows the TTS model to make sharp predictions and further reduces the sampling time by orders of magnitude. Our evaluation demonstrates that ProDiff needs only 2 iterations to synthesize high-fidelity mel-spectrograms, while maintaining sample quality and diversity competitive with state-of-the-art models that use hundreds of steps. ProDiff samples 24x faster than real time on a single NVIDIA 2080Ti GPU, making diffusion models practically applicable to text-to-speech deployment for the first time. Our extensive ablation studies demonstrate that each design choice in ProDiff is effective, and we further show that ProDiff can easily be extended to the multi-speaker setting. Audio samples are available at <https://ProDiff.github.io/>.
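To make the abstract's two key ideas concrete, the sketch below illustrates (a) the clean-data (x0) parameterization of the denoiser and (b) one round of teacher-student distillation, in which a single student step is trained to reproduce two deterministic DDIM steps of the teacher. This is a minimal reconstruction from the abstract and the standard DDIM/progressive-distillation formulation, not the paper's actual code: the names ddim_step, distill_loss, student, teacher, and alpha_bar are hypothetical, and both networks are assumed to map a noisy mel-spectrogram plus a step index to a clean-data estimate.

```python
import torch
import torch.nn.functional as F

def ddim_step(x_t, x0_pred, t, t_prev, alpha_bar):
    """One deterministic DDIM update under a clean-data (x0) parameterization.

    alpha_bar is the cumulative noise schedule (a 1-D tensor); t and t_prev
    are scalar step indices with t > t_prev.
    """
    a_t, a_prev = alpha_bar[t], alpha_bar[t_prev]
    # Noise implied by the clean-data prediction at step t ...
    eps = (x_t - a_t.sqrt() * x0_pred) / (1.0 - a_t).sqrt()
    # ... carried along the deterministic DDIM trajectory to step t_prev.
    return a_prev.sqrt() * x0_pred + (1.0 - a_prev).sqrt() * eps

def distill_loss(student, teacher, x0, t, t_mid, t_prev, alpha_bar):
    """One student step (t -> t_prev) is trained to match two teacher
    steps (t -> t_mid -> t_prev), halving the sampling budget."""
    # Forward-diffuse a clean mel-spectrogram to timestep t.
    noise = torch.randn_like(x0)
    a_t = alpha_bar[t]
    x_t = a_t.sqrt() * x0 + (1.0 - a_t).sqrt() * noise
    # The frozen N-step DDIM teacher provides the low-variance target.
    with torch.no_grad():
        x_mid = ddim_step(x_t, teacher(x_t, t), t, t_mid, alpha_bar)
        target = ddim_step(x_mid, teacher(x_mid, t_mid), t_mid, t_prev, alpha_bar)
    # The N/2-step student must cover the same span in a single step.
    pred = ddim_step(x_t, student(x_t, t), t, t_prev, alpha_bar)
    return F.mse_loss(pred, target)
```

Repeating this round, with each converged student becoming the next teacher, progressively halves the iteration count, which is how the 2-step sampler described in the abstract is reached.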


Related research

05/11/2023
CoMoSpeech: One-Step Speech and Singing Voice Synthesis via Consistency Model
Denoising diffusion probabilistic models (DDPMs) have shown promising pe...

01/28/2022
DiffGAN-TTS: High-Fidelity and Efficient Text-to-Speech with Denoising Diffusion GANs
Denoising diffusion probabilistic models (DDPMs) are expressive generati...

06/09/2023
Boosting Fast and High-Quality Speech Synthesis with Linear Diffusion
Denoising Diffusion Probabilistic Models have shown extraordinary abilit...

05/24/2023
Alleviating Exposure Bias in Diffusion Models through Sampling with Shifted Time Steps
Denoising Diffusion Probabilistic Models (DDPM) have shown remarkable ef...

05/18/2023
Catch-Up Distillation: You Only Need to Train Once for Accelerating Sampling
Diffusion Probability Models (DPMs) have made impressive advancements in...

08/12/2023
Accelerating Diffusion-based Combinatorial Optimization Solvers by Progressive Distillation
Graph-based diffusion models have shown promising results in terms of ge...

09/18/2023
Speeding Up Speech Synthesis In Diffusion Models By Reducing Data Distribution Recovery Steps Via Content Transfer
Diffusion based vocoders have been criticised for being slow due to the ...
