CoMoSpeech: One-Step Speech and Singing Voice Synthesis via Consistency Model

05/11/2023
by Zhen Ye, et al.

Denoising diffusion probabilistic models (DDPMs) have shown promising performance for speech synthesis. However, a large number of iterative steps are required to achieve high sample quality, which restricts inference speed. Maintaining sample quality while increasing sampling speed has therefore become a challenging task. In this paper, we propose a "Co"nsistency "Mo"del-based "Speech" synthesis method, CoMoSpeech, which achieves high-quality speech synthesis in a single diffusion sampling step. A consistency constraint is applied to distill a consistency model from a well-designed diffusion-based teacher model, ultimately yielding superior performance in the distilled CoMoSpeech. Our experiments show that, by generating audio recordings in a single sampling step, CoMoSpeech achieves an inference speed more than 150 times faster than real-time on a single NVIDIA A100 GPU, comparable to FastSpeech2, making diffusion-sampling-based speech synthesis truly practical. Meanwhile, objective and subjective evaluations on text-to-speech and singing voice synthesis show that the proposed teacher models yield the best audio quality, and the one-step-sampling CoMoSpeech achieves the best inference speed with audio quality better than or comparable to conventional multi-step diffusion model baselines. Audio samples are available at https://comospeech.github.io/.
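The core idea of consistency distillation — training a student to map any point on the teacher's probability-flow ODE trajectory directly to its endpoint, so that sampling takes a single step — can be illustrated on a toy problem. The sketch below is NOT the paper's code: the "teacher" here is the analytic probability-flow ODE for 1-D Gaussian data N(mu, 1) under variance-exploding noise (in CoMoSpeech the teacher is a learned mel-spectrogram diffusion model), and the student is a one-parameter linear model whose optimum is theta = mu. It shows the distillation loop: take one Euler step of the teacher ODE between adjacent noise levels, then regress the student's output at the higher noise level onto its (stop-gradient) output at the lower one.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 2.0                       # mean of the toy data distribution N(mu, 1)

def s(t):
    # Marginal std of x_t = x_0 + t*eps when x_0 ~ N(mu, 1): sqrt(1 + t^2).
    return np.sqrt(1.0 + t**2)

def teacher_drift(x, t):
    # Probability-flow ODE drift for this Gaussian case, using the exact
    # score (mu - x)/(1 + t^2):  dx/dt = t * (x - mu) / (1 + t^2).
    return t * (x - mu) / (1.0 + t**2)

def student(x, t, theta):
    # Consistency model f(x, t): skip connection to x plus a learned offset.
    # The boundary condition f(x, 0) = x holds by construction.
    return x / s(t) + theta * (1.0 - 1.0 / s(t))

ts = np.linspace(0.01, 5.0, 20)   # discretized noise-level grid
theta = 0.0                        # student parameter; should learn mu
losses = []
for step in range(200):
    n = rng.integers(0, len(ts) - 1)
    t_lo, t_hi = ts[n], ts[n + 1]
    x0 = mu + rng.standard_normal(64)              # clean samples
    x_hi = x0 + t_hi * rng.standard_normal(64)     # noised to level t_hi
    # Teacher: one Euler step of the PF-ODE from t_hi down to t_lo.
    x_lo = x_hi + (t_lo - t_hi) * teacher_drift(x_hi, t_hi)
    target = student(x_lo, t_lo, theta)            # stop-gradient target
    pred = student(x_hi, t_hi, theta)
    resid = pred - target
    losses.append(float(np.mean(resid**2)))
    # Manual MSE gradient w.r.t. theta (only `pred` depends on theta here).
    grad = np.mean(2.0 * resid * (1.0 - 1.0 / s(t_hi)))
    theta -= 0.5 * grad

# After distillation, one-step "sampling" maps pure noise straight to data:
# student(x_T, t_max, theta) collapses the whole trajectory into one call.
```

One design point the toy mirrors from consistency models generally: the parameterization guarantees f(x, 0) = x exactly, so the consistency loss only has to enforce agreement between adjacent timesteps, and self-consistency then propagates the boundary condition to all noise levels.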

