Comparing normalizing flows and diffusion models for prosody and acoustic modelling in text-to-speech

07/31/2023
by Guangyan Zhang, et al.

Neural text-to-speech systems are often optimized on L1/L2 losses, which make strong assumptions about the distributions of the target data space. Aiming to improve on those assumptions, Normalizing Flows and Diffusion Probabilistic Models were recently proposed as alternatives. In this paper, we compare traditional L1/L2-based approaches to diffusion and flow-based approaches for the tasks of prosody and mel-spectrogram prediction for text-to-speech synthesis. We use a prosody model to generate log-f0 and duration features, which are used to condition an acoustic model that generates mel-spectrograms. Experimental results demonstrate that the flow-based model achieves the best performance for spectrogram prediction, improving over equivalent diffusion and L1 models. Meanwhile, both diffusion and flow-based prosody predictors result in significant improvements over a typical L2-trained prosody model.
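
The distributional assumptions behind L1/L2 losses can be illustrated with a minimal sketch. It is not taken from the paper: the function names and the toy numbers below are hypothetical. It relies only on the standard facts that an L2 loss matches a fixed-variance Gaussian negative log-likelihood up to a scale and a constant, and that an L1 loss matches a fixed-scale Laplace negative log-likelihood up to a constant.

```python
import math

def gaussian_nll(y, mu, sigma=1.0):
    """Negative log-likelihood of y under a Gaussian N(mu, sigma^2)."""
    return 0.5 * ((y - mu) / sigma) ** 2 + math.log(sigma) + 0.5 * math.log(2 * math.pi)

def laplace_nll(y, mu, b=1.0):
    """Negative log-likelihood of y under a Laplace(mu, b) distribution."""
    return abs(y - mu) / b + math.log(2 * b)

def l2_loss(y, mu):
    return (y - mu) ** 2

def l1_loss(y, mu):
    return abs(y - mu)

def total(fn, targets, pred):
    """Sum a per-sample loss over all targets for one constant prediction."""
    return sum(fn(y, pred) for y in targets)

# Hypothetical toy targets and two candidate predictions (illustration only).
targets = [0.0, 1.0, 4.0]
pred_a, pred_b = 1.0, 2.0

# The constants cancel when two predictions are compared, so ranking by
# L2 (or L1) is the same as ranking by Gaussian (or Laplace) likelihood.
gauss_gap = total(gaussian_nll, targets, pred_a) - total(gaussian_nll, targets, pred_b)
l2_gap = total(l2_loss, targets, pred_a) - total(l2_loss, targets, pred_b)
laplace_gap = total(laplace_nll, targets, pred_a) - total(laplace_nll, targets, pred_b)
l1_gap = total(l1_loss, targets, pred_a) - total(l1_loss, targets, pred_b)

# For two-mode targets, the constant prediction that minimizes L2 is the
# mean, which here lies between the modes and equals none of the targets.
bimodal = [-1.0, -1.0, 1.0, 1.0]
l2_optimum = sum(bimodal) / len(bimodal)
```

In this toy setting, `gauss_gap` is half of `l2_gap` and `laplace_gap` equals `l1_gap`, while `l2_optimum` sits between the two modes. Flow and diffusion models avoid committing to a single fixed-shape output distribution of this kind.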

research
09/10/2023

VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching

Although diffusion models in text-to-speech have become a popular choice...
research
08/21/2023

Multi-GradSpeech: Towards Diffusion-based Multi-Speaker Text-to-speech Using Consistent Diffusion Models

Despite imperfect score-matching causing drift in training and sampling ...
research
11/13/2022

OverFlow: Putting flows on top of neural transducers for better TTS

Neural HMMs are a type of neural transducer recently proposed for sequen...
research
05/22/2023

U-DiT TTS: U-Diffusion Vision Transformer for Text-to-Speech

Deep learning has led to considerable advances in text-to-speech synthes...
research
05/06/2021

DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism

Singing voice synthesis (SVS) system is built to synthesize high-quality...
research
04/23/2023

DiffVoice: Text-to-Speech with Latent Diffusion

In this work, we present DiffVoice, a novel text-to-speech model based o...
research
05/22/2023

Duplex Diffusion Models Improve Speech-to-Speech Translation

Speech-to-speech translation is a typical sequence-to-sequence learning ...