DiffProsody: Diffusion-based Latent Prosody Generation for Expressive Speech Synthesis with Prosody Conditional Adversarial Training

07/31/2023
by   Hyung-Seok Oh, et al.
0

Expressive text-to-speech systems have undergone significant advancements owing to prosody modeling, but conventional methods can still be improved. Traditional approaches have relied on the autoregressive method to predict the quantized prosody vector; however, it suffers from the issues of long-term dependency and slow inference. This study proposes a novel approach called DiffProsody in which expressive speech is synthesized using a diffusion-based latent prosody generator and prosody conditional adversarial training. Our findings confirm the effectiveness of our prosody generator in generating a prosody vector. Furthermore, our prosody conditional discriminator significantly improves the quality of the generated speech by accurately emulating prosody. We use denoising diffusion generative adversarial networks to improve the prosody generation speed. Consequently, DiffProsody is capable of generating prosody 16 times faster than the conventional diffusion model. The superior performance of our proposed method has been demonstrated via experiments.

READ FULL TEXT

page 1

page 8

page 10

page 12

research
08/03/2023

Adversarial Training of Denoising Diffusion Model Using Dual Discriminators for High-Fidelity Multi-Speaker TTS

The diffusion model is capable of generating high-quality data through a...
research
10/27/2020

Parallel waveform synthesis based on generative adversarial networks with voicing-aware conditional discriminators

This paper proposes voicing-aware conditional discriminators for Paralle...
research
11/29/2021

Vector Quantized Diffusion Model for Text-to-Image Synthesis

We present the vector quantized diffusion (VQ-Diffusion) model for text-...
research
10/03/2022

WaveFit: An Iterative and Non-autoregressive Neural Vocoder based on Fixed-Point Iteration

Denoising diffusion probabilistic models (DDPMs) and generative adversar...
research
04/03/2021

Diff-TTS: A Denoising Diffusion Model for Text-to-Speech

Although neural text-to-speech (TTS) models have attracted a lot of atte...
research
09/13/2023

DCTTS: Discrete Diffusion Model with Contrastive Learning for Text-to-speech Generation

In the Text-to-speech(TTS) task, the latent diffusion model has excellen...
research
12/22/2020

Generating Long-term Continuous Multi-type Generation Profiles using Generative Adversarial Network

Today, the adoption of new technologies has increased power system dynam...

Please sign up or login with your details

Forgot password? Click here to reset