Multi-GradSpeech: Towards Diffusion-based Multi-Speaker Text-to-speech Using Consistent Diffusion Models

08/21/2023
by Heyang Xue, et al.

Although imperfect score matching causes a drift between the training and sampling distributions of diffusion models, recent diffusion-based acoustic models, with Grad-TTS as a prime example, have revolutionized single-speaker Text-to-Speech (TTS) when sufficient data is available. In practice, however, this sampling drift causes these approaches to struggle in multi-speaker scenarios, where the target data distribution is more complex than in the single-speaker case. In this paper, we present Multi-GradSpeech, a multi-speaker diffusion-based acoustic model that adopts the Consistent Diffusion Model (CDM) as its generative modeling approach. We enforce the consistency property of CDM during training to alleviate sampling drift at inference, resulting in significant improvements in multi-speaker TTS performance. Our experimental results show that the proposed approach improves performance across the different speakers in multi-speaker TTS compared to Grad-TTS, and even outperforms the fine-tuning approach. Audio samples are available at https://welkinyang.github.io/multi-gradspeech/

research
11/17/2022

Any-speaker Adaptive Text-To-Speech Synthesis with Diffusion Models

There has been significant progress in Text-To-Speech (TTS) synthesis ...
research
02/17/2023

Consistent Diffusion Models: Mitigating Sampling Drift by Learning to be Consistent

Imperfect score-matching leads to a shift between the training and the s...
research
01/19/2023

Understanding the diffusion models by conditional expectations

This paper provides several mathematical analyses of the diffusion model ...
research
07/31/2023

Comparing normalizing flows and diffusion models for prosody and acoustic modelling in text-to-speech

Neural text-to-speech systems are often optimized on L1/L2 losses, which...
research
09/10/2023

VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching

Although diffusion models in text-to-speech have become a popular choice...
research
07/17/2023

Vocoder drift compensation by x-vector alignment in speaker anonymisation

For the most popular x-vector-based approaches to speaker anonymisation,...
research
09/24/2021

A data acquisition setup for data driven acoustic design

In this paper, we present a novel interdisciplinary approach to study th...
