U-DiT TTS: U-Diffusion Vision Transformer for Text-to-Speech

05/22/2023
by Xin Jing, et al.

Deep learning has led to considerable advances in text-to-speech synthesis. Most recently, Score-based Generative Models (SGMs), also known as Diffusion Probabilistic Models (DPMs), have gained traction in neural speech synthesis systems because they produce high-quality synthesized speech. In SGMs, the U-Net architecture and its variants have dominated as the backbone since their first successful adoption. In this research, we focus on the neural network in diffusion-model-based Text-to-Speech (TTS) systems and propose the U-DiT architecture, exploring the potential of the vision transformer (ViT) architecture as the core component of the diffusion model in a TTS system. The modular design of the U-DiT architecture, which inherits the best parts of U-Net and ViT, allows for great scalability and versatility across different data scales. The proposed U-DiT TTS system is a mel-spectrogram-based acoustic model and uses a pretrained HiFi-GAN as the vocoder. The objective (i.e., Fréchet distance) and mean opinion score (MOS) results show that our DiT-TTS system achieves state-of-the-art performance on the single-speaker dataset LJSpeech. Our demos are publicly available at: https://eihw.github.io/u-dit-tts/
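
The abstract names the Fréchet distance as its objective metric without spelling it out. As a rough illustration, a Fréchet distance between two sets of feature vectors is commonly computed by fitting a Gaussian to each set and evaluating ||mu_a - mu_b||^2 + Tr(S_a + S_b - 2 (S_a S_b)^{1/2}). The sketch below assumes that standard Gaussian form; the function name `frechet_distance` is hypothetical, and the feature extractor used in the paper is not stated in the abstract, so the inputs here are simply generic feature matrices.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Squared Frechet distance between Gaussians fitted to two feature sets.

    Each input has shape (num_samples, feature_dim). This is an illustrative
    sketch, not the evaluation code used in the paper.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(feats_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(feats_b, rowvar=False))
    # Tr((cov_a cov_b)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative in exact
    # arithmetic; clip tiny negative values caused by rounding.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)
```

Two identical feature sets give a distance of approximately zero, and shifting every sample by a constant vector adds the squared length of that vector, since the covariances are unchanged.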


