StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models

06/13/2023
by Yinghao Aaron Li, et al.

In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis. StyleTTS 2 differs from its predecessor by modeling styles as a latent random variable through diffusion models, which lets it generate the most suitable style for the text without requiring reference speech. This achieves efficient latent diffusion while benefiting from the diverse speech synthesis offered by diffusion models. Furthermore, we employ large pre-trained SLMs, such as WavLM, as discriminators with our novel differentiable duration modeling for end-to-end training, resulting in improved speech naturalness. StyleTTS 2 surpasses human recordings on the single-speaker LJSpeech dataset and matches them on the multispeaker VCTK dataset, as judged by native English speakers. Moreover, when trained on the LibriTTS dataset, our model outperforms previous publicly available models for zero-shot speaker adaptation. This work achieves the first human-level TTS on both single-speaker and multispeaker datasets, showcasing the potential of style diffusion and adversarial training with large SLMs. The audio demos and source code are available at https://styletts2.github.io/.

