StyleTTS: A Style-Based Generative Model for Natural and Diverse Text-to-Speech Synthesis

05/30/2022
by Yinghao Aaron Li, et al.

Text-to-Speech (TTS) has recently seen great progress in synthesizing high-quality speech owing to the rapid development of parallel TTS systems, but producing speech with naturalistic prosodic variations, speaking styles and emotional tones remains challenging. Moreover, since duration and speech are generated separately, parallel TTS models still have problems finding the best monotonic alignments that are crucial for naturalistic speech synthesis. Here, we propose StyleTTS, a style-based generative model for parallel TTS that can synthesize diverse speech with natural prosody from a reference speech utterance. With a novel Transferable Monotonic Aligner (TMA) and duration-invariant data augmentation schemes, our method significantly outperforms state-of-the-art models on both single- and multi-speaker datasets in subjective tests of speech naturalness and speaker similarity. Through self-supervised learning of the speaking styles, our model can synthesize speech with the same prosodic and emotional tone as any given reference speech without the need for explicitly labeling these categories.

research
11/02/2022

Multi-Speaker Multi-Style Speech Synthesis with Timbre and Style Disentanglement

Disentanglement of a speaker's timbre and style is very important for st...
research
05/19/2023

MParrotTTS: Multilingual Multi-speaker Text to Speech Synthesis in Low Resource Setting

We present MParrotTTS, a unified multilingual, multi-speaker text-to-spe...
research
04/04/2019

In Other News: A Bi-style Text-to-speech Model for Synthesizing Newscaster Voice with Limited Data

Neural text-to-speech synthesis (NTTS) models have shown significant pro...
research
05/22/2020

Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search

Recently, text-to-speech (TTS) models such as FastSpeech and ParaNet hav...
research
02/01/2021

Universal Neural Vocoding with Parallel WaveNet

We present a universal neural vocoder based on Parallel WaveNet, with an...
research
07/06/2021

AdaSpeech 3: Adaptive Text to Speech for Spontaneous Style

While recent text to speech (TTS) models perform very well in synthesizi...
research
08/10/2023

EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis

Recent work has shown that it is possible to resynthesize high-quality s...