PromptTTS: Controllable Text-to-Speech with Text Descriptions

11/22/2022
by Zhifang Guo, et al.

Using a text description as a prompt to guide the generation of text or images (e.g., GPT-3 or DALL-E 2) has drawn wide attention recently. Beyond text and image generation, in this work we explore the possibility of using text descriptions to guide speech synthesis. To this end, we develop a text-to-speech (TTS) system, dubbed PromptTTS, that takes a prompt with both style and content descriptions as input and synthesizes the corresponding speech. Specifically, PromptTTS consists of a style encoder and a content encoder, which extract the corresponding representations from the prompt, and a speech decoder, which synthesizes speech from the extracted style and content representations. Previous work in controllable TTS requires users to have acoustic knowledge to understand style factors such as prosody and pitch. PromptTTS is more user-friendly, since a text description is a more natural way to express speech style (e.g., "A lady whispers to her friend slowly"). Because no existing TTS dataset includes prompts, we construct and release a benchmark dataset for this task, containing prompts with style and content information together with the corresponding speech. Experiments show that PromptTTS can generate speech with precise style control and high speech quality. Audio samples and our dataset are publicly available.
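The data flow described in the abstract (prompt, then style and content encoders, then speech decoder) can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name below (`Prompt`, `style_encoder`, `content_encoder`, `speech_decoder`, `synthesize`) is hypothetical, and the three components are toy stand-ins for what would be learned neural networks in a real system. It only shows how the two representations are extracted separately and combined in the decoder.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Prompt:
    """Hypothetical container for the two parts of a prompt."""
    style: str    # style description, e.g. "A lady whispers to her friend slowly"
    content: str  # the text that should be spoken


def style_encoder(style_text: str, dim: int = 4) -> List[float]:
    """Toy stand-in: bucket each word into a fixed-size count vector.

    A real style encoder would be a learned text model (assumption).
    """
    vec = [0.0] * dim
    for word in style_text.lower().split():
        vec[sum(ord(c) for c in word) % dim] += 1.0
    return vec


def content_encoder(content_text: str) -> List[int]:
    """Toy stand-in: one integer id per character of the content text."""
    return [ord(c) for c in content_text.lower()]


def speech_decoder(style_vec: List[float], content_ids: List[int]) -> List[List[float]]:
    """Toy stand-in: emit one 'frame' per content token.

    Each frame is the token id followed by the style vector, i.e. the
    decoder is conditioned on style by simple concatenation (assumption).
    """
    return [[float(tok)] + style_vec for tok in content_ids]


def synthesize(prompt: Prompt) -> List[List[float]]:
    """Run the prompt through both encoders, then the decoder."""
    style_vec = style_encoder(prompt.style)
    content_ids = content_encoder(prompt.content)
    return speech_decoder(style_vec, content_ids)


frames = synthesize(Prompt(style="A lady whispers to her friend slowly",
                           content="hello"))
print(len(frames), len(frames[0]))
```

In this sketch, changing only `prompt.style` alters the style portion of every frame while the content portion stays fixed, which mirrors the separation between style control and spoken content that the abstract describes.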

research
08/28/2023

TextrolSpeech: A Text Style Control Speech Corpus With Codec Language Text-to-Speech Models

Recently, there has been a growing interest in the field of controllable...
research
10/23/2020

Show and Speak: Directly Synthesize Spoken Description of Images

This paper proposes a new model, referred to as the show and speak (SAS)...
research
08/14/2019

Dual Adversarial Inference for Text-to-Image Synthesis

Synthesizing images from a given text description involves engaging two ...
research
10/06/2021

Style Equalization: Unsupervised Learning of Controllable Generative Sequence Models

Controllable generative sequence models with the capability to extract a...
research
01/21/2022

Classroom Slide Narration System

Slide presentations are an effective and efficient tool used by the teac...
research
05/14/2020

S2IGAN: Speech-to-Image Generation via Adversarial Learning

An estimated half of the world's languages do not have a written form, m...
research
09/22/2020

PodSumm – Podcast Audio Summarization

The diverse nature, scale, and specificity of podcasts present a unique ...
