TextrolSpeech: A Text Style Control Speech Corpus With Codec Language Text-to-Speech Models

by Shengpeng Ji, et al.

Recently, there has been growing interest in controllable Text-to-Speech (TTS). Previous studies have relied on users either providing specific style factor values based on acoustic knowledge or selecting reference speech that meets certain requirements, so generating speech solely from natural text prompts has emerged as a new challenge for researchers. This challenge arises from the scarcity of high-quality speech datasets with natural-text style prompts and the absence of advanced text-controllable TTS models. In light of this, 1) we propose TextrolSpeech, which is the first large-scale speech emotion dataset annotated with rich text attributes. The dataset comprises 236,220 pairs of style prompts, written as natural text descriptions covering five style factors, and corresponding speech samples. Through iterative experimentation, we introduce a multi-stage prompt programming approach that effectively utilizes the GPT model to generate natural style descriptions in large volumes. 2) Furthermore, to address the need for generating audio with greater style diversity, we propose an efficient architecture called Salle. This architecture treats text-controllable TTS as a language modeling task, using audio codec codes as an intermediate representation in place of the conventional mel-spectrogram. Finally, we demonstrate the ability of the proposed model by showing comparable performance on the controllable TTS task. Audio samples are available at https://sall-e.github.io/




PromptTTS: Controllable Text-to-Speech with Text Descriptions

Using a text description as prompt to guide the generation of text or im...

PromptTTS++: Controlling Speaker Identity in Prompt-Based Text-to-Speech Using Natural Language Descriptions

We propose PromptTTS++, a prompt-based text-to-speech (TTS) synthesis sy...

Audio Generation with Multiple Conditional Diffusion Model

Text-based audio generation models have limitations as they cannot encom...

Prosody-controllable spontaneous TTS with neural HMMs

Spontaneous speech has many affective and pragmatic functions that are i...

Varianceflow: High-Quality and Controllable Text-to-Speech using Variance Information via Normalizing Flow

There are two types of methods for non-autoregressive text-to-speech mod...

PoeticTTS – Controllable Poetry Reading for Literary Studies

Speech synthesis for poetry is challenging due to specific intonation pa...

Sudowoodo: a Chinese Lyric Imitation System with Source Lyrics

Lyrics generation is a well-known application in natural language genera...
