PromptTTS++: Controlling Speaker Identity in Prompt-Based Text-to-Speech Using Natural Language Descriptions

09/15/2023
by Reo Shimizu, et al.

We propose PromptTTS++, a prompt-based text-to-speech (TTS) synthesis system that allows control over speaker identity using natural language descriptions. To control speaker identity within the prompt-based TTS framework, we introduce the concept of a speaker prompt, which describes voice characteristics (e.g., gender-neutral, young, old, and muffled) and is designed to be approximately independent of speaking style. Since there is no large-scale dataset containing speaker prompts, we first construct a dataset based on the LibriTTS-R corpus with manually annotated speaker prompts. We then employ a diffusion-based acoustic model with mixture density networks to model diverse speaker factors in the training data. Unlike previous studies that rely on style prompts describing only limited aspects of speaker individuality, such as pitch, speaking speed, and energy, our method uses an additional speaker prompt to effectively learn the mapping from natural language descriptions to the acoustic features of diverse speakers. Our subjective evaluation results show that the proposed method can better control speaker characteristics than methods without the speaker prompt. Audio samples are available at https://reppy4620.github.io/demo.promptttspp/.

research
08/28/2023

TextrolSpeech: A Text Style Control Speech Corpus With Codec Language Text-to-Speech Models

Recently, there has been a growing interest in the field of controllable...
research
11/17/2022

Any-speaker Adaptive Text-To-Speech Synthesis with Diffusion Models

There has been a significant progress in Text-To-Speech (TTS) synthesis ...
research
05/20/2021

Speaker disentanglement in video-to-speech conversion

The task of video-to-speech aims to translate silent video of lip moveme...
research
01/31/2023

InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt

Expressive text-to-speech (TTS) aims to synthesize different speaking st...
research
06/13/2023

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models

In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that ...
research
09/16/2019

Communication-based Evaluation for Natural Language Generation

Natural language generation (NLG) systems are commonly evaluated using n...
research
05/13/2023

Vocal Style Factorization for Effective Speaker Recognition in Affective Scenarios

The accuracy of automated speaker recognition is negatively impacted by ...
