Using a Large Language Model to Control Speaking Style for Expressive TTS

05/17/2023
by Atli Thor Sigurgeirsson, et al.

Appropriate prosody is critical for successful spoken communication. Contextual word embeddings have proven helpful for predicting prosody, but they do not allow choosing between plausible prosodic renditions. Reference-based TTS models attempt to address this by conditioning speech generation on a reference speech sample. These models can generate expressive speech, but doing so requires finding an appropriate reference. Sufficiently large generative language models have been used to solve a variety of language-related tasks. We explore whether such models can be used to suggest appropriate prosody for expressive TTS. We train a TTS model on a non-expressive corpus and then prompt the language model to suggest changes to pitch, energy, and duration. The prompt can be designed for any task; here, we prompt the model to make suggestions based on a target speaking style and dialogue context. The proposed method is rated most appropriate in 49.9% of cases, compared to 31.0% for a baseline model.
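As a rough illustration of the prompting idea described above, the sketch below asks a language model for word-level pitch, energy, and duration adjustments given a target speaking style and dialogue context, then clamps them before conditioning a TTS model. This is a minimal sketch, not the paper's exact protocol: the `query_llm` callable, the JSON response schema, and the multiplier ranges are assumptions introduced for illustration.

```python
# Minimal sketch of LLM-suggested prosody control (assumed interface, not the
# authors' exact prompt or output format).
import json


def build_prompt(utterance: str, context: str, style: str) -> str:
    """Compose a prompt asking for per-word prosody adjustments."""
    return (
        "You control prosody for a text-to-speech system.\n"
        f"Dialogue context: {context}\n"
        f"Target speaking style: {style}\n"
        f"Utterance: {utterance}\n"
        "For each word, suggest relative adjustments to pitch, energy and "
        "duration as multipliers around 1.0. Respond with a JSON list of "
        'objects with keys "word", "pitch", "energy", "duration".'
    )


def suggest_prosody(utterance, context, style, query_llm):
    """Return per-word prosody multipliers suggested by the LLM.

    `query_llm` is a hypothetical callable that sends a prompt string to a
    large language model and returns its text response.
    """
    raw = query_llm(build_prompt(utterance, context, style))
    suggestions = json.loads(raw)
    # Clamp to a plausible range before passing the values to the TTS model's
    # pitch/energy/duration controls.
    clamp = lambda x: min(max(float(x), 0.5), 2.0)
    return [
        {
            "word": s["word"],
            "pitch": clamp(s["pitch"]),
            "energy": clamp(s["energy"]),
            "duration": clamp(s["duration"]),
        }
        for s in suggestions
    ]
```

The resulting multipliers would then be fed to whatever word-level prosody controls the underlying TTS model exposes; how that conditioning is implemented depends on the TTS architecture and is not shown here.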

