Word-Level Style Control for Expressive, Non-attentive Speech Synthesis

11/19/2021
by Konstantinos Klapsas, et al.

This paper presents an expressive speech synthesis architecture for modeling and controlling speaking style at the word level. The model attempts to learn word-level stylistic and prosodic representations of the speech data with the aid of two encoders. The first models style by finding a combination of style tokens for each word given the acoustic features; the second outputs a word-level sequence conditioned only on the phonetic information, in order to disentangle it from the style information. The two encoder outputs are aligned and concatenated with the phoneme encoder outputs and then decoded with a Non-Attentive Tacotron model. An additional prior encoder predicts the style tokens autoregressively, so that the model can run without a reference utterance. We find that the resulting model gives both word-level and global control over the style, as well as prosody transfer capabilities.
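The data flow described in the abstract (per-word combination of style tokens, alignment of word-level features to the phoneme sequence, and concatenation with the phoneme encoder outputs) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the attention mechanism, the dimensions, or the alignment procedure, so scaled dot-product attention, repetition by phoneme count, and every name below (`word_style_embeddings`, `align_to_phonemes`, the random stand-in arrays) are assumptions made for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def word_style_embeddings(word_queries, style_tokens):
    """Combine style tokens per word (assumed: scaled dot-product attention)."""
    scores = word_queries @ style_tokens.T / np.sqrt(style_tokens.shape[1])
    weights = softmax(scores, axis=-1)      # (num_words, num_tokens)
    return weights @ style_tokens, weights  # (num_words, dim)


def align_to_phonemes(word_features, phonemes_per_word):
    """Repeat each word-level vector once per phoneme in that word (assumed)."""
    return np.repeat(word_features, phonemes_per_word, axis=0)


rng = np.random.default_rng(0)
num_words, num_tokens, dim = 3, 4, 8
phonemes_per_word = np.array([2, 3, 1])  # hypothetical word lengths

# Random stand-ins for learned quantities (hypothetical values).
style_tokens = rng.normal(size=(num_tokens, dim))
word_queries = rng.normal(size=(num_words, dim))    # style encoder output
word_phonetic = rng.normal(size=(num_words, dim))   # second, phonetic-only encoder
phoneme_enc = rng.normal(size=(int(phonemes_per_word.sum()), dim))

style, weights = word_style_embeddings(word_queries, style_tokens)
aligned_style = align_to_phonemes(style, phonemes_per_word)
aligned_phonetic = align_to_phonemes(word_phonetic, phonemes_per_word)

# Phoneme-level decoder input: phoneme encoding plus both aligned word-level streams.
decoder_input = np.concatenate(
    [phoneme_enc, aligned_style, aligned_phonetic], axis=-1
)
print(decoder_input.shape)
```

In this sketch, editing a single row of `weights` would change the style of one word only, while applying the same weights to every row would act as a global style setting, which mirrors the word-level and global control the abstract reports.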

research
11/01/2017

Uncovering Latent Style Factors for Expressive Speech Synthesis

Prosodic modeling is a core problem in speech synthesis. The key challen...
research
08/30/2023

The DeepZen Speech Synthesis System for Blizzard Challenge 2023

This paper describes the DeepZen text to speech (TTS) system for Blizzar...
research
05/17/2023

Using a Large Language Model to Control Speaking Style for Expressive TTS

Appropriate prosody is critical for successful spoken communication. Con...
research
04/04/2019

Multi-reference Tacotron by Intercross Training for Style Disentangling, Transfer and Control in Speech Synthesis

Speech style control and transfer techniques aim to enrich the diversity...
research
01/26/2023

On granularity of prosodic representations in expressive text-to-speech

In expressive speech synthesis it is widely adopted to use latent prosod...
research
07/29/2023

MSStyleTTS: Multi-Scale Style Modeling with Hierarchical Context Information for Expressive Speech Synthesis

Expressive speech synthesis is crucial for many human-computer interacti...
research
10/12/2021

Fine-grained style control in Transformer-based Text-to-speech Synthesis

In this paper, we present a novel architecture to realize fine-grained s...
