On granularity of prosodic representations in expressive text-to-speech

01/26/2023
by Mikolaj Babianski, et al.

In expressive speech synthesis, latent prosody representations are widely used to deal with the variability of the data during training. The same text may correspond to various acoustic realizations, which is known as the one-to-many mapping problem in text-to-speech. Utterance-, word-, or phoneme-level representations are extracted from the target signal in an auto-encoding setup to complement the phonetic input and simplify that mapping. This paper compares prosodic embeddings at different levels of granularity and examines their prediction from text. We show that utterance-level embeddings have insufficient capacity, while phoneme-level embeddings tend to introduce instabilities when predicted from text. Word-level representations strike a balance between capacity and predictability. As a result, we close the gap in naturalness by 90% without sacrificing intelligibility.
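
To make the three levels of granularity concrete, the following is a minimal sketch, not the authors' encoder. It assumes that frame-level acoustic features are already available and that phoneme and word boundaries are known, and it uses plain mean pooling as a stand-in for a learned reference encoder. The names `mean_pool`, `phoneme_spans`, and `word_spans`, as well as the toy feature values, are hypothetical.

```python
from typing import List, Sequence, Tuple

Frame = List[float]


def mean_pool(frames: Sequence[Frame], spans: Sequence[Tuple[int, int]]) -> List[Frame]:
    """Average the frame vectors inside each [start, end) span.

    Mean pooling is an assumption made for illustration; a real system
    would typically use a learned encoder here.
    """
    pooled = []
    for start, end in spans:
        segment = frames[start:end]
        dim = len(segment[0])
        pooled.append([sum(f[d] for f in segment) / len(segment) for d in range(dim)])
    return pooled


# Toy input: 6 frames of 2-dimensional features (hypothetical values).
frames = [[0.0, 1.0], [2.0, 1.0], [4.0, 3.0], [4.0, 5.0], [6.0, 5.0], [8.0, 9.0]]

# Hypothetical alignments: 4 phonemes grouped into 2 words in 1 utterance.
phoneme_spans = [(0, 2), (2, 3), (3, 5), (5, 6)]
word_spans = [(0, 3), (3, 6)]
utterance_span = [(0, 6)]

phoneme_emb = mean_pool(frames, phoneme_spans)     # one vector per phoneme
word_emb = mean_pool(frames, word_spans)           # one vector per word
utterance_emb = mean_pool(frames, utterance_span)  # one vector for the utterance

print(len(phoneme_emb), len(word_emb), len(utterance_emb))
```

The number of embeddings per utterance shrinks from phoneme to word to utterance level. This loosely mirrors the trade-off described above: finer levels carry more detail about the target signal, but leave more values to be predicted from text.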


research
11/01/2022

Learning utterance-level representations through token-level acoustic latents prediction for Expressive Speech Synthesis

This paper proposes an Expressive Speech Synthesis model that utilizes t...
research
11/02/2022

Predicting phoneme-level prosody latents using AR and flow-based Prior Networks for expressive speech synthesis

A large part of the expressive speech synthesis literature focuses on le...
research
08/13/2021

Enhancing audio quality for expressive Neural Text-to-Speech

Artificial speech synthesis has made a great leap in terms of naturalnes...
research
11/19/2021

Word-Level Style Control for Expressive, Non-attentive Speech Synthesis

This paper presents an expressive speech synthesis architecture for mode...
research
02/16/2022

ProsoSpeech: Enhancing Prosody With Quantized Vector Pre-training in Text-to-Speech

Expressive text-to-speech (TTS) has become a hot research topic recently...
research
03/20/2018

Expressivity in TTS from Semantics and Pragmatics

In this paper we present ongoing work to produce an expressive TTS reade...
research
08/26/2019

Multi-Granularity Representations of Dialog

Neural models of dialog rely on generalized latent representations of la...
