Learning utterance-level representations through token-level acoustic latents prediction for Expressive Speech Synthesis

11/01/2022
by Karolos Nikitaras, et al.

This paper proposes an Expressive Speech Synthesis model that uses token-level latent prosodic variables to capture and control utterance-level attributes, such as character acting voice and speaking style. Existing works aim to explicitly factorize such fine-grained and utterance-level speech attributes into different representations extracted by modules operating at the corresponding level. We show that the fine-grained latent space also captures coarse-grained information, an effect that becomes more pronounced as the latent dimension grows to capture diverse prosodic representations. A trade-off therefore arises between the diversity of the token-level and utterance-level representations and their disentanglement. We alleviate this issue by first capturing rich speech attributes in a token-level latent space and then separately training a prior network that, given the input text, learns utterance-level representations in order to predict the phoneme-level posterior latents extracted in the previous step. Both qualitative and quantitative evaluations demonstrate the effectiveness of the proposed approach. Audio samples are available on our demo page.
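To make the two-stage idea in the abstract concrete, below is a minimal PyTorch sketch of the second stage: a prior network that summarizes the text encoding into a single utterance-level vector and uses it, together with the per-phoneme text encodings, to predict the phoneme-level posterior latents produced by the (already trained) token-level model. All module names, dimensions, and the mean-squared-error objective are illustrative assumptions, not the authors' exact architecture or loss.

```python
# Hypothetical sketch of the separately trained prior network (stage 2).
# Stage 1 is assumed to have produced phoneme-level posterior latents,
# used here as fixed regression targets.
import torch
import torch.nn as nn


class UtterancePriorNetwork(nn.Module):
    """Predicts phoneme-level latents from text encodings by first
    summarizing the utterance into one coarse-grained vector."""

    def __init__(self, text_dim=256, utt_dim=64, latent_dim=16):
        super().__init__()
        # Utterance-level branch: pool phoneme encodings into one vector.
        self.utt_proj = nn.Sequential(nn.Linear(text_dim, utt_dim), nn.Tanh())
        # Token-level branch: predict a latent per phoneme, conditioned on
        # the local text encoding and the utterance-level summary.
        self.latent_predictor = nn.GRU(
            text_dim + utt_dim, 128, batch_first=True, bidirectional=True
        )
        self.out = nn.Linear(2 * 128, latent_dim)

    def forward(self, text_enc):
        # text_enc: (batch, n_phonemes, text_dim) from a text encoder.
        utt_repr = self.utt_proj(text_enc.mean(dim=1))           # (batch, utt_dim)
        utt_tiled = utt_repr.unsqueeze(1).expand(-1, text_enc.size(1), -1)
        hidden, _ = self.latent_predictor(
            torch.cat([text_enc, utt_tiled], dim=-1)
        )
        return self.out(hidden), utt_repr                        # per-phoneme latents


# Illustrative training step with placeholder tensors in place of real
# text-encoder outputs and stage-1 posterior latents.
prior = UtterancePriorNetwork()
optimizer = torch.optim.Adam(prior.parameters(), lr=1e-4)

text_enc = torch.randn(8, 40, 256)          # placeholder text encodings
posterior_latents = torch.randn(8, 40, 16)  # placeholder stage-1 latents

pred_latents, _ = prior(text_enc)
loss = nn.functional.mse_loss(pred_latents, posterior_latents)
loss.backward()
optimizer.step()
```

The design choice this sketch illustrates is the paper's central one: the coarse-grained (utterance-level) representation is learned only in the prior network, after the token-level latent space has been fixed, so increasing the token-level latent dimension does not have to compete with disentangling utterance-level attributes.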


