Time out of Mind: Generating Rate of Speech conditioned on emotion and speaker

01/29/2023
by Navjot Kaur, et al.

Voice synthesis has improved significantly over the past decade, resulting in highly intelligible voices. Further work has produced models that can generate variable speech, including conditional emotional expression. The problem, however, lies in the focus on phrase-level modifications and prosodic vocal features. Using the CREMA-D dataset, we trained a GAN conditioned on emotion to generate word lengths for a given input text. These word lengths are relative to neutral speech and can be provided, through Speech Synthesis Markup Language (SSML), to a text-to-speech (TTS) system to generate more expressive speech. Additionally, a generative model is trained using implicit maximum likelihood estimation (IMLE), and a comparative analysis with GANs is included. Compared to an out-of-box model, we achieved better performance on objective measures for neutral speech and better time alignment for happy speech. However, further investigation of subjective evaluation is required.
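As an illustration of how per-word lengths relative to neutral speech might be passed to a TTS system through SSML, the sketch below wraps each word in a `prosody` element. It is a minimal sketch and not the authors' implementation: the function name `words_to_ssml` is hypothetical, the example lengths are made up, and it assumes each relative length is a duration ratio (1.0 means the neutral duration), so the speaking rate is taken as its inverse.

```python
from xml.sax.saxutils import escape


def words_to_ssml(words, relative_lengths):
    """Wrap each word in a <prosody> tag whose rate reflects its relative length.

    Assumption: a relative length is a duration ratio against neutral speech,
    so 1.25 means the word should last 25% longer than in neutral speech.
    """
    if len(words) != len(relative_lengths):
        raise ValueError("need one relative length per word")
    parts = []
    for word, rel in zip(words, relative_lengths):
        if rel <= 0:
            raise ValueError("relative lengths must be positive")
        # A word that should last longer is spoken at a proportionally lower rate.
        rate_percent = round(100 / rel)
        parts.append(f'<prosody rate="{rate_percent}%">{escape(word)}</prosody>')
    return "<speak>" + " ".join(parts) + "</speak>"


# Made-up relative lengths for illustration only.
ssml = words_to_ssml(["I", "am", "happy"], [1.0, 0.8, 1.25])
print(ssml)
```

How a given TTS engine interprets percentage rates can differ, so the inverse mapping from length to rate would need to be checked against the target system.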


