Perception of prosodic variation for speech synthesis using an unsupervised discrete representation of F0

03/14/2020
by Zack Hodari, et al.

In English, prosody adds a broad range of information to segment sequences, from information structure (e.g. contrast) to stylistic variation (e.g. expression of emotion). However, when learning to control prosody in text-to-speech voices, it is not clear what exactly the control is modifying. Existing research on discrete representation learning for prosody has demonstrated high naturalness, but no analysis has been performed on what these representations capture, or whether they can generate meaningfully distinct variants of an utterance. We present a phrase-level variational autoencoder with a multi-modal prior, using the mode centres as "intonation codes". Our evaluation establishes which intonation codes are perceptually distinct, finding that the intonation codes from our multi-modal latent model were significantly more distinct than a baseline using k-means clustering. We carry out a follow-up qualitative study to determine what information the codes are carrying. Most commonly, listeners commented on the intonation codes having a statement or question style. However, many other affect-related styles were also reported, including: emotional, uncertain, surprised, sarcastic, passive aggressive, and upset.
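To make the baseline concrete: the k-means comparison point clusters phrase-level F0 contours and uses the cluster centres as discrete codes. The sketch below is illustrative only, not the paper's implementation; it uses synthetic rising ("question-like") and falling ("statement-like") contours and a minimal numpy-only k-means, with a deterministic initialisation chosen for reproducibility.

```python
import numpy as np

def kmeans_codes(X, k, n_iter=50):
    """Plain k-means over fixed-length F0 contours; the k cluster
    centres play the role of discrete "intonation codes"."""
    # deterministic init for this sketch: k evenly spaced samples
    centres = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # assign each contour to its nearest centre (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # move each centre to the mean of its assigned contours
        for j in range(k):
            if (labels == j).any():
                centres[j] = X[labels == j].mean(axis=0)
    return centres, labels

# Synthetic phrase-level F0 contours (Hz), 10 frames each:
# 20 rising (question-like) and 20 falling (statement-like) phrases.
rng = np.random.default_rng(1)
rising = 100 + np.linspace(0, 40, 10) + rng.normal(0, 2, (20, 10))
falling = 140 - np.linspace(0, 40, 10) + rng.normal(0, 2, (20, 10))
X = np.vstack([rising, falling])

centres, labels = kmeans_codes(X, k=2)
```

On this toy data the two recovered centres are a rising and a falling contour, mirroring the statement/question distinction listeners most often reported; the paper's multi-modal latent model is instead trained jointly with the synthesis model, which is what the evaluation credits for the more perceptually distinct codes.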



Related research:

11/01/2019: A Perceived Environment Design using a Multi-Modal Variational Autoencoder for learning Active-Sensing. "This contribution comprises the interplay between a multi-modal variatio..."

05/04/2023: High-fidelity Generalized Emotional Talking Face Generation with Multi-modal Emotion Space Learning. "Recently, emotional talking face generation has received considerable at..."

04/20/2022: Cross-stitched Multi-modal Encoders. "In this paper, we propose a novel architecture for multi-modal speech an..."

11/07/2021: Emotional Prosody Control for Speech Generation. "Machine-generated speech is characterized by its limited or unnatural em..."

08/15/2020: EigenEmo: Spectral Utterance Representation Using Dynamic Mode Decomposition for Speech Emotion Classification. "Human emotional speech is, by its very nature, a variant signal. This re..."

10/28/2019: Towards Unsupervised Speech Recognition and Synthesis with Quantized Speech Representation Learning. "In this paper we propose a Sequential Representation Quantization AutoEn..."

07/30/2018: Deep Encoder-Decoder Models for Unsupervised Learning of Controllable Speech Synthesis. "Generating versatile and appropriate synthetic speech requires control o..."
