Textless Speech Emotion Conversion using Decomposed and Discrete Representations

11/14/2021
by Felix Kreuk, et al.

Speech emotion conversion is the task of modifying the perceived emotion of a speech utterance while preserving its lexical content and speaker identity. In this study, we cast the problem of emotion conversion as a spoken language translation task. We decompose speech into discrete and disentangled learned representations, consisting of content units, F0, speaker, and emotion. First, we modify the speech content by translating the content units to a target emotion, and then predict the prosodic features based on these units. Finally, the speech waveform is generated by feeding the predicted representations into a neural vocoder. Such a paradigm allows us to go beyond spectral and parametric changes of the signal and to model non-verbal vocalizations, such as laughter insertion and yawning removal. We demonstrate objectively and subjectively that the proposed method is superior to the baselines in terms of perceived emotion and audio quality. We rigorously evaluate every component of this system and conclude with an extensive model analysis and ablation study to better highlight the architectural choices, strengths, and weaknesses of the proposed method. Samples and code will be publicly available at the following link: https://speechbot.github.io/emotion.
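
The data flow described in the abstract can be sketched roughly as follows. This is a toy illustration only, not the authors' implementation: the class `SpeechRepresentation`, the functions `convert_emotion`, `toy_translate`, and `toy_predict_f0`, the unit ID 99 standing for laughter, and the constant F0 value are all hypothetical stand-ins for the learned models. The neural vocoder step is omitted; it would consume the returned representation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class SpeechRepresentation:
    """Decomposed utterance: the four streams named in the abstract."""
    content_units: List[int]  # discrete content units
    f0: List[float]           # pitch values, one per unit (an assumption)
    speaker: str              # speaker identity label
    emotion: str              # emotion label


def convert_emotion(
    source: SpeechRepresentation,
    target_emotion: str,
    translate_units: Callable[[List[int], str, str], List[int]],
    predict_f0: Callable[[List[int], str], List[float]],
) -> SpeechRepresentation:
    """Translate content units to the target emotion, then predict prosody."""
    # Step 1: translate the content units from source to target emotion.
    units = translate_units(source.content_units, source.emotion, target_emotion)
    # Step 2: predict prosodic features (here only F0) from the new units.
    f0 = predict_f0(units, target_emotion)
    # Speaker identity is carried over unchanged.
    return SpeechRepresentation(units, f0, source.speaker, target_emotion)


LAUGHTER_UNIT = 99  # hypothetical unit ID for a non-verbal vocalization


def toy_translate(units: List[int], src: str, tgt: str) -> List[int]:
    """Stand-in for the learned translation model: prepend laughter."""
    return [LAUGHTER_UNIT] + units if tgt == "amused" else list(units)


def toy_predict_f0(units: List[int], emotion: str) -> List[float]:
    """Stand-in for the learned prosody predictor: a flat contour."""
    return [120.0] * len(units)


source = SpeechRepresentation([3, 7, 7, 12], [110.0] * 4, "spk1", "neutral")
converted = convert_emotion(source, "amused", toy_translate, toy_predict_f0)
```

Because translation may change the length of the unit sequence (for example, by inserting a laughter unit), the prosody is predicted after translation rather than copied from the source.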

research
06/02/2023

In-the-wild Speech Emotion Conversion Using Disentangled Self-Supervised Representations and Neural Vocoder-based Resynthesis

Speech emotion conversion aims to convert the expressed emotion of a spo...
research
09/14/2023

EMOCONV-DIFF: Diffusion-based Speech Emotion Conversion for Non-parallel and In-the-wild Data

Speech emotion conversion is the task of converting the expressed emotio...
research
11/03/2018

Nonparallel Emotional Speech Conversion

We propose a nonparallel data-driven emotional speech conversion method....
research
11/12/2022

A unified one-shot prosody and speaker conversion system with self-supervised discrete speech units

We present a unified system to realize one-shot voice conversion (VC) on...
research
12/19/2022

Speaking Style Conversion With Discrete Self-Supervised Units

Voice Conversion (VC) is the task of making a spoken utterance by one sp...
research
01/09/2021

Emotion transplantation through adaptation in HMM-based speech synthesis

This paper proposes an emotion transplantation method capable of modifyi...
research
04/23/2020

Unsupervised Speech Decomposition via Triple Information Bottleneck

Speech information can be roughly decomposed into four components: langu...
