EMOCONV-DIFF: Diffusion-based Speech Emotion Conversion for Non-parallel and In-the-wild Data

09/14/2023
by Navin Raj Prabhu, et al.

Speech emotion conversion is the task of converting the expressed emotion of a spoken utterance to a target emotion while preserving the lexical content and speaker identity. While most existing works in speech emotion conversion rely on acted-out datasets and parallel data samples, in this work we focus on the more challenging in-the-wild scenario and do not rely on parallel data. To this end, we propose EmoConv-Diff, a diffusion-based generative model for speech emotion conversion that is trained to reconstruct an input utterance while conditioning on its emotion. At inference, a target emotion embedding is used instead to convert the emotion of the input utterance to the given target emotion. Rather than performing emotion conversion on categorical representations, we represent emotions along a continuous arousal dimension, which also allows intensity control. We validate the proposed methodology on a large in-the-wild dataset, the MSP-Podcast v1.10. Our results show that the proposed diffusion model is capable of synthesizing speech with a controllable target emotion. Crucially, the proposed approach shows improved performance at the extreme values of arousal and thereby addresses a common challenge in the speech emotion conversion literature.
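The idea of steering a diffusion model with a continuous arousal value can be illustrated with a one-dimensional toy. This is only a sketch under strong assumptions and not the EmoConv-Diff architecture: the trained network is replaced by the closed-form score of a Gaussian whose mean equals the arousal value, the content and speaker conditioning are omitted, and every name and constant (`DATA_STD`, `NUM_STEPS`, the noise schedule, `sample_with_arousal`) is hypothetical. It shows only that changing the conditioning value passed to the reverse process shifts the generated feature, and that intermediate values give intermediate intensities.

```python
import math
import random

# Toy assumption: a clean 1-D "speech feature" is distributed as
# N(arousal, DATA_STD**2). This closed form stands in for a trained network.
DATA_STD = 0.05
NUM_STEPS = 200

# Linear noise schedule (hypothetical values chosen for the toy).
BETAS = [1e-4 + (0.05 - 1e-4) * i / (NUM_STEPS - 1) for i in range(NUM_STEPS)]
ALPHAS = [1.0 - beta for beta in BETAS]
ALPHA_BARS = []
_running = 1.0
for _alpha in ALPHAS:
    _running *= _alpha
    ALPHA_BARS.append(_running)


def conditional_score(x_t, step, arousal):
    """Score of the noised toy distribution, conditioned on an arousal value."""
    mean = math.sqrt(ALPHA_BARS[step]) * arousal
    var = ALPHA_BARS[step] * DATA_STD ** 2 + 1.0 - ALPHA_BARS[step]
    return -(x_t - mean) / var


def sample_with_arousal(target_arousal, rng):
    """Run the reverse (denoising) process while conditioning on the target."""
    x = rng.gauss(0.0, 1.0)  # start from pure noise
    for step in range(NUM_STEPS - 1, -1, -1):
        score = conditional_score(x, step, target_arousal)
        x = (x + BETAS[step] * score) / math.sqrt(ALPHAS[step])
        if step > 0:  # no noise is added on the final step
            x += math.sqrt(BETAS[step]) * rng.gauss(0.0, 1.0)
    return x


def mean_feature(target_arousal, num_samples=1000, seed=0):
    """Average generated feature over several samples for one target value."""
    rng = random.Random(seed)
    total = sum(sample_with_arousal(target_arousal, rng) for _ in range(num_samples))
    return total / num_samples


low_mean = mean_feature(0.1)
mid_mean = mean_feature(0.5)
high_mean = mean_feature(0.9)
print(f"target arousal 0.1 -> mean feature {low_mean:.3f}")
print(f"target arousal 0.5 -> mean feature {mid_mean:.3f}")
print(f"target arousal 0.9 -> mean feature {high_mean:.3f}")
```

In this toy the generated feature tracks whatever arousal value is supplied at sampling time, which is the sense in which a continuous conditioning variable gives intensity control. A real system would learn the score with a neural network and would also condition on the utterance's content and speaker.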


