Accurate Emotion Strength Assessment for Seen and Unseen Speech Based on Data-Driven Deep Learning

06/15/2022
by Rui Liu, et al.

Emotion classification of speech and assessment of emotion strength are required in applications such as emotional text-to-speech and voice conversion. An emotion attribute ranking function based on a Support Vector Machine (SVM) has previously been proposed to predict emotion strength for an emotional speech corpus. However, the trained ranking function does not generalize to new domains, which limits its scope of application, especially for out-of-domain or unseen speech. In this paper, we propose a data-driven deep learning model, StrengthNet, to improve the generalization of emotion strength assessment for both seen and unseen speech. This is achieved by fusing emotional data from various domains. We follow a multi-task learning network architecture that includes an acoustic encoder, a strength predictor, and an auxiliary emotion predictor. Experiments show that the emotion strength predicted by StrengthNet is highly correlated with ground truth scores for both seen and unseen speech. We release the source code at: https://github.com/ttslr/StrengthNet.
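
The multi-task setup described in the abstract (a strength predictor trained together with an auxiliary emotion predictor) can be illustrated with a minimal sketch of a combined training objective. The abstract does not state the loss functions or their weighting, so everything here is an assumption made for illustration: mean absolute error for the strength task, cross-entropy for the emotion task, and the hypothetical names `multi_task_loss` and `aux_weight`. It is not taken from the released StrengthNet code.

```python
import math
from typing import Sequence


def mean_absolute_error(predicted: Sequence[float], target: Sequence[float]) -> float:
    """Average absolute gap between predicted and reference strength scores."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(predicted)


def cross_entropy(logits: Sequence[float], label: int) -> float:
    """Negative log-probability of the true class, via a stable log-sum-exp."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[label]


def multi_task_loss(
    pred_strength: Sequence[float],
    true_strength: Sequence[float],
    emotion_logits: Sequence[Sequence[float]],
    emotion_labels: Sequence[int],
    aux_weight: float = 1.0,  # assumed weighting; not given in the abstract
) -> float:
    """Primary strength loss plus a weighted auxiliary emotion loss."""
    strength_loss = mean_absolute_error(pred_strength, true_strength)
    emotion_loss = sum(
        cross_entropy(logits, label)
        for logits, label in zip(emotion_logits, emotion_labels)
    ) / len(emotion_labels)
    return strength_loss + aux_weight * emotion_loss


# Two utterances, two emotion classes (values are made up for illustration).
loss = multi_task_loss(
    pred_strength=[0.2, 0.8],
    true_strength=[0.0, 1.0],
    emotion_logits=[[2.0, 0.5], [0.1, 1.7]],
    emotion_labels=[0, 1],
)
print(f"combined loss: {loss:.4f}")
```

Setting `aux_weight` to zero reduces the objective to the strength term alone, which is one way to see what the auxiliary emotion predictor contributes in a setup like this.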

research
10/07/2021

StrengthNet: Deep Learning-based Emotion Strength Assessment for Emotional Speech Synthesis

Recently, emotional speech synthesis has achieved remarkable performance...
research
10/28/2020

Seen and Unseen emotional style transfer for voice conversion with a new emotional speech dataset

Emotional voice conversion aims to transform emotional prosody in speech...
research
05/23/2023

ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech Synthesis with Diffusion and Style-based Models

Emotional Text-To-Speech (TTS) is an important task in the development o...
research
10/25/2022

Mixed Emotion Modelling for Emotional Voice Conversion

Emotional voice conversion (EVC) aims to convert the emotional state of ...
research
11/07/2021

Emotional Prosody Control for Speech Generation

Machine-generated speech is characterized by its limited or unnatural em...
research
11/30/2021

CycleTransGAN-EVC: A CycleGAN-based Emotional Voice Conversion Model with Transformer

In this study, we explore the transformer's ability to capture intra-rel...
research
06/04/2019

ShEMO -- A Large-Scale Validated Database for Persian Speech Emotion Detection

This paper introduces a large-scale, validated database for Persian call...
