Effect of Attention and Self-Supervised Speech Embeddings on Non-Semantic Speech Tasks

08/28/2023
by Payal Mohapatra, et al.

Understanding human emotion is pivotal to making conversational technology mainstream. We treat speech emotion understanding as a perception task, which is a more realistic setting: depending on context (language, demographics, etc.), different shares of listeners perceive the same speech segment as conveying different emotions, so the label is rarely unanimous. As part of the ACM Multimedia 2023 Computational Paralinguistics ChallengE (ComParE) in the EMotion Share track, we leverage its rich dataset of multilingual speakers and its multi-label regression target, the 'emotion share', or the perception of each emotion. We demonstrate that the training scheme of a foundation model dictates its effectiveness for tasks beyond speech recognition, especially non-semantic speech tasks such as emotion understanding. The task is complex because of the multilingual speakers, the variability in the target labels, and the inherent imbalance in the regression dataset. Our results show that HuBERT-Large with a light-weight self-attention-based sequence model provides a 4.6% improvement over the reported baseline.
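To make the idea of a light-weight self-attention sequence model over frame-level speech embeddings more concrete, here is a minimal NumPy sketch. It is not the authors' implementation. The single attention head, the mean pooling over time, the sigmoid output, the hidden size of 64, the nine output targets, and the random weights and random "embeddings" are all assumptions chosen for illustration; only the 1024-dimensional frame size matches HuBERT-Large.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def attention_weights(frames, params):
    """Scaled dot-product self-attention weights over time, shape (T, T)."""
    q = frames @ params["w_q"]  # (T, H) queries
    k = frames @ params["w_k"]  # (T, H) keys
    return softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)


def predict_emotion_share(frames, params):
    """Map (T, D) frame embeddings to K scores in [0, 1].

    Assumed design: one self-attention layer, mean pooling over time,
    then a linear head with a sigmoid so each target lies in [0, 1].
    """
    attn = attention_weights(frames, params)      # (T, T)
    context = attn @ (frames @ params["w_v"])     # (T, H) attended values
    pooled = context.mean(axis=0)                 # (H,) utterance summary
    logits = pooled @ params["w_out"] + params["b_out"]  # (K,)
    return 1.0 / (1.0 + np.exp(-logits))


# Hypothetical sizes: T frames, D = 1024 (HuBERT-Large frame size),
# H hidden units, K emotion targets (assumed to be 9 here).
T, D, H, K = 50, 1024, 64, 9
rng = np.random.default_rng(0)
params = {
    "w_q": rng.standard_normal((D, H)) * 0.02,
    "w_k": rng.standard_normal((D, H)) * 0.02,
    "w_v": rng.standard_normal((D, H)) * 0.02,
    "w_out": rng.standard_normal((H, K)) * 0.02,
    "b_out": np.zeros(K),
}

# Random stand-in for the embeddings a frozen speech encoder would produce.
frames = rng.standard_normal((T, D))
shares = predict_emotion_share(frames, params)
attn = attention_weights(frames, params)
```

In a real pipeline, `frames` would come from a pretrained encoder and `params` would be trained against the emotion-share targets with a regression loss; both steps are outside the scope of this sketch.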

