Whose Emotion Matters? Speaker Detection without Prior Knowledge

11/23/2022
by Hugo Carneiro, et al.

The task of emotion recognition in conversations (ERC) benefits from the availability of multiple modalities, as offered, for example, in the video-based MELD dataset. However, only a few research approaches use both the acoustic and visual information from the MELD videos. There are two reasons for this: first, the label-to-video alignments in MELD are noisy, making those videos an unreliable source of emotional speech data; second, conversations can involve several people in the same scene, which requires detecting the person who speaks the utterance. In this paper, we demonstrate that by using recent automatic speech recognition and active speaker detection models, we are able to realign the videos of MELD and capture the facial expressions of the uttering speakers in 96.92% of the utterances. Experiments with a self-supervised voice recognition model indicate that the realigned MELD videos more closely match the corresponding utterances offered in the dataset. Finally, we devise a model for emotion recognition in conversations trained on the facial and audio information of the realigned MELD videos, which outperforms state-of-the-art models for ERC based on vision alone. This indicates that active speaker detection is indeed effective for extracting facial expressions from the uttering speakers, and that faces provide more informative visual cues than the visual features state-of-the-art models have been using so far.
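The abstract describes a two-stage pipeline: ASR-based realignment of the MELD clips to their transcripts, followed by active speaker detection (ASD) to pick the face of the person uttering each line. Below is a minimal sketch of that idea, assuming an ASR model that returns word-level timestamps and an ASD model that returns per-track speaking scores. The dataclass, field names, and helper signatures are illustrative assumptions for this sketch, not the authors' implementation.

```python
# Hedged sketch of the realignment + active-speaker-detection pipeline.
# The Word dataclass, the fixed-length sliding window, and the score format
# expected by pick_speaking_face are assumptions made for illustration only.

from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Tuple


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the scene clip
    end: float


def realign_utterance(meld_transcript: str, asr_words: List[Word]) -> Tuple[float, float]:
    """Find the time span in the video whose ASR words best match the MELD transcript.

    Slides a fixed-length window (as many words as the MELD utterance has) over
    the ASR word sequence and keeps the window with the highest character-level
    similarity (difflib ratio). Returns (start, end) in seconds.
    """
    if not asr_words:
        raise ValueError("ASR produced no words for this clip")
    target = " ".join(meld_transcript.lower().split())
    n = len(target.split())
    best_score = -1.0
    best_span = (asr_words[0].start, asr_words[-1].end)  # fall back to the whole clip
    for i in range(max(1, len(asr_words) - n + 1)):
        window = asr_words[i:i + n]
        hypothesis = " ".join(w.text.lower() for w in window)
        score = SequenceMatcher(None, target, hypothesis).ratio()
        if score > best_score:
            best_score = score
            best_span = (window[0].start, window[-1].end)
    return best_span


def pick_speaking_face(face_tracks: List[object], asd_scores: List[List[float]]) -> object:
    """Return the face track whose mean active-speaker score is highest.

    `face_tracks` holds one sequence of face crops per person in the scene and
    `asd_scores` the corresponding frame-level speaking probabilities produced
    by an ASD model (e.g. a TalkNet-style network); both are assumed inputs here.
    """
    best = max(range(len(face_tracks)),
               key=lambda i: sum(asd_scores[i]) / len(asd_scores[i]))
    return face_tracks[best]
```

In practice, the ASR output for a whole scene would be computed once and reused for every utterance in that scene, and the selected face track would be cropped and paired with the utterance audio as input to the downstream ERC model.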


Related research

08/14/2023 · Integrating Emotion Recognition with Speech Recognition and Speaker Diarisation for Conversations
Although automatic emotion recognition (AER) has recently drawn signific...

09/16/2021 · Invertable Frowns: Video-to-Video Facial Emotion Translation
We present Wav2Lip-Emotion, a video-to-video translation architecture th...

11/07/2022 · Hi,KIA: A Speech Emotion Recognition Dataset for Wake-Up Words
Wake-up words (WUW) is a short sentence used to activate a speech recogn...

05/10/2022 · Best of Both Worlds: Multi-task Audio-Visual Automatic Speech Recognition and Active Speaker Detection
Under noisy conditions, automatic speech recognition (ASR) can greatly b...

11/17/2022 · Privacy against Real-Time Speech Emotion Detection via Acoustic Adversarial Evasion of Machine Learning
Emotional Surveillance is an emerging area with wide-reaching privacy co...

04/28/2020 · Exploring the Contextual Dynamics of Multimodal Emotion Recognition in Videos
Emotional expressions form a key part of user behavior on today's digita...

05/23/2019 · Speech2Face: Learning the Face Behind a Voice
How much can we infer about a person's looks from the way they speak? In...
