Enhancing the Prediction of Emotional Experience in Movies using Deep Neural Networks: The Significance of Audio and Language

Our paper focuses on using deep neural network models to accurately predict the range of human emotions experienced while watching movies. In this setup, three distinct input modalities strongly influence the experienced emotions: visual cues derived from RGB video frames, auditory components encompassing sounds, speech, and music, and linguistic elements comprising the actors' dialogue. Emotions are commonly described with a two-factor model consisting of valence (ranging from happy to sad) and arousal (the intensity of the emotion). A plethora of works have presented models aiming to predict valence and arousal from video content. However, none of these models incorporate all three modalities; language in particular is consistently omitted. In this study, we combine all three modalities and analyze the importance of each in predicting valence and arousal. We represent each input modality with pre-trained neural networks: for the visual input, we employ pre-trained convolutional neural networks that recognize scenes [1], objects [2], and actions [3,4]; for audio, we use SoundNet [5], a network designed for sound-related tasks; and for language, we extract linguistic features with Bidirectional Encoder Representations from Transformers (BERT) models [6]. We report results on the COGNIMUSE dataset [7], where our proposed model outperforms the current state-of-the-art approaches. Surprisingly, our findings reveal that language significantly influences the experienced arousal, while sound emerges as the primary determinant for predicting valence. In contrast, the visual modality has the least impact among all modalities in predicting emotions.
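
To make the described setup concrete, below is a minimal sketch (not the authors' code) of a late-fusion regression head that combines precomputed per-clip features from the three modalities and predicts valence and arousal. The feature dimensions, layer sizes, and variable names are illustrative assumptions; in the paper the inputs would come from the pre-trained scene/object/action CNNs, SoundNet, and BERT described above.

import torch
import torch.nn as nn

class MultimodalAffectModel(nn.Module):
    """Hypothetical late-fusion head: pretrained features in, (valence, arousal) out."""

    def __init__(self, visual_dim=2048, audio_dim=1024, text_dim=768, hidden_dim=256):
        super().__init__()
        # Each branch projects one modality's pretrained features
        # (e.g. scene/object/action CNNs, SoundNet, BERT) to a shared space.
        self.visual_proj = nn.Sequential(nn.Linear(visual_dim, hidden_dim), nn.ReLU())
        self.audio_proj = nn.Sequential(nn.Linear(audio_dim, hidden_dim), nn.ReLU())
        self.text_proj = nn.Sequential(nn.Linear(text_dim, hidden_dim), nn.ReLU())
        # Fused representation -> two continuous outputs: valence and arousal.
        self.head = nn.Sequential(
            nn.Linear(3 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 2),
        )

    def forward(self, visual_feat, audio_feat, text_feat):
        fused = torch.cat(
            [self.visual_proj(visual_feat),
             self.audio_proj(audio_feat),
             self.text_proj(text_feat)],
            dim=-1,
        )
        return self.head(fused)  # shape: (batch, 2) -> [valence, arousal]

# Usage with random stand-ins for the precomputed features (dimensions are assumptions).
model = MultimodalAffectModel()
v = torch.randn(4, 2048)   # e.g. pooled scene/object/action CNN features
a = torch.randn(4, 1024)   # e.g. SoundNet features
t = torch.randn(4, 768)    # e.g. BERT [CLS] embeddings of the dialogue
valence_arousal = model(v, a, t)
print(valence_arousal.shape)  # torch.Size([4, 2])

Dropping one of the three projected feature vectors from the fusion (and adjusting the head's input size) is one simple way to run the kind of per-modality ablation the abstract describes.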


