
Video Affective Effects Prediction with Multi-modal Fusion and Shot-Long Temporal Context

by   Jie Zhang, et al.
South China University of Technology International Student Union
Alibaba Group

Predicting the emotional impact of videos using machine learning is a challenging task, given the variety of modalities, the complicated temporal context of the video, and the time dependency of the emotional states. Feature extraction, multi-modal fusion and temporal context fusion are crucial stages for predicting the valence and arousal values of the emotional impact, but they have not yet been successfully exploited. In this paper, we propose a comprehensive framework with novel designs of modal structure and multi-modal fusion strategy. We select the most suitable modalities for the valence and arousal tasks respectively, and each modal feature is extracted using a modality-specific deep model pre-trained on a large generic dataset. Two time-scale structures, one intra-clip and the other inter-clip, are proposed to capture the temporal dependency of video content and emotional states. To combine the complementary information from multiple modalities, an effective and efficient residual-based progressive training strategy is proposed. Each modality is combined step by step into the multi-modal model, where it is responsible for completing the missing parts of the features. With all of these improvements, our proposed prediction framework outperforms the state of the art on the LIRIS-ACCEDE dataset by a large margin.
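The idea of adding modalities one at a time, with each new modality fitting what the earlier ones left unexplained, can be illustrated with a small sketch. This is not the paper's implementation: the deep networks are swapped for closed-form ridge regression, the "audio" and "visual" features are synthetic, and all function names (`fit_ridge`, `progressive_residual_fit`, `demo`) are hypothetical names chosen for illustration.

```python
import numpy as np


def fit_ridge(X, y, lam=1e-2):
    """Closed-form ridge regression (assumed stand-in for a deep model)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    A = Xb.T @ Xb + lam * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ y)


def predict_ridge(w, X):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ w


def progressive_residual_fit(modalities, y, lam=1e-2):
    """Fit one model per modality, in order.

    Each later model is trained on the residual left by the models
    before it, so it only has to explain what is still missing.
    """
    weights = []
    residual = y.astype(float).copy()
    for X in modalities:
        w = fit_ridge(X, residual, lam)
        residual = residual - predict_ridge(w, X)
        weights.append(w)
    return weights


def progressive_predict(weights, modalities):
    # The fused prediction is the sum of the per-modality contributions.
    return sum(predict_ridge(w, X) for w, X in zip(weights, modalities))


def demo(seed=0):
    """Synthetic example: the target depends on both modalities."""
    rng = np.random.default_rng(seed)
    n = 200
    audio = rng.normal(size=(n, 4))   # hypothetical audio features
    visual = rng.normal(size=(n, 6))  # hypothetical visual features
    y = (audio @ rng.normal(size=4)
         + visual @ rng.normal(size=6)
         + 0.01 * rng.normal(size=n))

    weights = progressive_residual_fit([audio, visual], y)
    mse_first = np.mean((y - progressive_predict(weights[:1], [audio])) ** 2)
    mse_both = np.mean((y - progressive_predict(weights, [audio, visual])) ** 2)
    return mse_first, mse_both
```

Calling `demo()` returns the training error after the first modality alone and after both; in this toy setup the second value is lower, because the second model only needs to account for the residual of the first.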



