Multi-Granularity Network with Modal Attention for Dense Affective Understanding

06/18/2021
by Baoming Yan, et al.

Video affective understanding, which aims to predict the expressions evoked by video content, is valuable for video creation and recommendation. The recent EEV challenge proposed a dense affective understanding task that requires frame-level affective prediction. In this paper, we propose a multi-granularity network with modal attention (MGN-MA), which employs multi-granularity features for a better description of the target frame. Specifically, the multi-granularity features are divided into frame-level, clip-level, and video-level features, which correspond to visually salient content, semantic context, and video theme information, respectively. A modal attention fusion module is then designed to fuse the multi-granularity features and emphasize the more affect-relevant modalities. Finally, the fused feature is fed into a Mixture of Experts (MoE) classifier to predict the expressions. With additional model-ensemble post-processing, the proposed method achieves a correlation score of 0.02292 in the EEV challenge.
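The pipeline the abstract describes, attention-weighted fusion of frame-, clip-, and video-level features followed by a Mixture of Experts classifier, can be sketched roughly as follows. All dimensions, weight matrices, and the number of expression classes here are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d = 8  # shared embedding size (assumption)

# Multi-granularity features for one target frame (stand-ins for
# real extracted embeddings):
frame_feat = rng.standard_normal(d)   # frame-level: visually salient content
clip_feat  = rng.standard_normal(d)   # clip-level: semantic context
video_feat = rng.standard_normal(d)   # video-level: theme information
modals = np.stack([frame_feat, clip_feat, video_feat])   # (3, d)

# Modal attention fusion: score each modality, softmax the scores into
# weights, and fuse by weighted sum so affect-relevant modalities dominate.
W_att = rng.standard_normal((d, 1))
att = softmax((modals @ W_att).ravel())   # (3,) weights summing to 1
fused = att @ modals                      # (d,) fused feature

# Mixture-of-Experts classifier: a softmax gate mixes per-expert
# sigmoid predictions over the expression classes.
n_experts, n_classes = 4, 15              # assumed sizes
W_gate = rng.standard_normal((d, n_experts))
W_exp  = rng.standard_normal((n_experts, d, n_classes))
gate = softmax(fused @ W_gate)                        # (n_experts,)
expert_logits = np.einsum('d,edc->ec', fused, W_exp)  # (n_experts, n_classes)
expert_probs = 1.0 / (1.0 + np.exp(-expert_logits))   # sigmoid
pred = gate @ expert_probs                            # (n_classes,) in (0, 1)
```

In a real system the attention scores and expert weights would be learned end-to-end; this sketch only shows the data flow from multi-granularity features to frame-level expression probabilities.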
