MM-Pyramid: Multimodal Pyramid Attentional Network for Audio-Visual Event Localization and Video Parsing

11/24/2021
by Jiashuo Yu, et al.

Recognizing and localizing events in videos is a fundamental task for video understanding. Since events may occur in both auditory and visual modalities, fine-grained multimodal perception is essential for complete scene comprehension. However, most previous works analyze videos from a holistic perspective and do not consider semantic information at multiple temporal scales, which makes it difficult to localize events of varying lengths. In this paper, we present a Multimodal Pyramid Attentional Network (MM-Pyramid) that captures and integrates multi-level temporal features for audio-visual event localization and audio-visual video parsing. Specifically, we first propose an attentive feature pyramid module, which captures temporal pyramid features via several stacked pyramid units, each of which consists of a fixed-size attention block and a dilated convolution block. We also design an adaptive semantic fusion module, which leverages a unit-level attention block and a selective fusion block to integrate pyramid features interactively. Extensive experiments on audio-visual event localization and weakly-supervised audio-visual video parsing tasks verify the effectiveness of our approach.
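
To make the pyramid-unit design described in the abstract concrete, below is a minimal PyTorch sketch of one possible unit that combines fixed-size (windowed) attention with a dilated temporal convolution. The class name `PyramidUnit`, the window size, and the specific layer choices are illustrative assumptions rather than the authors' released implementation, and the sketch simply stacks units sequentially; it omits the per-unit feature collection and adaptive semantic fusion described in the paper.

```python
import torch
import torch.nn as nn


class PyramidUnit(nn.Module):
    """Hypothetical pyramid unit: fixed-size (windowed) self-attention + dilated temporal convolution."""

    def __init__(self, dim: int, window: int = 8, dilation: int = 2, heads: int = 4):
        super().__init__()
        self.window = window
        # Fixed-size attention: self-attention restricted to non-overlapping windows of length `window`.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        # Dilated 1-D convolution enlarges the temporal receptive field of each unit.
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, dilation=dilation, padding=dilation)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim); `time` is assumed divisible by `window` for brevity.
        b, t, d = x.shape
        local = x.reshape(b * (t // self.window), self.window, d)
        attn_out, _ = self.attn(local, local, local)      # attention within each window
        x = self.norm1(x + attn_out.reshape(b, t, d))     # residual connection + norm
        conv_out = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return self.norm2(x + conv_out)                   # residual connection + norm


# Example: stack units with growing dilation to build one (audio or visual) feature pyramid.
pyramid = nn.Sequential(*[PyramidUnit(dim=128, dilation=2 ** i) for i in range(4)])
features = pyramid(torch.randn(2, 64, 128))  # (batch=2, T=64, dim=128) -> same shape
```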


Related research

04/07/2021
MPN: Multimodal Parallel Network for Audio-Visual Event Localization
Audio-visual event localization aims to localize an event that is both a...

06/01/2023
Revisit Weakly-Supervised Audio-Visual Video Parsing from the Language Perspective
We focus on the weakly-supervised audio-visual video parsing task (AVVP)...

05/30/2021
Rethinking the constraints of multimodal fusion: case study in Weakly-Supervised Audio-Visual Video Parsing
For multimodal tasks, a good feature extraction network should extract i...

04/05/2021
Can audio-visual integration strengthen robustness under multimodal attacks?
In this paper, we propose to make a systematic study on machines multise...

07/21/2020
Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video Parsing
In this paper, we introduce a new problem, named audio-visual video pars...

07/05/2023
Multimodal Imbalance-Aware Gradient Modulation for Weakly-supervised Audio-Visual Video Parsing
Weakly-supervised audio-visual video parsing (WS-AVVP) aims to localize ...

08/14/2020
Audio-Visual Event Localization via Recursive Fusion by Joint Co-Attention
The major challenge in audio-visual event localization task lies in how ...
