Audio-Visual Event Localization via Recursive Fusion by Joint Co-Attention

08/14/2020
by   Bin Duan, et al.
9

The major challenge in audio-visual event localization task lies in how to fuse information from multiple modalities effectively. Recent works have shown that attention mechanism is beneficial to the fusion process. In this paper, we propose a novel joint attention mechanism with multimodal fusion methods for audio-visual event localization. Particularly, we present a concise yet valid architecture that effectively learns representations from multiple modalities in a joint manner. Initially, visual features are combined with auditory features and then turned into joint representations. Next, we make use of the joint representations to attend to visual features and auditory features, respectively. With the help of this joint co-attention, new visual and auditory features are produced, and thus both features can enjoy the mutually improved benefits from each other. It is worth noting that the joint co-attention unit is recursive meaning that it can be performed multiple times for obtaining better joint representations progressively. Extensive experiments on the public AVE dataset have shown that the proposed method achieves significantly better results than the state-of-the-art methods.

READ FULL TEXT

page 1

page 7

page 8

research
03/23/2018

Audio-Visual Event Localization in Unconstrained Videos

In this paper, we introduce a novel problem of audio-visual event locali...
research
10/16/2021

Hybrid Mutimodal Fusion for Dimensional Emotion Recognition

In this paper, we extensively present our solutions for the MuSe-Stress ...
research
08/16/2023

SCANet: A Self- and Cross-Attention Network for Audio-Visual Speech Separation

The integration of different modalities, such as audio and visual inform...
research
04/07/2021

MPN: Multimodal Parallel Network for Audio-Visual Event Localization

Audio-visual event localization aims to localize an event that is both a...
research
05/12/2021

Operation-wise Attention Network for Tampering Localization Fusion

In this work, we present a deep learning-based approach for image tamper...
research
11/24/2021

MM-Pyramid: Multimodal Pyramid Attentional Network for Audio-Visual Event Localization and Video Parsing

Recognizing and localizing events in videos is a fundamental task for vi...
research
05/13/2021

Multi-target DoA Estimation with an Audio-visual Fusion Mechanism

Most of the prior studies in the spatial DoA domain focus on a single mo...

Please sign up or login with your details

Forgot password? Click here to reset