Audio-Visual Event Localization in Unconstrained Videos

03/23/2018
by Yapeng Tian, et al.

In this paper, we introduce a novel problem of audio-visual event localization in unconstrained videos. We define an audio-visual event as an event that is both visible and audible in a video segment. We collect an Audio-Visual Event (AVE) dataset to systematically investigate three temporal localization tasks: supervised and weakly-supervised audio-visual event localization, and cross-modality localization. We develop an audio-guided visual attention mechanism to explore audio-visual correlations, propose a dual multimodal residual network (DMRN) to fuse information over the two modalities, and introduce an audio-visual distance learning network to handle cross-modality localization. Our experiments support the following findings: joint modeling of auditory and visual modalities outperforms independent modeling; the learned attention can capture semantics of sounding objects; temporal alignment is important for audio-visual fusion; the proposed DMRN is effective in fusing audio-visual features; and strong correlations between the two modalities enable cross-modality localization.
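As a rough illustration of the audio-guided visual attention idea described in the abstract, the sketch below scores each spatial location of a visual feature map against a segment-level audio embedding and pools the map with the softmax-normalized weights. The module name, feature dimensions (128-d audio, 512-d visual over 49 regions), and the single-hidden-layer additive scoring function are assumptions chosen for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioGuidedVisualAttention(nn.Module):
    """Illustrative sketch of audio-guided visual attention.

    An audio segment embedding attends over spatial visual features;
    dimensions and the additive scoring function are assumptions,
    not the paper's exact architecture.
    """
    def __init__(self, audio_dim=128, visual_dim=512, hidden_dim=256):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)    # project audio embedding
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)  # project each spatial visual feature
        self.score = nn.Linear(hidden_dim, 1)                 # scalar attention score per location

    def forward(self, audio_feat, visual_feat):
        # audio_feat:  (batch, audio_dim)          -- one audio segment embedding
        # visual_feat: (batch, regions, visual_dim) -- e.g. 7x7 = 49 CNN spatial features
        a = self.audio_proj(audio_feat).unsqueeze(1)           # (batch, 1, hidden)
        v = self.visual_proj(visual_feat)                      # (batch, regions, hidden)
        scores = self.score(torch.tanh(a + v)).squeeze(-1)     # (batch, regions)
        weights = F.softmax(scores, dim=-1)                    # attention over spatial locations
        attended = torch.bmm(weights.unsqueeze(1), visual_feat).squeeze(1)  # (batch, visual_dim)
        return attended, weights

# Example usage with hypothetical feature shapes:
att = AudioGuidedVisualAttention()
audio = torch.randn(4, 128)       # 4 segments, 128-d audio features
visual = torch.randn(4, 49, 512)  # 4 segments, 49 regions of 512-d visual features
attended, weights = att(audio, visual)  # attended: (4, 512), weights: (4, 49)
```

The attention weights can be visualized over the spatial grid, which is how the abstract's claim that "the learned attention can capture semantics of sounding objects" would typically be inspected.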

Related research

04/07/2021
MPN: Multimodal Parallel Network for Audio-Visual Event Localization
Audio-visual event localization aims to localize an event that is both a...

02/20/2019
Dual-modality seq2seq network for audio-visual event localization
Audio-visual event localization requires one to identify the event which ...

08/26/2021
Multi-Modulation Network for Audio-Visual Event Localization
We study the problem of localizing audio-visual events that are both aud...

05/08/2022
Past and Future Motion Guided Network for Audio Visual Event Localization
In recent years, audio-visual event localization has attracted much atte...

08/14/2020
Audio-Visual Event Localization via Recursive Fusion by Joint Co-Attention
The major challenge in the audio-visual event localization task lies in how ...

10/19/2019
Coordinated Joint Multimodal Embeddings for Generalized Audio-Visual Zeroshot Classification and Retrieval of Videos
We present an audio-visual multimodal approach for the task of zeroshot ...

06/12/2021
Multi-level Attention Fusion Network for Audio-visual Event Recognition
Event classification is inherently sequential and multimodal. Therefore,...
