Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline

03/22/2023
by   Tiantian Geng, et al.

Existing work on audio-visual event localization (AVE) handles manually trimmed videos, each containing only a single event instance. This setting is unrealistic, because natural videos often contain numerous audio-visual events of different categories. To better suit real-life applications, this paper focuses on the task of dense-localizing audio-visual events, which aims to jointly localize and recognize all audio-visual events occurring in an untrimmed video. The problem is challenging because it requires fine-grained understanding of audio-visual scenes and context. To tackle it, we introduce the first Untrimmed Audio-Visual (UnAV-100) dataset, which contains 10K untrimmed videos with over 30K audio-visual events. Each video has 2.8 audio-visual events on average, and the events are usually related to each other and may co-occur, as in real-life scenes. Next, we formulate the task using a new learning-based framework that fully integrates the audio and visual modalities to localize audio-visual events of various lengths and capture dependencies between them in a single pass. Extensive experiments demonstrate the effectiveness of our method, as well as the importance of multi-scale cross-modal perception and dependency modeling for this task.
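
To make the task output concrete, here is a minimal sketch of how dense event annotations for an untrimmed video might be represented and queried. It is an illustration only, not the authors' data format or method: the `Event` schema, the helper names, the video ids, and the labels are all hypothetical, and the timestamps are made up rather than taken from UnAV-100.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Event:
    """One audio-visual event: a labelled time span in seconds (hypothetical schema)."""
    start: float
    end: float
    label: str


def co_occurring_pairs(events):
    """Return every pair of events whose time spans overlap."""
    return [
        (a, b)
        for a, b in combinations(events, 2)
        if a.start < b.end and b.start < a.end
    ]


def mean_events_per_video(videos):
    """Average number of annotated events per video."""
    return sum(len(events) for events in videos.values()) / len(videos)


# Made-up annotations for illustration only; not taken from UnAV-100.
videos = {
    "video_a": [
        Event(0.0, 12.5, "people talking"),
        Event(8.0, 20.0, "dog barking"),
        Event(30.0, 41.0, "car passing"),
    ],
    "video_b": [Event(2.0, 9.0, "guitar playing")],
}

overlaps = co_occurring_pairs(videos["video_a"])
average = mean_events_per_video(videos)
```

In this toy example only the first two events of `video_a` overlap in time, which is the kind of co-occurrence a dense localizer has to handle, unlike the one-event-per-clip setting of trimmed videos.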

research
08/26/2021

Multi-Modulation Network for Audio-Visual Event Localization

We study the problem of localizing audio-visual events that are both aud...
research
06/15/2023

Towards Long Form Audio-visual Video Understanding

We live in a world filled with never-ending streams of multimodal inform...
research
12/27/2017

Eventness: Object Detection on Spectrograms for Temporal Localization of Audio Events

In this paper, we introduce the concept of Eventness for audio event det...
research
05/29/2023

Multi-Scale Attention for Audio Question Answering

Audio question answering (AQA), acting as a widely used proxy task to ex...
research
12/27/2022

Audiovisual Database with 360 Video and Higher-Order Ambisonics Audio for Perception, Cognition, Behavior, and QoE Evaluation Research

Research into multi-modal perception, human cognition, behavior, and att...
research
04/30/2023

Deep Learning Based Multimodal with Two-phase Training Strategy for Daily Life Video Classification

In this paper, we present a deep learning based multimodal system for cl...
research
11/09/2020

Improved Soccer Action Spotting using both Audio and Video Streams

In this paper, we propose a study on multi-modal (audio and video) actio...
