Learning to Answer Questions in Dynamic Audio-Visual Scenarios

03/26/2022
by Guangyao Li, et al.

In this paper, we focus on the Audio-Visual Question Answering (AVQA) task, which aims to answer questions about different visual objects, sounds, and their associations in videos. The problem requires comprehensive multimodal understanding and spatio-temporal reasoning over audio-visual scenes. To benchmark this task and facilitate our study, we introduce MUSIC-AVQA, a large-scale dataset that contains more than 45K question-answer pairs covering 33 different question templates, which span different modalities and question types. We develop several baselines and introduce a spatio-temporal grounded audio-visual network for the AVQA problem. Our results demonstrate that AVQA benefits from multisensory perception and that our model outperforms recent A-, V-, and AVQA approaches. We believe that our dataset has the potential to serve as a testbed for evaluating and promoting progress in audio-visual scene understanding and spatio-temporal reasoning. Code and dataset: http://gewu-lab.github.io/MUSIC-AVQA/
