Progressive Spatio-temporal Perception for Audio-Visual Question Answering

08/10/2023
by   Guangyao Li, et al.

The Audio-Visual Question Answering (AVQA) task aims to answer questions about visual objects, sounds, and their associations in videos. Such naturally multi-modal videos are composed of rich and complex dynamic audio-visual components, most of which may be unrelated to the given question or may even interfere with answering it. Conversely, focusing only on the question-aware audio-visual content removes this interference and lets the model answer more efficiently. In this paper, we propose a Progressive Spatio-Temporal Perception Network (PSTP-Net), which contains three modules that progressively identify key spatio-temporal regions with respect to the question. Specifically, a temporal segment selection module first selects the audio-visual segments most relevant to the given question. Then, a spatial region selection module chooses the regions most relevant to the question from the selected temporal segments. To further refine the selected features, an audio-guided visual attention module perceives the association between audio and the selected spatial regions. Finally, the spatio-temporal features from these modules are integrated to answer the question. Extensive experimental results on the public MUSIC-AVQA and AVQA datasets support the effectiveness and efficiency of PSTP-Net. Code is available at: https://github.com/GeWu-Lab/PSTP-Net
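
The three-stage selection described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). It assumes that the question, each audio segment, and each visual region are already encoded as plain feature vectors of the same dimension, and it substitutes cosine similarity and top-k selection for the learned modules. The function name progressive_select and the parameters k_segments and k_regions are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm > 0 else 0.0


def mean_vector(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def top_k(scores, k):
    """Indices of the k highest scores, returned in their original order."""
    best = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(best)


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def progressive_select(question, audio_segs, visual_segs, k_segments, k_regions):
    """Hypothetical sketch of progressive spatio-temporal selection.

    question:    feature vector of the question
    audio_segs:  one audio feature vector per temporal segment
    visual_segs: per temporal segment, a list of region feature vectors
    """
    # Stage 1 (assumed scoring rule): keep the segments whose audio and
    # averaged visual features are most similar to the question.
    seg_scores = [
        cosine(question, audio) + cosine(question, mean_vector(regions))
        for audio, regions in zip(audio_segs, visual_segs)
    ]
    segments = top_k(seg_scores, k_segments)

    regions_kept = {}
    attended = []
    for t in segments:
        # Stage 2: within a kept segment, keep the regions most similar
        # to the question.
        region_scores = [cosine(question, r) for r in visual_segs[t]]
        kept = top_k(region_scores, k_regions)
        regions_kept[t] = kept

        # Stage 3: audio-guided attention over the kept regions, using
        # softmax-normalized audio/region dot products as weights.
        weights = softmax([dot(audio_segs[t], visual_segs[t][i]) for i in kept])
        attended.append([
            sum(w * visual_segs[t][i][d] for w, i in zip(weights, kept))
            for d in range(len(question))
        ])

    # Final step: fuse the per-segment features (here, a simple average).
    fused = mean_vector(attended)
    return {"segments": segments, "regions": regions_kept, "fused": fused}
```

Called with three segments of two regions each, the function returns the indices of the kept segments, the kept region indices per segment, and one fused feature vector that a downstream answer classifier could consume. The averaging used for fusion is only a placeholder for whatever integration the model actually learns.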

research
03/26/2022

Learning to Answer Questions in Dynamic Audio-Visual Scenarios

In this paper, we focus on the Audio-Visual Question Answering (AVQA) ta...
research
05/13/2020

Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA

Videos convey rich information. Dynamic spatio-temporal relationships be...
research
05/21/2023

Target-Aware Spatio-Temporal Reasoning via Answering Questions in Dynamics Audio-Visual Scenarios

Audio-visual question answering (AVQA) is a challenging task that requir...
research
05/03/2023

"Glitch in the Matrix!": A Large Scale Benchmark for Content Driven Audio-Visual Forgery Detection and Localization

Most deepfake detection methods focus on detecting spatial and/or spatio...
research
05/29/2023

Multi-Scale Attention for Audio Question Answering

Audio question answering (AQA), acting as a widely used proxy task to ex...
research
06/16/2020

Mucko: Multi-Layer Cross-Modal Knowledge Reasoning for Fact-based Visual Question Answering

Fact-based Visual Question Answering (FVQA) requires external knowledge ...
research
05/03/2022

Episodic Memory Question Answering

Egocentric augmented reality devices such as wearable glasses passively ...
