Target-Aware Spatio-Temporal Reasoning via Answering Questions in Dynamic Audio-Visual Scenarios

05/21/2023
by   Yuanyuan Jiang, et al.

Audio-visual question answering (AVQA) is a challenging task that requires multi-step spatio-temporal reasoning over multimodal contexts. To achieve scene-understanding ability similar to humans, the AVQA task poses specific challenges: effectively fusing audio and visual information, and capturing question-relevant audio-visual features while maintaining temporal synchronization. This paper proposes a Target-aware Joint Spatio-Temporal Grounding Network for AVQA to address these challenges. The proposed approach has two main components: a Target-aware Spatial Grounding module and a Joint Audio-Visual Temporal Grounding module trained with a Tri-modal consistency loss. The Target-aware module enables the model to focus on audio-visual cues relevant to the subject of the question by exploiting the explicit semantics of the text modality. The Tri-modal consistency loss facilitates the interaction between audio and video during question-aware temporal grounding and incorporates fusion within a simpler single-stream architecture. Experimental results on the MUSIC-AVQA dataset demonstrate the effectiveness and superiority of the proposed method over existing state-of-the-art methods. Our code will be available soon.
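The abstract gives no equations for the Tri-modal consistency loss, so the following is only a hypothetical sketch of how such a term could encourage agreement between question-conditioned temporal attention over the audio and visual streams. The function names, the dot-product attention, and the symmetric-KL penalty are all assumptions for illustration, not the authors' actual formulation:

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of raw scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def trimodal_consistency_loss(audio, visual, question, eps=1e-8):
    """Hypothetical consistency term (not from the paper).

    audio, visual: lists of T per-frame embeddings (each a list of floats).
    question: a single question embedding of the same dimension.
    The question scores each frame in both streams; a symmetric KL
    divergence then penalizes disagreement between the two resulting
    temporal attention distributions.
    """
    a_att = softmax([dot(frame, question) for frame in audio])
    v_att = softmax([dot(frame, question) for frame in visual])
    kl_av = sum(a * math.log((a + eps) / (v + eps)) for a, v in zip(a_att, v_att))
    kl_va = sum(v * math.log((v + eps) / (a + eps)) for a, v in zip(a_att, v_att))
    return 0.5 * (kl_av + kl_va)
```

Under this reading, the loss is zero when both modalities attend to the same time steps for a given question and grows as their temporal focus diverges, which matches the stated goal of keeping audio and video synchronized during question-aware temporal grounding.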

Related research:

03/26/2022 · Learning to Answer Questions in Dynamic Audio-Visual Scenarios
In this paper, we focus on the Audio-Visual Question Answering (AVQA) ta...

08/10/2023 · Progressive Spatio-temporal Perception for Audio-Visual Question Answering
Audio-Visual Question Answering (AVQA) task aims to answer questions abo...

03/30/2022 · TubeDETR: Spatio-Temporal Video Grounding with Transformers
We consider the problem of localizing a spatio-temporal tube in a video ...

03/29/2023 · What, when, and where? – Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions
Spatio-temporal grounding describes the task of localizing events in spa...

05/03/2023 · "Glitch in the Matrix!": A Large Scale Benchmark for Content Driven Audio-Visual Forgery Detection and Localization
Most deepfake detection methods focus on detecting spatial and/or spatio...

04/13/2022 · Do You Really Mean That? Content Driven Audio-Visual Deepfake Dataset and Multimodal Method for Temporal Forgery Localization
Due to its high societal impact, deepfake detection is getting active at...

05/29/2023 · Multi-Scale Attention for Audio Question Answering
Audio question answering (AQA), acting as a widely used proxy task to ex...
