Pano-AVQA: Grounded Audio-Visual Question Answering on 360^∘ Videos

10/11/2021
by   Heeseung Yun, et al.
0

360^∘ videos convey holistic views for the surroundings of a scene. It provides audio-visual cues beyond pre-determined normal field of views and displays distinctive spatial relations on a sphere. However, previous benchmark tasks for panoramic videos are still limited to evaluate the semantic understanding of audio-visual relationships or spherical spatial property in surroundings. We propose a novel benchmark named Pano-AVQA as a large-scale grounded audio-visual question answering dataset on panoramic videos. Using 5.4K 360^∘ video clips harvested online, we collect two types of novel question-answer pairs with bounding-box grounding: spherical spatial relation QAs and audio-visual relation QAs. We train several transformer-based models from Pano-AVQA, where the results suggest that our proposed spherical spatial embeddings and multimodal training objectives fairly contribute to a better semantic understanding of the panoramic surroundings on the dataset.

READ FULL TEXT

page 1

page 5

page 6

page 8

research
03/26/2022

Learning to Answer Questions in Dynamic Audio-Visual Scenarios

In this paper, we focus on the Audio-Visual Question Answering (AVQA) ta...
research
12/20/2021

ScanQA: 3D Question Answering for Spatial Scene Understanding

We propose a new 3D spatial understanding task of 3D Question Answering ...
research
03/06/2020

Wavelet-based spatial audio framework

Ambisonics is a complete theory for spatial audio whose building blocks ...
research
08/18/2023

Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models

With the advances in large scale vision-and-language models (VLMs) it is...
research
07/10/2023

Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos

We propose a self-supervised method for learning representations based o...
research
07/26/2022

Equivariant and Invariant Grounding for Video Question Answering

Video Question Answering (VideoQA) is the task of answering the natural ...
research
05/19/2023

Surgical-VQLA: Transformer with Gated Vision-Language Embedding for Visual Question Localized-Answering in Robotic Surgery

Despite the availability of computer-aided simulators and recorded video...

Please sign up or login with your details

Forgot password? Click here to reset