Pano-AVQA: Grounded Audio-Visual Question Answering on 360° Videos

by   Heeseung Yun, et al.
Seoul National University

360° videos convey holistic views of the surroundings of a scene. They provide audio-visual cues beyond a pre-determined normal field of view and display distinctive spatial relations on a sphere. However, previous benchmark tasks for panoramic videos are still limited in evaluating the semantic understanding of audio-visual relationships or the spherical spatial properties of the surroundings. We propose a novel benchmark named Pano-AVQA, a large-scale grounded audio-visual question answering dataset on panoramic videos. Using 5.4K 360° video clips harvested online, we collect two types of novel question-answer pairs with bounding-box grounding: spherical spatial relation QAs and audio-visual relation QAs. We train several transformer-based models on Pano-AVQA, and the results suggest that our proposed spherical spatial embeddings and multimodal training objectives contribute to a better semantic understanding of the panoramic surroundings on the dataset.




1 Introduction

Due to their capacity to capture entire surroundings without a restriction in the field of view, 360° videos have been gaining increasing popularity as a novel medium to record real-life scenery. As illustrated in the data example figure, unlike conventional normal field-of-view (NFoV) videos, 360° videos allow users to attend to any region of interest in the original real-life surroundings. As publicly available 360° videos surge on video-sharing platforms (e.g., YouTube) and their applications of omnidirectional perception quickly spread from autonomous vehicles [60, 5] and robotics [21, 35] to virtual and augmented reality [32, 46], visual understanding in 360° videos has warranted serious attention in computer vision research.

The wide field of view of 360° videos brings forth new challenges in visual understanding that are under-emphasized in NFoV video understanding, including spherical spatial reasoning and audio-visual reasoning. Since 360° videos are encoded in a spherical ambient space, spatial reasoning in 360° video, namely spherical spatial reasoning, requires a novel approach to recognizing various relations between the objects all around. Moreover, 360° videos contain more diverse visual sources of sound than conventional videos, which allows richer contextual audio-visual correspondences. Given that spatial attention for visual and auditory stimuli is inherent and even aligned in humans [45], capturing the link between visual and auditory signals from panoramic videos can be highly beneficial to real-life scene understanding.

These two main cornerstones of 360° video understanding, namely spherical spatial reasoning and audio-visual reasoning, have been actively addressed by previous works, including automatic cinematography [49], panoramic saliency detection [64, 7], and self-supervised spatial audio generation [37]. Nonetheless, no known task incorporates linguistic queries to tackle them in the 360° video domain. To this end, we propose spatial and audio-visual question answering on 360° videos as a novel benchmark task for 360° video understanding.

In this work, we introduce the Pano-AVQA dataset as a new video question answering dataset that necessitates fine-grained incorporation of the visual, audio, and language modalities on panoramic videos. We collect openly available 360° videos online and annotate them with (audio, video, relationship) descriptions; as a result, we contribute 20K spatial and 31.7K audio-visual question-answer pairs with bounding box grounding from 5.4K panoramic video clips.

Upon this dataset, we propose a transformer [53]-based spatial and audio-visual question answering framework. By attending to the context provided by other modalities throughout training, our model learns to fuse holistic information from the panoramic surroundings. For this, we suggest a quaternion-based coordinate representation for accurate spatial representation and an auxiliary task of audio skewness prediction that are broadly applicable to multichannel audio inputs.

We summarize our main contributions as follows.

  1. We propose novel benchmark tasks on spatial and audio-visual question answering on 360° videos towards a holistic semantic understanding of omnidirectional surroundings.

  2. Since, to the best of our knowledge, there is no existing dataset for this objective, we contribute Pano-AVQA as the first large-scale spatial and audio-visual question answering dataset on 360° videos, consisting of 51.7K question-answer pairs with bounding box grounding.

  3. We design an audio-visual question answering model for 360° videos that effectively fuses multimodal cues from the panoramic sphere. We compare this model with several baseline systems and evaluate them on the Pano-AVQA dataset.

Figure 1: The data collection pipeline of the Pano-AVQA dataset discussed in Sec. 3.

2 Related Works

Understanding of Panoramic Videos. A large body of literature on 360° video extends visual understanding of panoramic videos to many pragmatic applications, such as automatic cinematography [49], highlight detection [61], summarization [29], tracking [23] and visual saliency detection [64, 7].

However, most of the prior works concentrate on diverse visual cues present in the panoramic videos. Some of the recent works like narrative description for grounding viewpoints [9], spatial audios for audio augmentation [37] or object removal [44] focus on exploiting modalities other than visual cues. Unlike the prior works, we exploit language queries to evaluate the understanding of audio-visual signals in panoramic videos. We provide a large-scale annotated dataset about panoramic videos, which can be potentially beneficial for audio-visual grounding or scene graph generation in panoramic videos.

Multimodal Question Answering. Stemming from image VQA [2], video VQA has been extensively studied [51, 24, 59, 26, 30, 62, 17] towards understanding of visual-linguistic relations in various contexts such as movies [51], TV shows [30, 17], web GIFs [24] and animation clips [26]. Recently, there have been emerging works on answering questions grounded on sound modality, including Diagnostic Audio Question Answering [14], Open-Domain Spoken Question Answering [28] and Audio-Visual Scene-aware Dialog [1].

Closest to our work is AVSD [1], which utilizes both audio and video information in the clip to answer the questions sequentially. Although AVSD evaluates the conversation capability of models, when it comes to audio-visual relationships, AVSD mainly focuses on the existence of sound (e.g., Do you hear any noise in the background?). On the other hand, Pano-AVQA deals with fine-grained audio-visual relationships like grounding or spatial reasoning in 360° videos (e.g., What is on the opposite side of a loud honking?). Specifically, we deal with a variety of spatial relations in the panoramic sphere, which sheds new light upon spatial reasoning in videos.

Audio-Visual Scene Understanding. Leveraging both audio and video for scene understanding has been broadly researched in the signal processing domain. Early works on multimodal audio-visual learning focused on improving audio-visual speech recognition [38, 47]. Owing to paired videos with audio prevalent in various platforms, recent approaches utilize representation learning of unlabeled videos  [40, 3, 16, 66, 52, 22], which is beneficial for various downstream tasks like sound localization [35], audio spatialization [16], audio-visual source separation and co-segmentation [66, 65, 3, 40].

While these approaches showed some success in audio-visual scene understanding, they assume that the viewpoint already attends to a salient context. Some previous studies focus on audio-visual scene understanding on panoramic videos [52, 35], but they regard panoramic frames as normal ones, ignoring the non-negligible distortion present in panoramic video. In contrast, we tackle the alignment of audio and video without a pre-determined context, i.e., a normal field of view, thereby considering more context in the surroundings.

Dataset             Task                  C   # Clips   Length (hr)   Additional information
Pano2Vid [49]       NFoV cinematography   H   86        7.3           NFoV videos
Deep360Pilot [23]   Object tracking       H   91        1.71          Object track
Yu et al. [61]      Highlight detection   H   115       72            NFoV videos
Lee et al. [29]     Summarization         H   285       92.23         Photostream
Narrated360 [9]     NFoV grounding        H   864       3.98          -
YT-ALL [37]         Audio spatialization  H   1146      113.1         -
REC-STREET [37]     Audio spatialization  R   43        3.5           -
OAP [52]            Object prediction     R   165       15            -
Pano-AVQA           Question answering    H   5.4k      7.69          QA with grounding
Table 1: Comparison of Pano-AVQA with existing video datasets. Column C denotes collection procedure, where H indicates the dataset harvested online and R is the dataset recorded with a custom apparatus.

3 Pano-AVQA Dataset

The objective of the Pano-AVQA dataset is to provide a benchmark for fine-grained spatio-temporal and audio-visual question answering (QA) on panoramic videos. To achieve this goal, each question-answer pair should encapsulate audio signals as well as visual objects in the clip. Since no existing dataset can be used for this objective, we collect the data from scratch.

Fig. 1 illustrates our dataset collection pipeline. From the 360° videos collected online, we extract clips of about 5 seconds, for which we collect three types of annotations from human workers: (a) bounding boxes and sound grounding, (b) visual and sound descriptions, and (c) question-answer pairs. Please refer to the Appendix for the full description of dataset construction.

3.1 Task Definition

We introduce two new types of panoramic question answering tasks essential for panoramic scene understanding: (i) spherical spatial reasoning and (ii) audio-visual reasoning, where we design both tasks as open-ended questions. Please refer to the data example figure and the Appendix for QA pair examples.

Spherical spatial reasoning tackles QAs that require recognizing spatial relations between objects in 360° videos. Since 360° videos lack any principal orientation, we only ask about relative spatial relations. That is, we consider the spatial relation of a target object to a reference object. Each answer can be a name or an attribute (e.g., color, action, etc.) of the object, or one of the following spatial relations: left/right to, opposite of, above/below, or next to. One example template of this task is Where is [object1] in relation to [object2]? / [relation].

Audio-visual reasoning covers queries about identifying the object from sound and vice versa for a specific visual object and the sound the object is making. Possible answers include the object or sound themselves or their attributes like color or loudness. Two example templates of this task are Who/what is making [sound]? / [object]. or Which sound is [object] making? / [sound].

3.2 Data Collection

We collect 360° videos from YouTube using 58 keywords (e.g., sports, tour, indoor, cooking) to foster diversity in context. For consistency, we convert every video into an equirectangular format and discard videos with mono channel audio. For valid audio-visual QA pairs, the video must contain clear, discernible audio signals. Since raw video is often too long and contains uneventful content, we extract clips of interest spanning five seconds on average. We implement an automated extractor that reads the raw audio source and video frames and slices around audio peaks whose root mean square amplitudes are greater than those of surrounding segments by at least the standard deviation of the root mean square amplitude of the entire audio.
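The peak criterion above can be sketched as follows. This is a minimal illustration rather than the extractor used for the dataset: the segment length, the use of non-overlapping segments, the reading of "surrounding segments" as the two immediate neighbours, and the function names `segment_rms` and `find_audio_peaks` are all assumptions made here.

```python
import math
import statistics


def segment_rms(samples, segment_len):
    """Root mean square amplitude of consecutive, non-overlapping segments."""
    rms = []
    for start in range(0, len(samples) - segment_len + 1, segment_len):
        seg = samples[start:start + segment_len]
        rms.append(math.sqrt(sum(x * x for x in seg) / segment_len))
    return rms


def find_audio_peaks(rms):
    """Indices of segments whose RMS exceeds both neighbours by at least
    the standard deviation of the RMS values of the whole audio."""
    if len(rms) < 3:
        return []
    threshold = statistics.pstdev(rms)
    if threshold <= 0:
        # Perfectly flat audio has no peaks (assumption: skip such clips).
        return []
    peaks = []
    for i in range(1, len(rms) - 1):
        if rms[i] - max(rms[i - 1], rms[i + 1]) >= threshold:
            peaks.append(i)
    return peaks
```

A clip would then be sliced around each returned segment index; how wide that slice is remains a free parameter of the sketch.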

During extraction, we apply the following filters to ensure the quality of clips. First, we reduce the chance of including similar-sounding clips using the distance between the Mel-frequency coefficients of each candidate clip. Second, we discard clips containing synthetic or computer-generated frames by inspecting skewness in color histograms. Third, we filter out static clips; we compute the 64-bit DCT image hash of each frame using pHash and discard any clips with fewer than three hash values. Finally, using an off-the-shelf object detector [57], we remove clips with fewer than three salient objects. In addition to automatic filtering, we manually inspect the clips for any remaining invalidity, including occlusion, post-dubbing, and the existence of background music.

3.3 Data Annotations

It can be too cognitively burdensome even for humans to directly create a question-answer pair involving visual and audio features from videos. Therefore, we decompose the entire annotation pipeline into three subtasks to reduce complexity while obtaining fine-grained annotations: bounding box collection, visual / sound description, and question answer generation. The results of each subtask are validated before proceeding onto the next subtask.

Bounding Box Collection. First, we provide workers with a set of candidate bounding boxes and ask them to choose those that enclose objects that are making a sound. These objects should be either clearly identified as a sound source or humanly inferable despite occlusion (e.g., a man in a mask talking). To obtain candidate bounding boxes, we run Detectron 2 [57], pretrained on the ImageNet detection dataset, on the central frame of the clip. We pretrain the model from scratch using the ImageNet detection dataset, which includes many sound-making objects such as guitar and drum. To capture objects of different sizes with minimum distortion, we extract bounding boxes from both equirectangular and multiple NFoV projections. We then calibrate the coordinates of the bounding boxes from the perspective projections to the spherical coordinates. Given a coordinate in a perspective projection and the perspective itself, we use a straightforward yet effective strategy to obtain the corresponding spherical coordinate.
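One standard way to map a point on a perspective (NFoV) image back to the sphere is the inverse gnomonic projection. The sketch below uses it purely as an illustration and is not necessarily the calibration strategy used for the dataset. It assumes the perspective is described by the longitude and latitude of its viewing center (`lon0`, `lat0`, in radians) and that the image coordinate `(x, y)` is already normalized by the focal length, with `y` pointing up; these names and conventions are hypothetical.

```python
import math


def perspective_to_sphere(x, y, lon0, lat0):
    """Inverse gnomonic projection: tangent-plane point (x, y) of a view
    centered at (lon0, lat0) -> (longitude, latitude) in radians."""
    rho = math.hypot(x, y)
    if rho == 0.0:
        # The image center maps to the viewing center itself.
        return lon0, lat0
    c = math.atan(rho)  # angular distance from the viewing center
    sin_c, cos_c = math.sin(c), math.cos(c)
    sin_lat = cos_c * math.sin(lat0) + y * sin_c * math.cos(lat0) / rho
    lat = math.asin(max(-1.0, min(1.0, sin_lat)))
    lon = lon0 + math.atan2(
        x * sin_c,
        rho * math.cos(lat0) * cos_c - y * math.sin(lat0) * sin_c,
    )
    # Wrap the longitude to [-pi, pi).
    lon = (lon + math.pi) % (2.0 * math.pi) - math.pi
    return lon, lat
```

Applying this to the four corners (or the center and extent) of a box detected in an NFoV view gives coordinates that can be compared directly with boxes detected in the equirectangular frame.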

Figure 2: Illustrations of Pano-AVQA dataset statistics. (a) Distribution of first n-grams in questions. (b) Distribution of top-20 frequent answers. (c) Distribution of top-3 AudioSet [18] taggings. (d) Distribution of center points of bounding box groundings for answers.

Visual and Sound Description. Workers are asked to briefly describe (1) the appearances or actions of the annotated objects and (2) the sound they are making (if any). Writing a sound description is not as straightforward as writing a visual description. To assist workers with creating more graphic descriptions, we provide them with sound-describing words (e.g., shout, strum, bang, etc.) extracted from audio classification and captioning datasets [18, 13, 25]. We also ask workers to refrain from describing the sound via visual keywords (e.g., shout in male voice instead of man yelling next to a table) or via the content of the speech (e.g., woman explaining the history of the museum).

Question Answer Pairs. Given short descriptions of objects and sounds, we finally create the spherical spatial and audio-visual QA pairs. Following collection practices in existing video QA datasets [24, 63, 58], we combine manual and automated QA generation.

From the collection of object and sound descriptions for each video, we follow the templates discussed in Sec. 3.1 to generate QA pairs. To obtain spatial relations for the spherical spatial reasoning task, we use bounding box coordinates to manually designate the relations between the objects into one of the following categories: next to, opposite of, left/right to, and above/below.
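Because the relations live on a sphere, the horizontal offset between two boxes has to be wrapped around the seam of the panorama before it is compared. The sketch below shows one hypothetical rule set for turning two box centers into a relation label; the thresholds `near` and `far`, the convention that yaw increases to the right, and the function names are assumptions for illustration and do not reproduce the manual designation described above.

```python
def wrap_degrees(angle):
    """Wrap an angle in degrees to the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def spatial_relation(target, reference, near=30.0, far=150.0):
    """Relation of `target` to `reference`, each a (yaw, pitch) box center
    in degrees. Thresholds are illustrative, not taken from the dataset."""
    d_yaw = wrap_degrees(target[0] - reference[0])
    d_pitch = target[1] - reference[1]
    if abs(d_yaw) >= far:
        return "opposite of"
    if abs(d_pitch) > abs(d_yaw):
        return "above" if d_pitch > 0 else "below"
    if abs(d_yaw) <= near:
        return "next to"
    return "right to" if d_yaw > 0 else "left to"
```

For example, centers at yaw 170° and -170° are only 20° apart once wrapped, so they are labeled "next to" rather than being treated as 340° apart.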

One limitation of template-based generation is that the answer distribution may have a strong statistical bias toward some words in the question template, leaving the question answerable without taking the context into account. For example, the abundance of man/woman annotated with utterance-related sound descriptions might bring in the misconception that all visible people in the scene are speaking. To alleviate this problem, we generate additional QA pairs by replacing original descriptions with unrelated audio and visual descriptions, or by asking identical questions on counterexample clips, such as clips with non-speaking persons in this case.

Postprocessing. To ensure the grammatical correctness of the collected QA pairs, we use LanguageTool for proofreading. We also manually validate whether the question is answerable from the video, whether the bounding boxes are correct, and, for audio-visual QAs, whether the sound description is included in the QA pair in any form.

3.4 Data Analysis

Pano-AVQA consists of 51.7K QA pairs (42.8K training, 3.7K validation, 5.3K testing) from 5.4K clips extracted from 2.9K videos. There are in total 5.8K unique answers, with an average length of 3.7 words. The average question length is 12.1 words. Compared to other datasets on 360° videos in Table 1, Pano-AVQA contributes a large-scale and diverse dataset on 360° videos along with additional annotations relevant to the video clips, i.e., QA with groundings.

Among the QA pairs, 20K pairs belong to spherical spatial reasoning, and 31.7K pairs belong to audio-visual reasoning. We can easily notice the prevalence of questions with spatial relations (the words “next to, opposite of, left/right to, and above/below” are aggregated to [REL] token for visibility) or words relevant to audio-visual reasoning like source, origin, causing and producing from the sunburst diagram in Fig. 2(a).

Containing audio signals from diverse sources is crucial for audio-visual reasoning in real life. Fig. 2(c) shows the distribution of top-3 AudioSet [18] taggings obtained by running pretrained audio neural networks [27]. Although the human sound (e.g., speech, narration, etc.) tag is the most frequent due to the prevalence of vlogs in the video set, our dataset still contains a sizable number of other tags like vehicles, animals, and musical instruments. Moreover, human speech depends on factors like vocal tone, pace, and style. Our dataset reflects these different patterns by generating QA pairs from detailed descriptions of human speech like narration in loud tone, murmuring, etc.

Along with QA pairs, our dataset contains 51.7K objects annotated with bounding boxes that are the most relevant to answering the question, i.e., answer grounding. Fig. 2(d) illustrates the distribution of the center points of the bounding boxes. While the majority of the points are located near the equator, a considerable number of boxes are spread well away from the equator and even positioned near the poles. This distribution demonstrates that our dataset reflects various spherical spatial properties of 360° videos from a wide, holistic perspective.

4 Approach

To address the new problems of audio-visual question answering on panoramic videos, we present a model named LAViT (Language Audio-Visual Transformer), as illustrated in Fig. 3. It focuses on resolving two modeling challenges: (i) the feature representation of the video, audio, and language and (ii) the encoder-decoder structure that reconciles three different modalities. In summary, we tackle these issues by (i) extracting spherical spatial embeddings from a set of visual objects and audio events, and (ii) utilizing a transformer-based architecture as a multimodal encoder, inspired by its recent success in VQA research [50, 34, 31, 48, 6, 67].

Figure 3: Overview of the proposed architecture named LAViT (Language Audio-Visual Transformer).

4.1 Input Representations

Visual Representation. We first uniformly sample the video at 1 fps (i.e., about five panoramic frames) to reduce computational complexity while maintaining the temporal context in the video. As explained in Sec. 3.3, we use Faster R-CNN [41] trained with ImageNet Detection [42] to extract and represent region proposals. We apply it to both equirectangular and NFoV projections, which are complementary since the former shows key objects larger and more continuously, while the latter displays objects with less distortion. We apply non-maximum suppression using spherical coordinates with an IoU threshold to filter out overlapping proposals from the two different projections. If too many objects are detected, we keep only the top-35 proposals with the highest confidence. Finally, we obtain a set of object embeddings per video, one for each proposal.

Next, we convert the Cartesian coordinates of the region proposals into a rotation quaternion based spatial representation to reflect the spherical geometry. The representation consists of the time step in seconds, the rotation angle from the bottom of the sphere to the center of the object, a unit vector giving the position of the object center, and the width and height of the box. For the uniqueness of the axis of rotation, we only select the axis on a horizontal plane, i.e., the XY-plane, thereby omitting the z-axis from the rotation quaternion.
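The rotational part of such a representation can be sketched as below. The sketch assumes the object center is given as longitude and latitude in radians and that the bottom of the sphere is the point (0, 0, -1); the function name `rotation_quaternion` and the ordering of the returned components are hypothetical. Because the rotation axis is the cross product of (0, 0, -1) with the object direction, it always lies in the XY-plane, so the z-component of the quaternion is zero and is dropped.

```python
import math


def rotation_quaternion(lon, lat):
    """Quaternion (w, x, y) rotating the bottom of the unit sphere,
    (0, 0, -1), onto the object center at (lon, lat) in radians.
    The z-component is always zero and therefore omitted."""
    px = math.cos(lat) * math.cos(lon)
    py = math.cos(lat) * math.sin(lon)
    pz = math.sin(lat)
    # Rotation angle between (0, 0, -1) and the object center.
    omega = math.acos(max(-1.0, min(1.0, -pz)))
    # Rotation axis: (0, 0, -1) x (px, py, pz) = (py, -px, 0).
    ax, ay = py, -px
    norm = math.hypot(ax, ay)
    if norm < 1e-12:
        # At the poles the axis is undefined; pick an arbitrary horizontal axis.
        ax, ay = 1.0, 0.0
    else:
        ax, ay = ax / norm, ay / norm
    half = omega / 2.0
    return (math.cos(half), ax * math.sin(half), ay * math.sin(half))
```

In a full spatial embedding, these three values would be concatenated with the time step and the box width and height described in the text.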

Finally, we obtain the visual representations using linear FC layers. We average-pool these representations and use the result as a special visual symbol [CLS], similar to the [CLS] symbol in [12] or the <IMG> token in [34].

Audio Representation. We use stereo audio to reflect the spatial information of the surroundings [15]. As a feature extractor, we adopt a VGG-like CNN [27] trained with AudioSet [18]. We run the extractor on the audio signals of the left and right channels separately. Since segmenting the audio into equal lengths may result in mixing different events, we need a reasonable way to recognize when the audio event changes. Motivated by CTC [20], we regard audio segments with the same top-k classes as a single event. Therefore, we split the audio stream into multiple segments whose top-k labels are identical. For each audio event, we max-pool the corresponding audio features, thereby obtaining left channel audio embeddings and right channel audio embeddings, one per event.
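The event grouping can be sketched as follows for a single channel. This is a simplified illustration: it assumes the classifier output is available as a list of top-k label lists, one per fixed-length audio segment, that the order of labels within the top-k does not matter, and that features are plain lists of floats. The function name `pool_audio_events` is hypothetical.

```python
def pool_audio_events(top_k_labels, features):
    """Merge consecutive segments that share the same top-k label set into
    one event and max-pool their feature vectors element-wise."""
    events = []  # list of (label_set, pooled_feature)
    for labels, feat in zip(top_k_labels, features):
        key = frozenset(labels)
        if events and events[-1][0] == key:
            pooled = events[-1][1]
            events[-1] = (key, [max(a, b) for a, b in zip(pooled, feat)])
        else:
            events.append((key, list(feat)))
    return [pooled for _, pooled in events]
```

Running the same function on the left and right channel features with a shared segmentation yields one left and one right embedding per event.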

We finally obtain the audio representations using linear FC layers. A special audio symbol [CLS] is obtained by average-pooling the rest of the audio representations.

Language Representation. We use the WordPiece tokenizer [56] to split the questions into tokens and use pre-trained BERT [12] to extract language representations, which include a special language symbol [CLS].

4.2 Encoder

The encoder of our model consists of three unimodal encoders and one multimodal encoder as shown in Fig. 3(b).

Unimodal Encoder. To each of the language, audio, and visual input representations, we first apply layer normalization [4] and feed them into the corresponding unimodal encoder, for which we use the encoder module of Transformer [53]. We stack nine encoding layers for language and five layers each for the audio and visual modalities, as in [50]. The number of layers can be adjusted depending on computing resources or performance.

Multimodal Encoder. We utilize the encoding layers of Transformer for multimodal encoding as well, but with different attention inputs. To be specific, we use the primary modality as an attentional query (i.e., primary path) and another modality as an attentional key-value (i.e., context path) so that two different modalities can be fused in one encoding layer. We stack two encoding layers per modality to perform this with the other two modalities. Each Transformer encoding layer thus takes a unimodal encoder output as its primary input and another as its context, and the results form the multimodal encoder output.
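The primary/context attention pattern can be illustrated with a stripped-down, single-head scaled dot-product attention. The sketch below is not the LAViT encoding layer: it omits the learned query, key, and value projections, multiple heads, residual connections, and feed-forward sublayers, and it assumes both modalities share the same feature dimension. The function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(primary, context):
    """Queries come from `primary`; keys and values come from `context`.
    Both are lists of equal-length feature vectors."""
    dim = len(primary[0])
    fused = []
    for query in primary:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in context
        ]
        weights = softmax(scores)
        fused.append(
            [sum(w * value[j] for w, value in zip(weights, context))
             for j in range(dim)]
        )
    return fused
```

The output has one vector per primary token, each a weighted mixture of the context tokens, which is how one modality can be enriched with another within a single layer.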

Decoder. We obtain the average-pooled representations of the three modalities from the multimodal encoder outputs, which are used as the special symbols [CLS] of the respective modalities. We finally concatenate all three aggregated representations and feed them into two three-layered MLPs: one for predicting the answer label, for which we take the argmax over the output, and the other for answer grounding.

4.3 Training

Following the training practice of transformer-based architectures, we utilize pretraining and finetuning objectives to train the model. For pretraining, we randomly mask visual, audio, and language input representations with a probability of 0.15 and train the model with the following pretext tasks.

Language Pretraining Task. We use masked token prediction with cross-entropy loss as suggested in [12], by predicting the masked part of the language input.

Visual Pretraining Tasks. Instead of predicting the representation itself or its classification label, we add an MLP that predicts spherical spatial embedding from the masked visual representation with a smooth L1 loss.

Audio Pretraining Tasks. Designing a pretext task for audio representation is less straightforward than for visual ones. We thus propose spatial skewness prediction of the masked audio representations. Compared to phoneme classification or speaker classification generally adopted in audio transformers [10, 8], which may be limited to the utterance domain, our spatial skewness prediction can generally be applied to any media with multichannel audio and without any teacher model. We regard the stereo audio channel as a 3D audio with two silent channels and apply spherical harmonics decomposition to measure the spatial skewness of the given audio, i.e., from which direction the audio is coming. That is, from the truncated spherical harmonics decomposition of the audio, we extract the coefficient that reflects how much sound originates from a given position. We rescale the obtained skewness and train an MLP with a smooth L1 loss to predict the masked audio representation's skewness along with its timestamp (i.e., start time and duration).

QAs with Grounding. We use the Question-Answer pairs with grounding as a multimodal task for both pretraining and fine-tuning. We formulate the question-answering task as a classification problem where the model selects the best answer candidate over the 2020-D answer table, which covers approximately 93% of the questions. Specifically, we provide aggregated representations from multimodal encoder as input to an MLP to predict the answer and coordinate grounding, respectively. We train answer prediction with a cross-entropy loss and coordinate grounding with a smooth L1 loss.
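The two losses of this task can be written out as in the sketch below. It assumes the common smooth L1 form with a transition point of 1 (the source does not state this parameter), averages the grounding loss over coordinates, and combines the two terms with the grounding weight of 0.2 reported in the implementation details. All function names are hypothetical.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy of a single example given unnormalized answer scores."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def smooth_l1(pred, target, beta=1.0):
    """Mean smooth L1 loss; quadratic below `beta`, linear above it."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        total += 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta
    return total / len(pred)


def qa_grounding_loss(logits, label, pred_coords, gold_coords, grounding_weight=0.2):
    """Answer classification loss plus down-weighted grounding loss."""
    return cross_entropy(logits, label) + grounding_weight * smooth_l1(pred_coords, gold_coords)
```

At inference time the predicted answer would simply be the index of the largest logit over the answer table.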

Implementation Details. Except for input feature extraction, we train our model end-to-end with a batch size of 32, gradient accumulation of 4, and dropout with a rate of 0.1. We optimize with AdamW with an initial learning rate of 1e-4 for three epochs as pretraining and fine-tune the model for another seven epochs with a learning rate of 5e-5. In both stages, we aggregate the losses from all tasks with equal weights, except for the grounding task, whose weight is set to 0.2 to balance its influence against the question-answer task. We use a linear scheduler with a warmup rate of 0.1.
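A linear schedule with warmup is commonly implemented as a linear ramp from zero to the initial learning rate over the warmup fraction of training, followed by a linear decay. The sketch below follows that convention; decaying all the way to zero at the final step is an assumption, since the source only states the scheduler type and the warmup rate. The function name `linear_warmup_lr` is hypothetical.

```python
def linear_warmup_lr(step, total_steps, base_lr, warmup_ratio=0.1):
    """Learning rate at `step`: linear warmup for the first
    `warmup_ratio` of training, then linear decay to zero."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)
```

With the pretraining setting of 1e-4 and 100 total steps, the rate would rise to 1e-4 at step 10 and fall back to 5e-5 at step 55.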

5 Experiments

5.1 Experimental Setup

Baselines. To evaluate the proposed encoding strategies of different modalities, we compare with AVSD [1], BERT [12], SparseGraph [39] and LXMERT [50]. AVSD suggests a late fusion-based approach for audio-visual dialog, for which the pretrained BERT can be a better language backbone. SparseGraph and LXMERT are chosen as the representative models for image question answering. For a fair comparison, we use the same tokenizer and multimodal encoder (including audio) as in LAViT.

Different Spherical Spatial Representations. As claimed in [39], providing appropriate spatial embedding is paramount for good performance in visual question answering. To explore the effectiveness of using quaternion representation for spatial embedding in the spherical panorama, we experiment with a few other possible spatial representations: Cartesian coordinates, spherical coordinates, and normal 3D coordinates.

Evaluation Metrics. We measure the accuracy on the Pano-AVQA test split as the percentage of correctly answered questions. As mentioned in Sec. 4.3, the VQA task is formulated as a classification problem, i.e., selecting the best word from the dictionary vocabulary. For the answer grounding task that predicts bounding box coordinates, we use the mean squared error.
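Both metrics are simple to state in code. The sketch below assumes answers are compared by exact string match and that each grounding is a flat list of box coordinates; averaging the squared error over all coordinates of all examples is also an assumption, and the function names are hypothetical.

```python
def answer_accuracy(predictions, references):
    """Percentage of questions whose predicted answer matches the reference."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)


def grounding_mse(pred_boxes, gold_boxes):
    """Mean squared error over all coordinates of all predicted boxes."""
    squared, count = 0.0, 0
    for pred, gold in zip(pred_boxes, gold_boxes):
        for p, g in zip(pred, gold):
            squared += (p - g) ** 2
            count += 1
    return squared / count
```

Note that the scale of the grounding error depends on the coordinate representation, so values are only comparable between models that predict the same kind of coordinates.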

5.2 Results and Analyses

Comparison with VQA Models. Effective multimodal fusion is one of the paramount issues in correctly addressing the questions in the Pano-AVQA dataset. In Table 2, the sharp performance drop of AVSD and BERT compared to our model suggests that late fusion-based approaches are less adept at incorporating different modalities. Compared to SparseGraph [39] and LXMERT [50], which can effectively fuse visual and language modalities, our model performs 5.85 and 2 percentage points better, respectively.

Good performance of prior-based models may imply that the answer distribution is skewed toward a few popular answers. The accuracies of prior-based models on our dataset are 21.47 and 32.49, which are lower than those on VQA [2] (i.e., 29.66 and 37.54, respectively).
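Prior-based baselines of this kind ignore the video and audio entirely and answer from training-set statistics alone. A minimal sketch is given below, assuming each training example carries a question-type tag (how question types are defined is not specified here, so the tags in the usage are hypothetical) and that unseen types fall back to the globally most frequent answer.

```python
from collections import Counter, defaultdict


def fit_priors(train_pairs):
    """train_pairs: iterable of (question_type, answer) from the training set.
    Returns the most frequent answer overall and per question type."""
    overall = Counter()
    per_type = defaultdict(Counter)
    for qtype, answer in train_pairs:
        overall[answer] += 1
        per_type[qtype][answer] += 1
    global_prior = overall.most_common(1)[0][0]
    type_prior = {q: c.most_common(1)[0][0] for q, c in per_type.items()}
    return global_prior, type_prior


def predict_with_prior(qtype, global_prior, type_prior):
    """Answer with the per-type prior, falling back to the global prior."""
    return type_prior.get(qtype, global_prior)
```

A model that barely beats these baselines is likely exploiting answer frequency rather than the panoramic context, which is why their accuracy is reported alongside the learned models.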

Ablation. Our model without the unimodal encoders (LAViT) suffers a performance drop of 6.35 percentage points, which indicates the importance of loading a pretrained language model as well as maintaining the context of unimodal input. Opting out of either the visual or the audio input decreases performance by 2.5 and 1.76 percentage points, respectively, implying the importance of utilizing both modalities.

Influence of FoV Selection. The model trained with a single NFoV in 360° videos, which corresponds to a video captured with a conventional camera, is 2 percentage points lower than our model, indicating the importance of a wider field of view. Meanwhile, the performance of the ER-only model is lower than that of the model trained with dense NFoV, which is presumably due to overlooking smaller objects. Still, utilizing both ER and NFoVs as in Fig. 3(a) shows the best performance.

Model               Grounding MSE   SS Acc. (%)   AV Acc. (%)   All Acc. (%)
Prior (“yes”)       -               28.92         16.75         21.47
Q-Type Prior        -               36.30         32.42         32.49
AVSD [1]            -               29.40         20.10         24.60
BERT [12]           -               36.88         38.43         37.83
SparseGraph [39]    -               42.89         45.74         44.64
LXMERT [50]         -               47.48         49.12         48.48
LAViT               -               39.42         47.14         44.14
LAViT               -               46.90         48.68         47.99
LAViT               0.556           48.75         48.71         48.73
LAViT               -               47.14         49.37         48.50
LAViT               0.605           47.63         50.17         49.18
LAViT               0.593           47.68         51.13         49.79
LAViT (ours)        0.629           49.29         51.25         50.49
Table 2: Results on Pano-AVQA test split. SS denotes spherical spatial reasoning task and AV denotes audio-visual reasoning task.
Embeddings     Grounding MSE   SS Acc. (%)   AV Acc. (%)   All Acc. (%)
Cartesian      0.166           47.48         51.41         49.89
Spherical      3.496           48.95         51.01         50.21
Unit sphere    1.378           49.49         50.05         49.83
Quaternion     0.629           49.29         51.25         50.49
Table 3: Experimental results of different spherical spatial embeddings. The grounding errors of different representations are not comparable as they have different error scales.

Spherical Spatial Representations. Table 3 shows that the unit sphere and quaternion-based spatial embeddings perform better in the spherical spatial reasoning task, while the Cartesian coordinates work the worst. Although the Cartesian-based model has the lowest grounding error, this is mainly due to the error scale of Cartesian coordinates. Thus, the grounding errors between the spatial embeddings are not directly comparable. Fig. 4 displays different answer grounding proposals per geometry. In general, embeddings with spherical spatial information perform better than Cartesian-based proposals. Still, our quaternion-based approach displays notable localization ability compared to other proposals, especially in the examples from the second column. Please refer to the Appendix for more inference examples and visualization.

Figure 4: Qualitative examples of answer grounding from Table 3.

6 Conclusion

Our work extends existing work on panoramic video understanding by proposing video question answering as a novel task for evaluating the spherical spatial and audio-visual reasoning capacity of models over their surroundings. To this end, we introduced Pano-AVQA, a large-scale dataset consisting of 51.7K QA pairs with bounding boxes from 5.4K panoramic videos. We also designed LAViT, a new audio-visual QA transformer framework that extends cross-modal attention to leverage three modalities.

Moving forward, future work could incorporate audio-visual scene graphs as an additional annotation for better reasoning in videos. Another promising direction is to use our dataset to address Embodied Question Answering (EQA) [11, 19, 43, 36] and language-guided embodied navigation [54, 55] in a simulated 3D interactive environment.

Acknowledgement. We thank the anonymous reviewers for their thoughtful suggestions on this work. This work was supported by AIRS Company in Hyundai Motor Company & Kia Corporation through HKMC-SNU AI Consortium Fund, Brain Research Program by National Research Foundation of Korea (NRF) (2017M3C7A1047860) and Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No.2019-0-01082, SW StarLab). Gunhee Kim is the corresponding author.


  • [1] H. Alamri, V. Cartillier, A. Das, J. Wang, A. Cherian, I. Essa, D. Batra, T. K. Marks, C. Hori, P. Anderson, et al. (2019) Audio Visual Scene-Aware Dialog. In CVPR, Cited by: §2, §2, §5.1, Table 2.
  • [2] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh (2015) VQA: Visual Question Answering. In ICCV, Cited by: §2, §5.2.
  • [3] R. Arandjelovic and A. Zisserman (2018) Objects that Sound. In ECCV, Cited by: §2.
  • [4] J. L. Ba, J. R. Kiros, and G. E. Hinton (2016) Layer Normalization. In NIPS Deep Learning Symposium, Cited by: §4.2.
  • [5] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom (2020) nuScenes: A Multimodal Dataset for Autonomous Driving. In CVPR, Cited by: §1.
  • [6] Y. Chen, L. Li, L. Yu, A. El Kholy, F. Ahmed, Z. Gan, Y. Cheng, and J. Liu (2020) UNITER: UNiversal Image-TExt Representation Learning. In ECCV, Cited by: §4.
  • [7] H. Cheng, C. Chao, J. Dong, H. Wen, T. Liu, and M. Sun (2018) Cube Padding for Weakly-Supervised Saliency Prediction in 360 Videos. In CVPR, Cited by: §1, §2.
  • [8] P. Chi, P. Chung, T. Wu, C. Hsieh, S. Li, and H. Lee (2021) Audio ALBERT: A Lite BERT for Self-supervised Learning of Audio Representation. In IEEE SLT, Cited by: §4.3.
  • [9] S. Chou, Y. Chen, K. Zeng, H. Hu, J. Fu, and M. Sun (2018) Self-View Grounding Given a Narrated 360 Video. In AAAI, Cited by: Table 1, §2.
  • [10] Y. Chuang, C. Liu, H. Lee, and L. Lee (2020) SpeechBERT: An Audio-and-text Jointly Learned Language Model for End-to-end Spoken Question Answering. In INTERSPEECH, Cited by: §4.3.
  • [11] A. Das, S. Datta, G. Gkioxari, S. Lee, D. Parikh, and D. Batra (2018) Embodied Question Answering. In CVPR, Cited by: §6.
  • [12] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL, Cited by: §4.1, §4.1, §4.3, §5.1, Table 2.
  • [13] K. Drossos, S. Lipping, and T. Virtanen (2020) Clotho: an Audio Captioning Dataset. In ICASSP, Cited by: §3.3.
  • [14] H. M. Fayek and J. Johnson (2019) Temporal Reasoning via Audio Question Answering. arXiv:1911.09655. Cited by: §2.
  • [15] C. Gan, H. Zhao, P. Chen, D. Cox, and A. Torralba (2019) Self-Supervised Moving Vehicle Tracking with Stereo Sound. In ICCV, Cited by: §4.1.
  • [16] R. Gao and K. Grauman (2019) 2.5D Visual Sound. In CVPR, Cited by: §2.
  • [17] N. Garcia, M. Otani, C. Chu, and Y. Nakashima (2020) KnowIT VQA: Answering Knowledge-Based Questions about Videos. In AAAI, Cited by: §2.
  • [18] J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter (2017) Audio Set: An Ontology and Human-Labeled Dataset for Audio Events. In ICASSP, Cited by: Figure 2, §3.3, §3.4, §4.1.
  • [19] D. Gordon, A. Kembhavi, M. Rastegari, J. Redmon, D. Fox, and A. Farhadi (2018) IQA: Visual Question Answering in Interactive Environments. In CVPR, Cited by: §6.
  • [20] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber (2006) Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. In ICML, Cited by: §4.1.
  • [21] Y. Heshmat, B. Jones, X. Xiong, C. Neustaedter, A. Tang, B. E. Riecke, and L. Yang (2018) Geocaching with a Beam: Shared Outdoor Activities through a Telepresence Robot with 360 Degree Viewing. In CHI, Cited by: §1.
  • [22] D. Hu, F. Nie, and X. Li (2019) Deep Multimodal Clustering for Unsupervised Audiovisual Learning. In CVPR, Cited by: §2.
  • [23] H. Hu, Y. Lin, M. Liu, H. Cheng, Y. Chang, and M. Sun (2017) Deep 360 Pilot: Learning a Deep Agent for Piloting through Sports Video. In CVPR, Cited by: Table 1, §2.
  • [24] Y. Jang, Y. Song, Y. Yu, Y. Kim, and G. Kim (2017) TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering. In CVPR, Cited by: §2, §3.3.
  • [25] C. D. Kim, B. Kim, H. Lee, and G. Kim (2019) AudioCaps: Generating Captions for Audios in the Wild. In NAACL, Cited by: §3.3.
  • [26] K. Kim, M. Heo, S. Choi, and B. Zhang (2019) DeepStory: Video Story QA by Deep Embedded Memory Networks. In IJCAI, Cited by: §2.
  • [27] Q. Kong, Y. Cao, T. Iqbal, Y. Wang, W. Wang, and M. D. Plumbley (2020) PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM TASLP. Cited by: §3.4, §4.1.
  • [28] C. Lee, S. Wang, H. Chang, and H. Lee (2018) ODSQA: Open-Domain Spoken Question Answering Dataset. In IEEE SLT, Cited by: §2.
  • [29] S. Lee, J. Sung, Y. Yu, and G. Kim (2018) A Memory Network Approach for Story-Based Temporal Summarization of 360 Videos. In CVPR, Cited by: Table 1, §2.
  • [30] J. Lei, L. Yu, M. Bansal, and T. L. Berg (2018) TVQA: Localized, Compositional Video Question Answering. In EMNLP, Cited by: §2.
  • [31] L. H. Li, M. Yatskar, D. Yin, C. Hsieh, and K. Chang (2019) VisualBERT: A Simple and Performant Baseline for Vision and Language. arXiv:1908.03557. Cited by: §4.
  • [32] W. Lo, C. Fan, J. Lee, C. Huang, K. Chen, and C. Hsu (2017) 360 Video Viewing Dataset in Head-mounted Virtual Reality. In ACM MMSys, Cited by: §1.
  • [33] I. Loshchilov and F. Hutter (2018) Decoupled Weight Decay Regularization. In ICLR, Cited by: §4.3.
  • [34] J. Lu, D. Batra, D. Parikh, and S. Lee (2019) ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. In NeurIPS, Cited by: §4.1, §4.
  • [35] Y. Masuyama, Y. Bando, K. Yatabe, Y. Sasaki, M. Onishi, and Y. Oikawa (2020) Self-Supervised Neural Audio-Visual Sound Source Localization via Probabilistic Spatial Modeling. In IROS, Cited by: §1, §2, §2.
  • [36] P. Mirowski, M. K. Grimes, M. Malinowski, K. M. Hermann, K. Anderson, D. Teplyashin, K. Simonyan, K. Kavukcuoglu, A. Zisserman, and R. Hadsell (2018) Learning to Navigate in Cities Without a Map. In NeurIPS, Cited by: §6.
  • [37] P. Morgado, N. Vasconcelos, T. Langlois, and O. Wang (2018) Self-Supervised Generation of Spatial Audio for 360 Video. In NIPS, Cited by: §1, Table 1, §2.
  • [38] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng (2011) Multimodal Deep Learning. In ICML, Cited by: §2.
  • [39] W. Norcliffe-Brown, S. Vafeias, and S. Parisot (2018) Learning Conditioned Graph Structures for Interpretable Visual Question Answering. In NeurIPS, Cited by: §5.1, §5.1, §5.2, Table 2.
  • [40] A. Owens and A. A. Efros (2018) Audio-Visual Scene Analysis with Self-Supervised Multisensory Features. In ECCV, Cited by: §2.
  • [41] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In NeurIPS, Cited by: §4.1.
  • [42] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) ImageNet Large Scale Visual Recognition Challenge. IJCV. Cited by: §3.3, §4.1.
  • [43] M. Savva, A. Kadian, O. Maksymets, Y. Zhao, E. Wijmans, B. Jain, J. Straub, J. Liu, V. Koltun, J. Malik, et al. (2019) Habitat: A Platform for Embodied AI Research. In ICCV, Cited by: §6.
  • [44] R. Shimamura, Q. Feng, Y. Koyama, T. Nakatsuka, S. Fukayama, M. Hamasaki, M. Goto, and S. Morishima (2020) Audio–Visual Object Removal in 360-Degree Videos. The Visual Computer. Cited by: §2.
  • [45] D. V. Smith, B. Davis, K. Niu, E. W. Healy, L. Bonilha, J. Fridriksson, P. S. Morgan, and C. Rorden (2010) Spatial Attention Evokes Similar Activation Patterns for Visual and Auditory Stimuli. Journal of Cognitive Neuroscience. Cited by: §1.
  • [46] M. Speicher, J. Cao, A. Yu, H. Zhang, and M. Nebeling (2018) 360Anywhere: Mobile Ad-hoc Collaboration in any Environment using 360 Video and Augmented Reality. ACM HCI. Cited by: §1.
  • [47] N. Srivastava and R. R. Salakhutdinov (2012) Multimodal Learning with Deep Boltzmann Machines. In NeurIPS, Cited by: §2.
  • [48] W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, and J. Dai (2020) VL-BERT: Pre-training of Generic Visual-Linguistic Representations. In ICLR, Cited by: §4.
  • [49] Y. Su, D. Jayaraman, and K. Grauman (2016) Pano2Vid: Automatic Cinematography for Watching 360° Videos. In ACCV, Cited by: §1, Table 1, §2.
  • [50] H. Tan and M. Bansal (2019) LXMERT: Learning Cross-Modality Encoder Representations from Transformers. In EMNLP, Cited by: §4.2, §4, §5.1, §5.2, Table 2.
  • [51] M. Tapaswi, Y. Zhu, R. Stiefelhagen, A. Torralba, R. Urtasun, and S. Fidler (2016) MovieQA: Understanding Stories in Movies through Question-Answering. In CVPR, Cited by: §2.
  • [52] A. B. Vasudevan, D. Dai, and L. Van Gool (2020) Semantic Object Prediction and Spatial Sound Super-Resolution with Binaural Sounds. In ECCV, Cited by: Table 1, §2, §2.
  • [53] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is All You Need. In NeurIPS, Cited by: §1, §4.2.
  • [54] E. Wijmans, S. Datta, O. Maksymets, A. Das, G. Gkioxari, S. Lee, I. Essa, D. Parikh, and D. Batra (2019) Embodied Question Answering in Photorealistic Environments with Point Cloud Perception. In CVPR, Cited by: §6.
  • [55] Y. Wu, Y. Wu, G. Gkioxari, and Y. Tian (2018) Building Generalizable Agents with a Realistic and Rich 3D Environment. arXiv:1801.02209. Cited by: §6.
  • [56] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. (2016) Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv:1609.08144. Cited by: §4.1.
  • [57] Y. Wu, A. Kirillov, F. Massa, W. Lo, and R. Girshick (2019) Detectron2. Cited by: §3.2, §3.3.
  • [58] D. Xu, Z. Zhao, J. Xiao, F. Wu, H. Zhang, X. He, and Y. Zhuang (2017) Video Question Answering via Gradually Refined Attention over Appearance and Motion. In ACM MM, Cited by: §3.3.
  • [59] Y. Ye, Z. Zhao, Y. Li, L. Chen, J. Xiao, and Y. Zhuang (2017) Video Question Answering via Attribute-Augmented Attention Network Learning. In SIGIR, Cited by: §2.
  • [60] S. Yogamani, C. Hughes, J. Horgan, G. Sistu, P. Varley, D. O’Dea, M. Uricár, S. Milz, M. Simon, K. Amende, et al. (2019) WoodScape: A Multi-task, Multi-camera Fisheye Dataset for Autonomous Driving. In ICCV, Cited by: §1.
  • [61] Y. Yu, S. Lee, J. Na, J. Kang, and G. Kim (2018) A Deep Ranking Model for Spatio-Temporal Highlight Detection from a Video. In AAAI, Cited by: Table 1, §2.
  • [62] Z. Yu, D. Xu, J. Yu, T. Yu, Z. Zhao, Y. Zhuang, and D. Tao (2019) ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering. In AAAI, Cited by: §2.
  • [63] K. Zeng, T. Chen, C. Chuang, Y. Liao, J. C. Niebles, and M. Sun (2016) Leveraging Video Descriptions to Learn Video Question Answering. In ECCV, Cited by: §3.3.
  • [64] Z. Zhang, Y. Xu, J. Yu, and S. Gao (2018) Saliency Detection in 360° Videos. In ECCV, Cited by: §1, §2.
  • [65] H. Zhao, C. Gan, W. Ma, and A. Torralba (2019) The Sound of Motions. In ICCV, Cited by: §2.
  • [66] H. Zhao, C. Gan, A. Rouditchenko, C. Vondrick, J. McDermott, and A. Torralba (2018) The Sound of Pixels. In ECCV, Cited by: §2.
  • [67] L. Zhou, H. Palangi, L. Zhang, H. Hu, J. J. Corso, and J. Gao (2020) Unified Vision-Language Pre-Training for Image Captioning and VQA. In AAAI, Cited by: §4.