Visual Question Answering on 360° Images

01/10/2020 ∙ by Shih-Han Chou, et al.

In this work, we introduce VQA 360°, a novel task of visual question answering on 360° images. Unlike a normal field-of-view image, a 360° image captures the entire visual content around the optical center of a camera, demanding more sophisticated spatial understanding and reasoning. To address this problem, we collect the first VQA 360° dataset, containing around 17,000 real-world image-question-answer triplets for a variety of question types. We then study two different VQA models on VQA 360°, including one conventional model that takes an equirectangular image (with intrinsic distortion) as input and one dedicated model that first projects a 360° image onto cubemaps and subsequently aggregates the information from multiple spatial resolutions. We demonstrate that the cubemap-based model with multi-level fusion and attention diffusion performs favorably against other variants and the equirectangular-based models. Nevertheless, the gap between humans' and machines' performance reveals the need for more advanced VQA 360° algorithms. We, therefore, expect our dataset and studies to serve as the benchmark for future development in this challenging task. Dataset, code, and pre-trained models are available online.


1 Introduction

Visual question answering (VQA) has attracted significant attention recently across multiple research communities. In this task, a machine needs to visually perceive the environment, understand human language, and perform multimodal reasoning—all of which are essential components of modern AI systems. In the past three years alone, more than two dozen datasets have been published, covering a wide variety of scenes, language styles, and reasoning difficulties [2, 19, 21, 25, 37, 38, 52]. Alongside these datasets, over a hundred algorithms have been developed, consistently shrinking the gap between humans' and machines' performance [4, 18, 26, 27, 28].

Despite such an explosive effort, existing work is constrained in the way a machine visually perceives the world. Specifically, nearly all the datasets use normal field-of-view (NFOV) images taken by consumer cameras. Convolutional neural networks (CNNs) carefully designed for such images [23, 39] have been necessary to extract powerful visual features. Nevertheless, NFOV images are not the only way, and very likely not the most efficient way, for a machine to interact with the world. For example, considering a 360° horizontally surrounding scene, the NFOV of a consumer camera can only capture a limited portion of it [45]. Such a fact, together with the reduced price of 360° cameras (e.g., Ricoh Theta S, Samsung Gear 360, and GoPro Omni), has motivated researchers to explore 360° vision [9, 10, 24, 43]. We could imagine every robot being equipped with a 360° camera in the near future. It is thus desirable to extend VQA to such an informative visual domain.

In this work, we make the first attempt toward VQA on 360° images (VQA 360°). Two major challenges immediately emerge. First, modern deep learning algorithms are heavily data-consuming, yet so far, there is no publicly available dataset for VQA 360°. Second, 360° (i.e., equirectangular) images have intrinsic distortion and larger spatial coverage, requiring a novel way to process visual inputs and perform sophisticated spatial reasoning. Specifically, a machine needs to understand the spatial information in questions, search for answers across the entire 360° scene, and finally aggregate the information to answer.

To resolve the first challenge, we collect the first real VQA 360° dataset, using 360° images from real-world scenes. Our dataset contains about 17,000 image-question-answer triplets with human-annotated answers (see an example in Figure 1). We have carefully taken into account the bias issue [21, 26], from which many existing VQA datasets suffer, when designing our dataset. We thus expect our dataset to benefit the development of this novel task.

In addition, we study two models to address VQA 360°. On the one hand, we use equirectangular images as input, similar to conventional VQA models on NFOV images. On the other hand, to alleviate spatial distortion, we represent an input 360° image by six cubemaps [22]. Each map has its own spatial location and suffers less distortion (cf. Figure 2). We develop a multi-level attention mechanism with spatial indexing to aggregate information from each cubemap while performing reasoning. In this way, a machine can infer answers at multiple spatial resolutions and locations, effectively addressing the algorithmic challenge of VQA 360°. Moreover, the cubemap-based architecture can flexibly take existing (pre-trained) VQA models as backbone feature extractors on cubemaps, effectively fusing multimodal information and overcoming the limited-data issue.

We conduct extensive empirical studies to evaluate multiple variants of these models. The superior performance of the cubemap-based model demonstrates the need to explicitly consider the intrinsic properties of VQA 360°, both visually and semantically. By analyzing the gap between machines' and humans' performance, we further suggest future directions to improve algorithms for VQA 360°.

Our contributions in this work are two-fold:

  • We define a novel task named VQA 360°. We point out the intrinsic difficulties compared to VQA on NFOV images. We further collect the first real VQA 360° dataset, which is designed to include complicated questions specifically for 360° images.

  • We comprehensively evaluate two kinds of VQA models for VQA 360°, including one that can effectively handle spatial distortion while performing multi-level spatial reasoning. We then point out future directions for algorithm design for VQA 360°.

Figure 2: 360° image and cubemaps. An equirectangular 360° image can be represented by six cubemaps, each corresponding to a spatial location, to reduce spatial distortion.

2 Related Work

VQA models.

Visual question answering requires comprehending and reasoning with visual (image) and textual (question) information [51]. The mainstream model architecture first learns a joint image-question representation and then predicts the answer through multi-way classification. In the first stage, two mechanisms, visual attention [1, 48, 36] and multimodal fusion [18, 4], have been widely explored. For example, stacked attention networks (SANs) [49] were developed to perform multi-round attention for higher-level visual understanding. On the other hand, Fukui et al. [18] proposed Multimodal Compact Bilinear pooling (MCB) to learn a joint representation, and Ben-Younes et al. [4] developed a tensor-based Tucker decomposition to efficiently parameterize the bilinear interaction. Recently, several works [8, 34, 35, 42] extended BERT [15] by developing new pre-training tasks to learn (bidirectional) transformers [46] for joint image and text representations.

Despite the variety of architectures, most existing methods directly apply CNNs to the whole NFOV image to extract (local) features, which may not be suitable for 360° images. In this paper, we explore a different architecture that extracts CNN features from the cubemap representation of a 360° image and then fuses features across cubemaps. The cubemap-based model shares some similarity with [1, 49], yet we apply multiple rounds of attention at different spatial resolutions, one within and one across cubemaps, so as to achieve better spatial understanding.

VQA datasets.

There have been over two dozen VQA datasets on NFOV images published in recent years. Most of them aim for open-ended answering [2, 21, 32], providing one or multiple correct answers for each image-question pair [6, 52]. An alternative setting is multiple-choice answering: a set of candidate answers is provided for each question, of which one is correct. Our VQA 360° dataset belongs to the first category but focuses on a very different input domain, 360° images.

We note that there are two emerging VQA tasks, embodied QA [13] and interactive QA [20], that require a machine to interact with a 3D environment (e.g., turn right or move closer). Our dataset and task differ in two aspects. First, we work on real-world scenes, while both of them are on synthetic ones. Second, we take 360° images as input while they take NFOV images; a machine there has to take actions to explore the environment, which is less efficient.

360° vision.

With the growing popularity of virtual reality (VR) and augmented reality (AR), 360° images and videos have attracted increasing attention lately. One of the interesting problems is to automatically navigate a 360° video [24, 43, 45] or create a fast-forward summary [33]. Other research topics include 360° video stabilization [31], compression [44], saliency prediction [9], depth estimation [14], and object detection [11, 43]. Recently, Chou et al. [10] study visual grounding to localize objects in a 360° video for a given narrative, while Chen et al. [7] explore natural language navigation in 360° street environments. In contrast to these tasks, VQA on 360° images requires further inferring the answers according to questions, demanding more sophisticated reasoning about the scene.

Q type: Scene
Template: What room is depicted in the image?
Example: What room is depicted in the image?
Answer: bedroom/…

Q type: Exist
Template: Is/Are there (a) obj1 ___? (+ in the scene / + direc / + direc of the obj2 / + direc of the obj2 in the scene)
Examples: Is there a chair in the kitchen? / Is there a chair at my right side? / Is there a chair at the right side of the window? / Is there a chair at the right side of the window in the kitchen?
Answer: yes/no

Q type: Counting
Template: How many obj1 are ___? (+ in the scene / + direc / + direc of the obj2 / + direc of the obj2 in the scene)
Examples: How many chairs are in the kitchen? / How many chairs are at my right side? / How many chairs are at the right side of the window? / How many chairs are at the right side of the window in the kitchen?
Answer: 0/1/2/…

Q type: Property
Template: What is the (color) obj1 ___ made of? / What is the color of the obj1 ___? (+ in the scene / + direc / + direc of the obj2 / + direc of the obj2 in the scene)
Examples: What is the red sofa in the bedroom made of? / What is the red sofa at my right side made of? / What is the color of the sofa at the right of the window? / What is the color of the sofa at the right of the window in the bedroom?
Answers: plastic/wood/… ; red/brown/…

Q type: Spatial
Template: Where can I find the ___ obj1? / Which side of the ___ obj1 is the ___ obj2? (+ color / + material)
Examples: Where can I find the white flowers? / Which side of the white chair is the wooden door?
Answers: in front of you/… ; right side/…

Table 1: Question templates and examples. We design the above question templates and utilize the scene types and semantic segmentation of the images to automatically generate questions.

3 VQA 360° Dataset

We first present the proposed VQA 360° dataset to give a clear look at the task and its intrinsic challenges. We begin with the dataset construction, including image collection, question generation, and answer annotation. We then provide detailed statistics of our VQA 360° dataset.

3.1 Image Collection

We focus on indoor scenes as they are usually denser in content (e.g., objects), which makes them suitable for developing algorithms for sophisticated reasoning. In contrast, outdoor scenes, like those in [24, 33, 44, 45], capture certain (ego-centric) activities and have sparse content, which makes them more suitable for summarization or navigation.

We collect 360° images of indoor scenes from two publicly accessible datasets, Stanford 2D-3D [3] and Matterport3D [5]. Both datasets provide useful side information such as scene types and semantic segmentation, which benefits question generation. The images cover a variety of scenes, including common areas in houses (e.g., bathroom, kitchen, bedroom) and workplaces (e.g., office, conference room, auditorium). To maximize image diversity, we discard images captured in the same room but from different viewpoints. In total, we collect 1,490 images from the Stanford 2D-3D and Matterport3D datasets.

All the 360° images are stored in the equirectangular format and resized to a fixed resolution. The equirectangular projection maps the latitude and longitude of a sphere to horizontal and vertical lines (e.g., a point at the top of the sphere is mapped to a straight line in an equirectangular image), which inevitably introduces heavy spatial distortion.
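For concreteness, one common convention for this mapping is given below; the exact orientation and normalization are assumptions, since conventions differ across implementations.

```latex
% One assumed equirectangular convention (conventions vary):
% longitude \lambda \in [-\pi, \pi], latitude \phi \in [-\pi/2, \pi/2],
% image width W and height H (W = 2H).
u = \frac{\lambda + \pi}{2\pi}\, W, \qquad v = \frac{\pi/2 - \phi}{\pi}\, H
% Every point with \phi = \pm\pi/2 (a pole of the sphere) maps to an entire
% image row (v = 0 or v = H), i.e., a single point is stretched into a line.
```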

3.2 Question Generation

We design several question templates (cf. Table 1), together with the semantic segmentation and scene types associated with each 360° image (from which we obtain room types and the objects appearing in each scene), to automatically generate questions. Our templates contain five different types: "scene", "exist", "counting", "property" and "spatial". While imposing templates limits the diversity of questions, the main purpose of our dataset is to promote VQA on a new visual domain that has larger spatial coverage and complexity. As illustrated in Figure 1, a 360° image can easily contain multiple objects distributed at multiple locations. We thus specifically design the question templates—either including spatial specifications or asking for spatial reasoning—to disambiguate the questions and encourage machines to acquire better spatial understanding. For instance, to answer "What is the color of the vase at the right of pictures?" in Figure 1, a machine needs to first find the pictures (rightmost), look to the right to find the vase, and return the color (there are three vases in Figure 1; adding spatial specifications is thus necessary, and different specifications lead to different answers). To answer "Which side of the TV is the pictures?", a machine needs to detect the TV and the pictures, and then return their relative spatial information in the scene. Both examples require visual and spatial understanding at multiple resolutions and locations, which are scarce in existing VQA datasets on NFOV images (see the supplementary material for details). On average, we create about 11 questions per image.
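As a rough illustration of this template-filling pipeline, the sketch below instantiates the "exist" templates from a per-image annotation; the annotation fields and function names are hypothetical stand-ins, not the released generation code.

```python
import random

# Hypothetical per-image annotation derived from semantic segmentation and scene labels.
annotation = {
    "scene": "kitchen",
    "objects": [{"name": "chair", "direction": "right side", "anchor": "window"}],
}

EXIST_TEMPLATES = [
    "Is there a {obj} in the {scene}?",
    "Is there a {obj} at my {direc}?",
    "Is there a {obj} at the {direc} of the {anchor}?",
    "Is there a {obj} at the {direc} of the {anchor} in the {scene}?",
]

def generate_exist_questions(ann):
    """Instantiate every 'exist' template for each annotated object."""
    questions = []
    for obj in ann["objects"]:
        for template in EXIST_TEMPLATES:
            questions.append(template.format(
                obj=obj["name"], scene=ann["scene"],
                direc=obj["direction"], anchor=obj["anchor"]))
    return questions

print(random.choice(generate_exist_questions(annotation)))
```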

3.3 Answer Annotations & Question Refinements

We resort to human annotators to provide precise answers, asking in-house annotators to answer the questions in our dataset. To avoid synonymous words and to ease the process, we offer candidate answers according to the question types for annotators to select directly. Annotators can also type free-form answers if none of the candidates is applicable. We note that the automatically generated questions might be irrelevant to the image or lead to ambiguous answers (for instance, if there are two chairs with different colors, the question "What is the color of the chair?" is ambiguous). In such cases, we instruct the annotators to slightly modify the questions—e.g., by adding spatial specifications—to make them image-related or identifiable. We also instruct annotators to draw bounding boxes (for a subset of image-question pairs) that indicate the specific objects or locations associated with the answer. Such information facilitates the analysis of model performance.

Training Validation Test
images 743 148 599
QA pairs 8227 1756 6962
unique answers 51 51 53
Scene type Q 765 150 614
Counting type Q 1986 495 1934
Exist type Q 2015 417 1655
Property type Q 1355 322 1246
Spatial type Q 2106 372 1513
Table 2: Summary of the VQA 360° dataset. We summarize the number of images, QA pairs, and unique answers in each split of our dataset. We also provide detailed statistics for each question type.
Figure 3: Distribution of answers. We balance our dataset such that the answers of the same question type appear uniformly (e.g., “yes/no”, “0/1”, and “right side/left side”).

3.4 Dataset Statistics

Our VQA 360° dataset consists of 1,490 images and 16,945 question-answer pairs, which are split into the training, validation, and test sets with approximately 50%, 10%, and 40% of the images, respectively. We summarize the statistics in Table 2 and show the distribution of the most frequent answers in Figure 3. We note that each question type has multiple corresponding answers among the most frequent ones. Moreover, answers of the same type appear a similar number of times (e.g., "yes/no", "0/1", "right/left side"), preventing a machine from cheating by predicting the dominant answer. For question types with only a few unique answers, we make sure that these answers appear almost uniformly to minimize dataset bias.

4 VQA 360° Models

Figure 4: VQA 360° models. We propose a cubemap-based architecture that first extracts visual features from the cubemaps of the input 360° image and then performs bottom-up multi-level attention and feature aggregation.

In this section, we study two VQA models, including one dedicated to resolving the inherent challenges of VQA 360°.

Notations and problem definitions.

Given a question $q$ and an image $I$, a machine needs to generate the answer $a$. One common VQA pipeline first extracts visual features $v$ and question features $t$, followed by a multimodal representation $f = \mathcal{F}(v, t)$. The multimodal representation is then input into a classifier $\mathcal{C}$ of $K$ classes, corresponding to the top-$K$ frequent answers, to generate the answer $a$. Representative choices for the visual and question feature extractors are CNN and RNN models [49], respectively.
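The pipeline above can be summarized by the following minimal PyTorch sketch; the encoders, feature sizes, and the elementwise-product fusion are illustrative placeholders rather than the exact models used in the paper.

```python
import torch
import torch.nn as nn

class SimpleVQA(nn.Module):
    """Generic VQA pipeline: visual encoder, question encoder, fusion, K-way classifier."""
    def __init__(self, vocab_size, num_answers, d=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 300)
        self.question_rnn = nn.GRU(300, d, batch_first=True)   # question features t
        self.visual_proj = nn.Linear(2048, d)                   # visual features v (e.g., pooled CNN)
        self.classifier = nn.Linear(d, num_answers)             # classifier C over K answers

    def forward(self, image_feat, question_tokens):
        _, t = self.question_rnn(self.embed(question_tokens))   # (1, B, d)
        t = t.squeeze(0)
        v = self.visual_proj(image_feat)                         # (B, d)
        f = v * t                                                # simple multimodal fusion F(v, t)
        return self.classifier(f)                                # logits over answers

logits = SimpleVQA(vocab_size=10000, num_answers=51)(
    torch.randn(2, 2048), torch.randint(0, 10000, (2, 12)))
```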

4.1 Equirectangular-based Models

As the most common format to store and display a 360° image is the equirectangular projection onto a 2D array, we can indeed directly apply existing (pre-trained) VQA models to VQA 360°. We take the Multimodal Low-rank Bilinear Attention Network (MLB) model [28] as an example, which adopts an efficient bilinear interaction for the fusion $\mathcal{F}$. We first extract the visual features $v$ with a pre-trained ResNet-152 [23] and adopt Gated Recurrent Units (GRU) [12, 30] to extract the question features $t$. We then input the resulting $f$ into a fully-connected layer with $K$ output units to build a $K$-way classifier $\mathcal{C}$. We optimize the whole network using the training set of our VQA 360° dataset and set $K$ to be the number of unique training answers (i.e., 51).

The MLB model pre-trained on the VQA-1 [2] dataset requires the visual features to retain a 14×14 spatial resolution, equivalent to inputting a 448×448 image to the ResNet. We thus adopt a few strategies, including cropping or resizing the original 360° image, or inputting the original image while resizing the output ResNet features into a 14×14 spatial resolution with an average pooling layer. We analyze these strategies in Section 5.

Challenges.

While the above strategies allow us to exploit VQA models pre-trained on much larger NFOV datasets (e.g., VQA-1 [2]), applying CNNs directly on 360° images suffers from the inherent spatial distortion [43]. On the other hand, adopting specifically designed spherical convolutions [43] prevents us from leveraging existing models and pre-trained weights. An intermediate solution that takes both concerns into account is thus desirable.

Moreover, existing VQA models like MLB [28] and SAN [49] only consider a single visual resolution when performing feature aggregation in $\mathcal{F}$. For 360° images that cover a large spatial range, a more sophisticated mechanism that aggregates features at multiple resolutions is required. To this end, we propose a cubemap-based model to simultaneously tackle the above challenges.

4.2 Cubemap-based Models

To reduce spatial distortion, we first represent a 360° image by six non-overlapping cubemaps $\{I_1, \dots, I_6\}$ via the perspective projection (cf. Figure 2; see the supplementary material for details). Each cubemap corresponds to a specific portion of the 360° image with less distortion. Together, the cubemaps can recover the original image. This representation naturally leads to a bottom-up architecture that begins with local region understanding and then performs global reasoning (cf. Figure 4).

In the first stage, we can apply any existing VQA model, e.g., MLB [28], to each cubemap individually, resulting in local multimodal representations:

$f_i = \mathcal{F}(v_i, t), \quad i = 1, \dots, 6,$   (1)

where $v_i$ denotes the visual features of the $i$-th cubemap.

Bottom-up multi-level attention.

In the second stage, the main challenge is to effectively aggregate information from the cubemaps. While average and max pooling have been widely used, they simply ignore the location associated with each cubemap. We thus resort to the attention mechanism:

$h = \sum_{i=1}^{6} \alpha_i f_i,$   (2)

where $\alpha_i$ is the attention weight of the $i$-th cubemap. The attention weight can be computed according to information of each cubemap, including its location, making aggregation more flexible. As many existing VQA models already apply the attention mechanism within the input images [28, 49] (e.g., a cubemap in our case), the attention that aggregates across cubemaps is actually a second level of attention, at a coarser resolution.

We apply Tucker fusion [4] to compute the attention weights according to the cubemap feature $f_i$, location indicator $l_i$, and question feature $t$; Tucker fusion has been shown effective and efficient in fusing information from multiple modalities. The resulting attention weights are as follows,

$\alpha_i = \operatorname{softmax}_i\big(\operatorname{Tucker}([f_i; l_i], t)\big),$   (3)

where $[\cdot\,;\cdot]$ means concatenation. The softmax is performed over the six cubemaps. We use a one-hot vector $l_i \in \{0, 1\}^6$ to encode the cubemap location. In this way, the attention weights can zoom into the cubemap location mentioned in the question.

Attention diffusion.

The attention weights given by (3), however, do not explicitly consider spatial relationships across cubemaps. For a question like "Is there a chair at the right side of the window?", we would expect the model to first attend to the cubemap that contains the window, and then shift its attention to the cubemap at its right. To incorporate such a capability, we learn a diffusion matrix $D$ conditioned on the question feature $t$: the entry $D_{ij}$ indicates how much attention is shifted from cubemap $j$ to cubemap $i$. The resulting formula for $h$ in (2) becomes:

$h = \sum_{i=1}^{6} \hat{\alpha}_i f_i, \quad \text{where } \hat{\alpha}_i = \sum_{j=1}^{6} D_{ij}\, \alpha_j.$   (4)
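A minimal PyTorch sketch of equations (2)-(4) follows, with a bilinear layer standing in for the actual Tucker fusion module; the scorer, dimensions, and the question-conditioned diffusion parameterization are assumptions for illustration.

```python
import torch
import torch.nn as nn

class CubemapAttention(nn.Module):
    """Attend over six cubemap features with location indicators, then diffuse attention."""
    def __init__(self, d_f=512, d_q=512, n_cubes=6):
        super().__init__()
        self.n = n_cubes
        # Stand-in for Tucker fusion with a scalar output: score([f_i; l_i], t).
        self.scorer = nn.Bilinear(d_f + n_cubes, d_q, 1)
        # Question-conditioned diffusion matrix D(t) of size n x n.
        self.diffuser = nn.Linear(d_q, n_cubes * n_cubes)

    def forward(self, f, t):
        # f: (B, 6, d_f) local multimodal features; t: (B, d_q) question feature.
        B = f.size(0)
        l = torch.eye(self.n, device=f.device).expand(B, -1, -1)        # one-hot locations l_i
        fl = torch.cat([f, l], dim=-1)                                   # [f_i; l_i]
        scores = self.scorer(fl, t.unsqueeze(1).expand(-1, self.n, -1))  # (B, 6, 1)
        alpha = scores.squeeze(-1).softmax(dim=-1)                       # Eq. (3)
        D = self.diffuser(t).view(B, self.n, self.n).softmax(dim=-1)     # D(t), rows sum to 1
        alpha_hat = torch.bmm(D, alpha.unsqueeze(-1)).squeeze(-1)        # Eq. (4): shift attention
        h = (alpha_hat.unsqueeze(-1) * fl).sum(dim=1)                    # Eq. (2) with [f_i; l_i]
        return h, alpha_hat

h, a = CubemapAttention()(torch.randn(2, 6, 512), torch.randn(2, 512))
```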

Answer prediction.

The resulting feature $h$ in (4) or (2) then undergoes another Tucker fusion with the question feature $t$ to extract higher-level image-question interactions before being input into the classifier $\mathcal{C}$. We can also replace $f_i$ in (4) or (2) by the concatenation $[f_i; l_i]$ to incorporate location cues into $h$. This strategy is, however, meaningless for average or max pooling—it simply results in an all-one vector. We illustrate the overall model architecture in Figure 4. More details are included in the supplementary material.

5 Experimental Results

5.1 Setup

Variants of cubemap-based models.

The cubemap-based model can take any existing VQA model as the backbone. We choose the MLB model [28], a bilinear multimodal fusion and attention model. We experiment with other VQA backbones [4, 40] in the supplementary material to demonstrate the applicability of the cubemap-based models.

We remove the fully-connected layer of the original MLB model to extract multimodal features. We apply the pre-trained MLB model to each cubemap and consider the following three aggregation schemes before performing the final answer prediction.

  • Cubemap-Avgpool: apply average pooling over the local multimodal features $f_i$.

  • Tucker: attention weights by Tucker fusion in (3).

  • Tucker&Diffusion: attention weights by Tucker fusion followed by the diffusion in (4).

Variants of equirectangular-based models.

We consider four ways to apply MLB on the equirectangular images.

  • Central-crop: resize the shorter side of the image to 448 to preserve the aspect ratio and then center-crop a 448×448 region to extract ResNet features.

  • Resize: resize the image to 448×448 without any cropping and extract ResNet features.

  • ResNet-Avgpool: resize the shorter side of the image to 448 and apply an average pooling layer on the ResNet output to obtain 14×14 resolution features.

  • Direct-split: split an equirectangular image into patches, resize each to 448×448 and apply MLB, and then apply Tucker&Diffusion to aggregate information for predicting the answer.

Note that the Direct-split and Tucker&Diffusion models have the same architecture but different inputs.
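The sketch below illustrates these input-preparation strategies; the 448×448 input and 14×14 feature sizes follow the MLB convention assumed above, and the number of Direct-split patches is an assumption.

```python
import torch
import torch.nn.functional as F
from torchvision import transforms

SIZE = 448  # assumed backbone input size (ResNet-152 features of 14x14 for MLB)

central_crop = transforms.Compose([transforms.Resize(SIZE), transforms.CenterCrop(SIZE)])
resize_only = transforms.Resize((SIZE, SIZE))

def resnet_avgpool(feat_map, out_hw=14):
    """ResNet-Avgpool: pool a wide feature map (B, C, H, W) down to out_hw x out_hw."""
    return F.adaptive_avg_pool2d(feat_map, out_hw)

def direct_split(equi, n_patches=4):
    """Direct-split: cut the equirectangular image into horizontal patches (count assumed)."""
    chunks = torch.chunk(equi, n_patches, dim=-1)           # split along width
    return [F.interpolate(c, size=(SIZE, SIZE), mode="bilinear", align_corners=False)
            for c in chunks]

patches = direct_split(torch.randn(1, 3, 512, 1024))        # dummy equirectangular tensor
pooled = resnet_avgpool(torch.randn(1, 2048, 14, 28))        # dummy wide ResNet feature map
```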

Baselines.

We provide Q-type prior, a model that outputs the most frequent answer of each question type.
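This baseline amounts to a per-question-type majority vote over the training answers, roughly as sketched below (an illustration, not the authors' evaluation code).

```python
from collections import Counter, defaultdict

def fit_qtype_prior(train_samples):
    """train_samples: iterable of (question_type, answer) pairs from the training set."""
    counts = defaultdict(Counter)
    for q_type, answer in train_samples:
        counts[q_type][answer] += 1
    # Most frequent training answer per question type.
    return {q_type: c.most_common(1)[0][0] for q_type, c in counts.items()}

prior = fit_qtype_prior([("exist", "yes"), ("exist", "no"), ("exist", "yes"),
                         ("counting", "2"), ("counting", "1"), ("counting", "1")])
print(prior["exist"], prior["counting"])   # -> yes 1
```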

Model Variants Overall avg Avg by type Scene Exist Counting Property Spatial
Q-type prior - 33.50 31.71 25.41 55.47 33.56 21.99 22.14
Equirectangular-based Central-crop 53.39 54.07 60.66 75.00 47.10 50.16 37.45
Equirectangular-based Resize 54.21 55.77 68.46 75.66 47.31 51.48 35.96
Equirectangular-based ResNet-Avgpool 54.47 56.14 69.34 76.81 46.32 50.96 37.25
Equirectangular-based ResNet-Avgpool (from scratch) 54.15 55.55 67.48 77.17 46.17 49.04 37.90
Equirectangular-based Direct-split 54.77 56.59 71.36 75.75 46.68 49.56 39.62
Cubemap-based Cubemap-Avgpool 54.60 56.23 69.17 76.22 46.79 51.72 37.26
Cubemap-based Tucker 57.71 59.07 69.89 77.23 46.53 48.24 53.47
Cubemap-based Tucker&Diffusion 58.66 60.26 72.01 76.34 46.84 50.12 55.98
Cubemap-based Tucker&Diffusion (from scratch) 54.09 55.54 67.65 76.16 45.91 48.60 39.39
Table 3: Quantitative results on the VQA 360° test set. Rows marked "(from scratch)" are trained only on the VQA 360° training set without pre-training on VQA-1; the remaining models are initialized from VQA-1 pre-trained weights. The best result in each column is marked in bold.

Implementation details.

We first pre-train the backbone MLB model on the VQA-1 [2] dataset, which provides a large set of NFOV images and question-answer pairs for training. Then, we plug the pre-trained model into all the compared models and fine-tune them on our VQA 360° training set. We optimize our models with the ADAM [29] optimizer and select the model with the best performance on the validation set.

Evaluation metric.

We use top-1 accuracy for evaluation. We report two types of accuracy: i) the average accuracy over all questions, and ii) the average accuracy over question types (i.e., the mean of per-type accuracies).
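Concretely, the two averages can be computed as follows; the record field names are illustrative.

```python
from collections import defaultdict

def accuracies(records):
    """records: list of dicts with keys 'q_type', 'prediction', 'answer'."""
    correct_by_type, total_by_type = defaultdict(int), defaultdict(int)
    for r in records:
        total_by_type[r["q_type"]] += 1
        correct_by_type[r["q_type"]] += int(r["prediction"] == r["answer"])
    overall = sum(correct_by_type.values()) / sum(total_by_type.values())
    per_type = [correct_by_type[t] / total_by_type[t] for t in total_by_type]
    avg_by_type = sum(per_type) / len(per_type)
    return overall, avg_by_type
```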

5.2 Analysis and Discussions

Table 3 summarizes the results on the VQA 360° test set. The cubemap-based model with Tucker&Diffusion attention weights performs favorably against the other models, demonstrating the effectiveness of multi-level and diffused attention on top of the cubemap representation for VQA 360°. In the following, we discuss several key observations.

Limited language bias.

The top row of Table 3 (Q-type prior), which predicts the most frequent answer of each question type, examines the dataset bias. Its inferior results suggest a low language bias in our dataset. Specifically, for "exist" type questions that only have two valid answers each (i.e., "yes" or "no"), using the language prior is close to random guessing. Machines need to rely on the images to answer.

Equirectangular-based models.

As shown in Table 3, the ResNet-Avgpool model outperforms Central-crop and Resize, indicating the poor applicability of cropping and resizing to 360° images. Since 360° images have large spatial coverage, in which objects might be of small size, resizing will miss those small objects while central cropping will lose a large portion of the image content.

Cubemaps vs. equirectangular input.

One major issue of applying existing VQA models directly to 360° images is the spatial distortion. This is justified by the fact that all the equirectangular-based models are outperformed by all the cubemap-based models (except Cubemap-Avgpool) in overall performance. Specifically, by comparing Direct-split and Tucker&Diffusion, whose main difference is the input, the performance gap clearly reflects the influence of distortion. Looking into different question types, we also observe consistent improvements from applying cubemaps.

Pre-training.

Comparing the models trained from scratch with their counterparts initialized from pre-trained weights, pre-training on the VQA-1 dataset benefits the overall performance, especially for the cubemap-based models.

Attention.

Applying cubemaps resolves one challenge of VQA 360°: spatial distortion. We argue that a sophisticated way to aggregate cubemap features to support spatial reasoning is essential to further boost the performance. This is shown by the improvement of Tucker&Diffusion and Tucker over Cubemap-Avgpool: the former two apply attention mechanisms guided by questions and cubemap locations for multi-level attention. Specifically, Tucker&Diffusion outperforms Cubemap-Avgpool by a notable 4% (60.26 vs. 56.23) on Avg. by Q type, mostly from the "spatial" question type. Tucker&Diffusion with spatial diffusion also outperforms Tucker on most question types.

Figure 5: Visualization of attention. We use the cubemap-based model Tucker&Diffusion as it performs the best. The digits below the cubemaps indicate the attention across cubemaps. The heat maps indicate the attention within cubemaps.
Model Avg. Avg. by Q type Spatial
Tucker (w/o) 53.81 53.81 36.09
Tucker (w/) 57.71 59.07 53.47
Tucker&Diffusion (w/o) 54.91 56.51 39.13
Tucker&Diffusion (w/) 58.66 60.26 55.98
Table 4: Comparison of w/ and w/o location feature.
Model Overall Scene Exist Counting Property Spatial
Human 84.05 88.95 91.79 71.58 89.97 85.25
Machine 59.80 68.89 77.12 49.65 45.81 61.97
Table 5: Results of human evaluation. We also include the machine’s performance on the same 1,000 questions to analyze the humans’ and machines’ gap.

Location feature.

Concatenating the location indicator $l_i$ with $f_i$ in (2) and (4) enables our model to differentiate cubemaps. Table 4 compares Tucker&Diffusion and Tucker with and without $l_i$. The location indicator leads to consistent improvements, especially on the "spatial" type questions.

Human Evaluation.

We conduct a user study on our VQA 360° dataset. We sample 1,000 image-question-answer triplets from the test set and ask at least two different users to answer each question. To ease the process, we give users five candidate answers, including the correct answer and four other answers that are semantically related to the question. A separate pool of unique users participated in the user study; we note that the annotators labeling our dataset are not involved in the human evaluation to avoid any bias.

We summarize the results of the human evaluation and the machine's predictions (using our best cubemap-based model, Tucker&Diffusion) in Table 5. Humans achieve an overall accuracy of 84.05%, which is at the same level as many existing VQA datasets [2, 6, 50] and much higher than another dataset on indoor images [37], justifying the quality of our VQA 360° dataset. Among the five question types, humans perform relatively poorly on "counting", which makes sense due to the complicated contents of the images and the possibly small objects. Overall, there is a gap of about 24 points between human and machine performance. The gap is especially large on the "counting", "property", and "spatial" types, suggesting directions to improve algorithms so as to match humans' inference abilities.

Qualitative results.

We present qualitative results in Figure 5. Besides showing the predicted answers, we visualize the attention weights across cubemaps (by the digits) and within cubemaps (by the heat maps). The cubemap-based model with Tucker&Diffusion can zoom in to the cubemaps related to the questions, capture the answer regions, and aggregate them to predict the final answers. Take the question "Which side of the window is the painting?" for example (the top-left one in Figure 5). The model puts high attention on the cubemaps with windows and pictures and is able to infer the relative location. For the question "What room is depicted in the image?" (the top-right of Figure 5), the model distributes attention to all cubemaps except the top and bottom ones to gather information from them. We also show a failure case in the bottom-right of Figure 5. The question asks "Which side of the door is the whiteboard?". However, the model mistakenly recognizes the window as the whiteboard and incorrectly answers "right side".

6 Discussion and Conclusion

We introduce VQA 360°, a novel VQA task on a challenging visual domain, 360° images. We collect the first VQA 360° dataset and experiment with multiple VQA models. We then present a multi-level attention model to effectively handle spatial distortion (via cubemaps) and perform sophisticated reasoning. Experimental results demonstrate the need to explicitly model the intrinsic properties of 360° images, while the noticeable gap between humans' and machines' performance reveals the difficulty of reasoning on 360° images compared to NFOV images.

We surmise that the gap may partially be attributed to the hand-crafted cubemap cropping. On the one hand, objects that appear around cubemap boundaries may be split. On the other hand, it requires specifically designed mechanisms (e.g., the attention diffusion in (4)) to reason about the spatial relationships among cubemaps. These issues likely explain the human-machine gap on the "counting" and "spatial" questions. Thus, to advance VQA 360°, we suggest developing image-dependent cropping that detects object regions in the equirectangular images. We also suggest developing a back-projection-and-inference mechanism that back-projects the detected objects into the 360° environment and performs reasoning accordingly. Besides, the current questions are generated (or initialized) by templates; future work includes involving more human effort to increase question diversity. We expect our dataset and studies to serve as the benchmark for future developments.

Acknowledgments.

This work is supported in part by NSF CAREER (# 1149783) and MOST 108-2634-F-007-006 Joint Research Center for AI Technology and All Vista Healthcare, Taiwan.

7 Supplementary Material

In this section, we present additional results to complement the main paper.

  • Section 7.1: Details on data collection (cf. Section 3 in the main paper).

  • Section 7.2: Implementation details of the proposed model (cf. Section 4.2 and 5.1 in the main paper).

  • Section 7.3: Additional experimental results on the backbone VQA model and the answer prediction strategy for the cubemap-based models (cf. Section 5 in the main paper).

  • Section 7.6: Additional qualitative results (cf. Section 5.3 in the main paper).

7.1 Data Collection

Question generation.

We design templates with placeholders (cf. Table 1 of the main paper) to automatically generate questions. We fill in obj and scene according to the semantic segmentation and scene types given by the Stanford 2D-3D [3] and Matterport3D [5] datasets. We fill in the color of obj according to the corresponding pixel values. For direc of obj, we derive it from the corresponding cubemap location. To generate questions with either "no" or "0" as the answer, we fill in combinations of obj, color, and direc that do not appear in the images.

Answer annotation.

We provide specific guidance (cf. Figure 2 of the main paper) for the annotators to identify directions and locations in a 360° image. We note that human annotators are allowed to modify the questions to make them less ambiguous or more related to the image contents. In detail, we instruct human annotators to modify the questions by following the templates in Table 1 of the main paper. This flexibility also increases the diversity of the questions in our VQA 360° dataset.

Question types.

Our templates can be categorized into five different types: “scene”, “exist”, “counting”, “property” and “spatial”.

  • “Scene” type: related to scene or room types, e.g., kitchen, office, etc.

  • “Exist” type: related to object presences and positions.

  • "Counting" type: for object counting; may involve object attributes and positions.

  • “Property” type: for object attributes, e.g., color and material.

  • “Spatial” type: related to objects’ relative positions and the photographer’s relative position.

The "exist", "counting", "property", and "spatial" type questions generally require a model to infer answers from multiple locations (potentially across the entire scene) in a 360° image.

NFOV vs. 360° images.

The demand for spatial reasoning is the key difference for visual QA on 360° images (as mentioned in Section 3.2 of the main paper). Therefore, the proposed dataset includes questions for spatial reasoning or with spatial cues, either by the templates or by the annotators. Figure 6 shows examples used in the VQA2 paper [21]. This is also evidenced by the small fraction of questions belonging to the "where" type in VQA2. In contrast, objects in 360° images are widely distributed (even behind the observer), and we have a much larger proportion of "where"-type questions.

Figure 6: Examples of VQA2 and VQA 360°.

7.2 Implementation Details

We provide the implementation details of the proposed model, following the notations introduced in Section 4.1 in the main paper. We focus on Tucker&Diffusion, together with MLB [28] as the backbone VQA model and fusion aggregation to predict the answer (cf. Section 4.2 and Figure 4 in the main paper).

  • Step 1: Extract the question feature $t$.

  • Step 2: Extract the visual feature $v_i$ for each cubemap $i$.

  • Step 3: Extract the local multimodal feature $f_i = \mathcal{F}(v_i, t)$ for each cubemap $i$, where $\mathcal{F}$ is the MLB model without the last fully-connected layer.

  • Step 4: Compute the attention weight for each cubemap $i$, $\alpha_i = \operatorname{softmax}_i(\operatorname{Tucker}([f_i; l_i], t))$, where Tucker is the Tucker fusion module [4] with a one-dimensional output and $l_i$ is a one-hot location feature. We note that the Tucker fusion's output dimensionality is adjustable by adding a fully-connected layer.

  • Step 5: Generate a diffusion matrix $D$ conditioned on the question feature $t$.

  • Step 6: Compute the aggregated feature over cubemaps, $h = \sum_i \hat{\alpha}_i [f_i; l_i]$ with $\hat{\alpha}_i = \sum_j D_{ij}\, \alpha_j$, where we concatenate the location feature $l_i$ with the multimodal feature $f_i$.

  • Step 7: Extract a higher-level multimodal feature $g = \operatorname{Tucker}(h, t)$, where Tucker is the Tucker fusion module with a multi-dimensional output.

  • Step 8: Feed $g$ into the classifier $\mathcal{C}$, implemented by a fully-connected layer, to predict the answer. A code sketch of these steps is given below.
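Below is a compact end-to-end sketch of Steps 1-8 in PyTorch, with simple stand-ins (an elementwise product and bilinear layers) for the MLB backbone and the Tucker fusion modules; all layer choices and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CubemapVQA(nn.Module):
    def __init__(self, vocab=10000, d=512, n_cubes=6, n_answers=51):
        super().__init__()
        self.n = n_cubes
        self.embed = nn.Embedding(vocab, 300)
        self.q_rnn = nn.GRU(300, d, batch_first=True)            # Step 1: question feature t
        self.v_proj = nn.Linear(2048, d)                          # Step 2: cubemap visual features v_i
        self.scorer = nn.Bilinear(d + n_cubes, d, 1)              # Step 4: stand-in for Tucker (scalar out)
        self.diffuser = nn.Linear(d, n_cubes * n_cubes)           # Step 5: diffusion matrix D(t)
        self.fusion = nn.Bilinear(d + n_cubes, d, d)              # Step 7: stand-in for Tucker (vector out)
        self.classifier = nn.Linear(d, n_answers)                 # Step 8: answer classifier C

    def forward(self, cube_feats, question):
        # cube_feats: (B, 6, 2048) pooled CNN features per cubemap; question: (B, L) token ids.
        B = cube_feats.size(0)
        _, t = self.q_rnn(self.embed(question))
        t = t.squeeze(0)                                           # (B, d)
        v = self.v_proj(cube_feats)                                # (B, 6, d)
        f = v * t.unsqueeze(1)                                     # Step 3: local fusion F(v_i, t)
        l = torch.eye(self.n, device=f.device).expand(B, -1, -1)   # one-hot locations l_i
        fl = torch.cat([f, l], dim=-1)                             # [f_i; l_i]
        te = t.unsqueeze(1).expand(-1, self.n, -1)
        alpha = self.scorer(fl, te).squeeze(-1).softmax(-1)        # Step 4: attention weights
        D = self.diffuser(t).view(B, self.n, self.n).softmax(-1)   # Step 5
        alpha_hat = torch.bmm(D, alpha.unsqueeze(-1)).squeeze(-1)  # diffused attention
        h = (alpha_hat.unsqueeze(-1) * fl).sum(1)                  # Step 6: aggregated feature
        g = self.fusion(h, t)                                      # Step 7: higher-level fusion
        return self.classifier(g)                                  # Step 8: answer logits

logits = CubemapVQA()(torch.randn(2, 6, 2048), torch.randint(0, 10000, (2, 12)))
```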

Cubemap projection.

The cube mapping projection is a commonly used method to project an equirectangular image onto NFOV planes [17, 9, 16, 47]. Specifically, there are six cube faces (top, front, left, behind, right and bottom) that fill the whole sphere, as shown in Figure 7.

Figure 7: 360° image and cubemaps. A 360° image can be represented by six cubemaps, each corresponding to a specific spatial location, to reduce the spatial distortion.

We use the implementation in [9] to project the equirectangular images onto cubemaps.
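For reference, the sketch below performs a minimal nearest-neighbour equirectangular-to-cubemap projection for a single face; the face orientation conventions are assumptions and differ from the implementation in [9].

```python
import numpy as np

# Assumed cube-face orientations (forward, right, up vectors); conventions vary.
FACES = {
    "front":  ((0, 0, 1),  (1, 0, 0),  (0, 1, 0)),
    "back":   ((0, 0, -1), (-1, 0, 0), (0, 1, 0)),
    "right":  ((1, 0, 0),  (0, 0, -1), (0, 1, 0)),
    "left":   ((-1, 0, 0), (0, 0, 1),  (0, 1, 0)),
    "top":    ((0, 1, 0),  (1, 0, 0),  (0, 0, -1)),
    "bottom": ((0, -1, 0), (1, 0, 0),  (0, 0, 1)),
}

def cubemap_face(equi, face, out_size=256):
    """Project an equirectangular image (H x W x C) onto one cube face via perspective projection."""
    H, W = equi.shape[:2]
    fwd, right, up = (np.array(v, dtype=float) for v in FACES[face])
    a = np.linspace(-1, 1, out_size)
    xx, yy = np.meshgrid(a, -a)                     # image-plane coordinates on the face
    dirs = fwd + xx[..., None] * right + yy[..., None] * up
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    lon = np.arctan2(dirs[..., 0], dirs[..., 2])    # longitude in [-pi, pi]
    lat = np.arcsin(dirs[..., 1])                   # latitude in [-pi/2, pi/2]
    px = ((lon + np.pi) / (2 * np.pi) * (W - 1)).astype(int)
    py = ((np.pi / 2 - lat) / np.pi * (H - 1)).astype(int)
    return equi[py, px]                              # nearest-neighbour sampling

face = cubemap_face(np.zeros((512, 1024, 3), dtype=np.uint8), "front")
```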

7.3 Additional Experimental Results

We provide additional comparisons on the backbone VQA models and the cubemap-based models.

Model Variants Backbone Overall avg Avg by type Scene Exist Counting Property Spatial
Equirectangular-based ResNet-Avgpool MLB 54.47 56.14 69.34 76.81 46.32 50.96 37.25
Cubemap-based Cubemap-Avgpool MLB 55.03 56.89 71.41 76.14 47.72 52.77 36.42
Cubemap-based Tucker MLB 57.71 59.07 69.89 77.23 46.53 48.24 53.47
Cubemap-based Tucker&Diffusion MLB 58.66 60.26 72.01 76.34 46.84 50.12 55.98
Equirectangular-based ResNet-Avgpool MUTAN 52.05 53.35 65.08 73.18 46.12 47.00 35.38
Cubemap-based Cubemap-Avgpool MUTAN 53.56 53.56 69.07 74.86 46.37 50.72 35.45
Cubemap-based Tucker MUTAN 54.06 55.29 65.13 74.59 45.06 48.20 43.49
Cubemap-based Tucker&Diffusion MUTAN 54.08 55.80 69.82 75.57 46.32 49.32 37.96
Equirectangular-based - Pythia 50.37 49.59 45.02 43.51 72.69 47.63 39.09
Cubemap-based Cubemap-Avgpool Pythia 50.90 51.47 56.37 43.99 70.92 48.68 37.37
Cubemap-based Tucker Pythia 51.88 51.34 49.84 46.32 75.48 47.87 37.21
Cubemap-based Tucker&Diffusion Pythia 53.06 52.43 50.41 48.34 75.26 49.00 39.13
Table 6: Comparison of VQA backbone models. We use the MLB, MUTAN, and Pythia pre-trained VQA models as the backbone of the proposed method and evaluate the performance on our VQA 360° test set.
Model Variants Ans. Prediction Overall avg Avg by type Scene Exist Counting Property Spatial
Cubemap-based Cubemap-Avgpool Aggregation 55.03 56.89 71.41 76.15 47.72 52.77 36.42
Cubemap-based Cubemap-Avgpool Fusion Aggregation 54.60 56.23 69.17 76.22 46.79 51.72 37.26
Cubemap-based Tucker Aggregation 54.12 54.94 62.97 75.24 46.22 46.63 43.62
Cubemap-based Tucker Fusion Aggregation 57.71 59.07 69.89 77.23 46.53 48.24 53.47
Cubemap-based Tucker&Diffusion Aggregation 55.21 56.52 66.67 75.75 47.39 52.81 39.99
Cubemap-based Tucker&Diffusion Fusion Aggregation 58.66 60.26 72.01 76.34 46.84 50.12 55.98
Table 7: Comparison of answer prediction strategies. We compare the Aggregation and Fusion Aggregation (cf. Figure 4 of the main paper) methods with the cubemap-based models. We report results on the VQA 360° test set.

7.4 Comparisons on Backbone VQA Models

We compare three pre-trained VQA models, MLB [28], MUTAN [4] and Pythia [40, 41], as the backbone of the proposed method. Table 6 summarizes the results of an equirectangular model, ResNet-Avgpool, as well as three cubemap-based models, Cubemap-Avgpool, Tucker, and Tucker&Diffusion. We observe a similar trend as discussed in Section 5.2 of the main paper: the cubemap-based methods generally outperform the equirectangular-based models, while the cubemap-based Tucker&Diffusion model with multi-level attention performs favorably against the other variants. Note that the state-of-the-art model on NFOV images, Pythia [40, 41], does not perform better than the MLB and MUTAN models. The possible reasons are: 1) its object detector does not generalize well to the VQA 360° dataset, and 2) the cubemap projection sometimes splits an object into multiple parts. These observations indicate potential future directions on exploring adaptive cubemap projections or object detection on 360° images.

7.5 Comparisons on Answer Prediction Strategies

As mentioned in Section 4 of the main paper, we fuse the multimodal feature with the question feature (which we name Fusion Aggregation) before inputting it to the classifier for predicting the answer. Here we study another, simpler strategy—the multimodal feature is directly input into the classifier for prediction—which we name Aggregation. In Table 7, we compare these two answer prediction strategies on the cubemap-based Cubemap-Avgpool, Tucker, and Tucker&Diffusion models.

For the Cubemap-Avgpool model, having a higher-level fusion degrades the performance. However, for the Tucker and Tucker&Diffusion models, the fusion aggregation clearly improves the overall performance. Since both Tucker and Tucker&Diffusion use the location indicators as one of the features (see Step 6 in Section 7.2), the fusion aggregation is necessary to associate the question with certain cubemaps so as to answer questions such as "Which side of the TV is the pictures?" in Figure 1 of the main paper. We note that, as shown in Table 4 of the main paper, adding the location feature leads to notable improvements on "spatial" type questions.

7.6 Additional Qualitative Results

We provide more qualitative results using our cubemap-based model with Tucker&Diffusion in Figure 8 and Figure 9. For each image, we show both correct predictions and failure cases. We observe two notable failure modes—colors and properties—both of which require accurately locating the objects according to the questions, especially for small objects. We suggest that further improvements can be achieved by advanced object detection in 360° images.

Figure 8: Qualitative results. We show both correct predictions and failure cases (highlighted by red font).
Figure 9: Qualitative results. We show both correct predictions and failure cases (highlighted by red font).

References

  • [1] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang (2018) Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, Cited by: §2, §2.
  • [2] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh (2015) VQA: visual question answering. In ICCV, Cited by: §1, §2, §4.1, §4.1, §5.1, §5.2.
  • [3] I. Armeni, A. Sax, A. R. Zamir, and S. Savarese (2017) Joint 2D-3D-Semantic data for indoor scene understanding. arXiv. Cited by: §3.1, §7.1.
  • [4] H. Ben-Younes, R. Cadene, M. Cord, and N. Thome (2017) Mutan: multimodal tucker fusion for visual question answering. In ICCV, Cited by: §1, §2, §4.2, §5.1, 4th item, §7.4.
  • [5] A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, M. Savva, S. Song, A. Zeng, and Y. Zhang (2017) Matterport3D: learning from RGB-D data in indoor environments. In 3DV, Cited by: §3.1, §7.1.
  • [6] W. Chao, H. Hu, and F. Sha (2018) Being negative but constructively: lessons learnt from creating better visual question answering datasets. In NAACL, Cited by: §2, §5.2.
  • [7] H. Chen, A. Suhr, D. Misra, N. Snavely, and Y. Artzi (2019) Touchdown: natural language navigation and spatial reasoning in visual street environments. In CVPR, Cited by: §2.
  • [8] Y. Chen, L. Li, L. Yu, A. E. Kholy, F. Ahmed, Z. Gan, Y. Cheng, and J. Liu (2019) Uniter: learning universal image-text representations. arXiv preprint arXiv:1909.11740. Cited by: §2.
  • [9] H. Cheng, C. Chao, J. Dong, H. Wen, T. Liu, and M. Sun (2018) Cube padding for weakly-supervised saliency prediction in 360° videos. In CVPR, Cited by: §1, §2, §7.2, §7.2.
  • [10] S. Chou, Y. Chen, K. Zeng, H. Hu, J. Fu, and M. Sun (2018) Self-view grounding given a narrated 360° video. In AAAI, Cited by: §1, §2.
  • [11] S. Chou, C. Sun, W. Chang, W. Hsu, M. Sun, and J. Fu (2019) 360-indoor: towards learning real-world objects in 360° indoor equirectangular images. arXiv. Cited by: §2.
  • [12] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv. Cited by: §4.1.
  • [13] A. Das, S. Datta, G. Gkioxari, S. Lee, D. Parikh, and D. Batra (2018) Embodied question answering. In CVPR, Cited by: §2.
  • [14] G. P. de La Garanderie and A. Atapour (2018) Eliminating the blind spot: adapting 3D object detection and monocular depth estimation to panoramic imagery. In ECCV, Cited by: §2.
  • [15] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv. Cited by: §2.
  • [16] T. El-Ganainy and M. Hefeeda (2016) Streaming virtual reality content. arXiv. Cited by: §7.2.
  • [17] Facebook Under the hood: building 360 video. Note: https://engineering.fb.com/video-engineering/under-the-hood-building-360-video/ Cited by: §7.2.
  • [18] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach (2016) Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv. Cited by: §1, §2.
  • [19] H. Gao, J. Mao, J. Zhou, Z. Huang, L. Wang, and W. Xu (2015) Are you talking to a machine? dataset and methods for multilingual image question. In NIPS, Cited by: §1.
  • [20] D. Gordon, A. Kembhavi, M. Rastegari, J. Redmon, D. Fox, and A. Farhadi (2018) IQA: visual question answering in interactive environments. In CVPR, Cited by: §2.
  • [21] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh (2017) Making the V in VQA matter: elevating the role of image understanding in visual question answering. In CVPR, Cited by: §1, §1, §2, §7.1.
  • [22] N. Greene (1986) Environment mapping and other applications of world projections. IEEE CGA. Cited by: §1.
  • [23] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §1, §4.1.
  • [24] H. Hu, Y. Lin, M. Liu, H. Cheng, Y. Chang, and M. Sun (2017) Deep 360 pilot: learning a deep agent for piloting through 360 sports videos. In CVPR, Cited by: §1, §2, §3.1.
  • [25] J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick, and R. Girshick (2017) CLEVR: a diagnostic dataset for compositional language and elementary visual reasoning. In CVPR, Cited by: §1.
  • [26] K. Kafle and C. Kanan (2017) An analysis of visual question answering algorithms. In ICCV, Cited by: §1, §1.
  • [27] V. Kazemi and A. Elqursh (2017) Show, ask, attend, and answer: a strong baseline for visual question answering. arXiv. Cited by: §1.
  • [28] J. Kim, K. On, W. Lim, J. Kim, J. Ha, and B. Zhang (2017) Hadamard product for low-rank bilinear pooling. In ICLR, Cited by: §1, §4.1, §4.1, §4.2, §4.2, §5.1, §7.2, §7.4.
  • [29] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In ICLR, Cited by: §5.1.
  • [30] R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler (2015) Skip-thought vectors. In NIPS, Cited by: §4.1.
  • [31] J. Kopf (2016) 360° video stabilization. ACM TOG. Cited by: §2.
  • [32] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. Li, D. A. Shamma, M. S. Bernstein, and F. Li (2017) Visual genome: connecting language and vision using crowdsourced dense image annotations. IJCV. Cited by: §2.
  • [33] W. Lai, Y. Huang, N. Joshi, C. Buehler, M. Yang, and S. B. Kang (2017) Semantic-driven generation of hyperlapse from 360° video. TVCG. Cited by: §2, §3.1.
  • [34] L. H. Li, M. Yatskar, D. Yin, C. Hsieh, and K. Chang (2019) Visualbert: a simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557. Cited by: §2.
  • [35] J. Lu, D. Batra, D. Parikh, and S. Lee (2019) Vilbert: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS, Cited by: §2.
  • [36] J. Lu, J. Yang, D. Batra, and D. Parikh (2016) Hierarchical question-image co-attention for visual question answering. In NIPS, Cited by: §2.
  • [37] M. Malinowski and M. Fritz (2014) A multi-world approach to question answering about real-world scenes based on uncertain input. In NIPS, Cited by: §1, §5.2.
  • [38] M. Ren, R. Kiros, and R. Zemel (2015) Exploring models and data for image question answering. In NIPS, Cited by: §1.
  • [39] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In ICLR, Cited by: §1.
  • [40] A. Singh, V. Goswami, V. Natarajan, Y. Jiang, X. Chen, M. Shah, M. Rohrbach, D. Batra, and D. Parikh (2018) Pythia-a platform for vision & language research. In SysML Workshop, NeurIPS, Vol. 2018. Cited by: §5.1, §7.4.
  • [41] A. Singh, V. Natarajan, M. Shah, Y. Jiang, X. Chen, D. Batra, D. Parikh, and M. Rohrbach (2019) Towards vqa models that can read. In CVPR, Cited by: §7.4.
  • [42] W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, and J. Dai (2019) Vl-bert: pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530. Cited by: §2.
  • [43] Y. Su and K. Grauman (2017) Learning spherical convolution for fast features from 360° imagery. In NIPS, Cited by: §1, §2, §4.1.
  • [44] Y. Su and K. Grauman (2018) Learning compressible 360° video isomers. In CVPR, Cited by: §2, §3.1.
  • [45] Y. Su, D. Jayaraman, and K. Grauman (2016) Pano2Vid: automatic cinematography for watching 360° videos. In ACCV, Cited by: §1, §2, §3.1.
  • [46] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NeurIPS, Cited by: §2.
  • [47] F. Wang, H. Hu, H. Cheng, J. Lin, S. Yang, M. Shih, H. Chu, and M. Sun (2018) Self-supervised learning of depth and camera motion from 360° videos. In ACCV, Cited by: §7.2.
  • [48] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. In ICML, Cited by: §2.
  • [49] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola (2016) Stacked attention networks for image question answering. In CVPR, Cited by: §2, §2, §4.1, §4.2, §4.
  • [50] L. Yu, E. Park, A. C. Berg, and T. L. Berg (2015) Visual madlibs: fill in the blank description generation and question answering. In ICCV, Cited by: §5.2.
  • [51] K. Zeng, T. Chen, C. Chuang, Y. Liao, J. C. Niebles, and M. Sun (2017) Leveraging video descriptions to learn video question answering.. In AAAI, Cited by: §2.
  • [52] Y. Zhu, O. Groth, M. Bernstein, and L. Fei-Fei (2016) Visual7w: grounded question answering in images. In CVPR, Cited by: §1, §2.