We are interested in the problem of visual question answering (VQA), where an algorithm is presented with an image and a question that is formulated in natural language and relates to the contents of the image. The goal of this task is to get the algorithm to correctly answer the question. The VQA task has recently received significant attention from the computer vision community, in particular because obtaining high accuracies would presumably require precise understanding of both natural language as well as visual stimuli. In addition to serving as a milestone towards visual intelligence, there are practical applications such as development of tools for the visually impaired.
The problem of VQA is challenging due to the complex interplay between the language and visual modalities. On one hand, VQA algorithms must be able to parse and interpret the input question, which is provided in natural language [8, 14, 9]. This may involve understanding nouns, verbs, and other linguistic elements, as well as their visual significance. On the other hand, the algorithms must analyze the image to identify and recognize the visual elements relevant to the question. Furthermore, some questions refer directly to the contents of the image but require external, common-sense knowledge to be answered correctly. Finally, the algorithms should generate a textual output in natural language that correctly answers the input visual question. In spite of the recent research efforts to address these challenges, the problem remains largely unsolved.
We are particularly interested in giving VQA algorithms the ability to identify the visual elements that are relevant to the question. In the VQA literature, such ability has been implemented by attention mechanisms. Such attention mechanisms generate a heatmap over the input image, which highlights the regions of the image that lead to the answer. These heatmaps are interpreted as groundings of the answer to the most relevant areas of the image. Generally, these mechanisms have either been considered as latent variables for which there is no supervision, or have been treated as output variables that receive direct supervision from human annotations. Unfortunately, both of these approaches have disadvantages. First, unsupervised training of attention tends to lead to models that cannot ground their decision in the image in a human interpretable manner. Second, supervised training of attention is difficult and expensive: human annotators may consider different regions to be relevant for the question at hand, which entails ambiguity and increased annotation cost. Our goal is to leverage the best of both worlds by providing VQA algorithms with interpretable grounding of their answers, without the need of direct and explicit manual annotation of attention.
From a practical point of view, as autonomous machines increasingly find real-world applications, there is a growing need to provide them with suitable capabilities to explain their decisions. However, in most applications, including VQA, current state-of-the-art techniques operate as black-box models that are usually trained using a discriminative approach. In line with previous observations, in this work we show that, in the context of VQA, such approaches lead to internal representations that do not capture the underlying semantic relations between textual questions and visual information. Consequently, current state-of-the-art approaches to VQA are not able to support their answers with a suitable interpretable representation.
In this work, we introduce a methodology that provides VQA algorithms with the ability to generate human interpretable attention maps which effectively ground the answer to the relevant image regions. We accomplish this by leveraging region descriptions and object annotations available in the Visual Genome dataset, and using these to automatically construct attention maps that can be used for attention supervision, instead of requiring human annotators to manually provide grounding labels. Our framework achieves competitive state-of-the-art VQA performance, while generating visual groundings that outperform other algorithms that use human annotated attention during training.
The contributions of this paper are: (1) we introduce a mechanism to automatically obtain meaningful attention supervision from both region descriptions and object annotations in the Visual Genome dataset; (2) we show that by using the prediction of region and object label attention maps as auxiliary tasks in a VQA application, it is possible to obtain more interpretable intermediate representations; and (3) we experimentally demonstrate state-of-the-art performance on VQA benchmarks as well as visual grounding that closely matches human attention annotations.
2 Related Work
Since its introduction [8, 14, 9], the VQA problem has attracted increasing interest. Its multimodal nature and more precise evaluation protocol than alternative multimodal scenarios, such as image captioning, help to explain this interest. Furthermore, the proliferation of suitable datasets and potential applications are also key elements behind this increasing activity. Most state-of-the-art methods follow a joint embedding approach, where deep models are used to project the textual question and visual input to a joint feature space that is then used to build the answer. Furthermore, most modern approaches pose VQA as a classification problem, where classes correspond to a set of pre-defined candidate answers. As an example, most entries to the VQA challenge select as output classes the 3000 most common answers in this dataset, which account for 92% of the instances in the validation set.
The strategy to combine the textual and visual embeddings and the underlying structure of the deep model are key design aspects that differentiate previous works. Antol et al. propose an element-wise multiplication between image and question embeddings to generate a spatial attention map. Fukui et al. propose multimodal compact bilinear pooling (MCB) to efficiently implement an outer product operator that combines visual and textual representations. Yu et al. extend this pooling scheme by introducing a multi-modal factorized bilinear pooling approach (MFB) that improves the representational capacity of the bilinear operator. They achieve this by adding an initial step that efficiently expands the textual and visual embeddings to a high-dimensional space. In terms of structural innovations, Noh et al. embed the textual question as an intermediate dynamic bilinear layer of a ConvNet that processes the visual information. Andreas et al. propose a model that learns a set of task-specific neural modules that are jointly trained to answer visual questions.
Following the successful introduction of soft attention in neural machine translation applications, most modern VQA methods also incorporate a similar mechanism. The common approach is a one-way attention scheme, where the embedding of the question is used to generate a set of attention coefficients over a set of predefined image regions. These coefficients are then used to weight the embedding of the image regions to obtain a suitable descriptor [19, 21, 6, 25, 26]. More elaborate forms of attention have also been proposed. Xu and Saenko suggest using word-level embeddings to generate attention. Yang et al. iterate the application of a soft-attention mechanism over the visual input as a way to progressively refine the location of relevant cues to answer the question. Lu et al. propose a bidirectional co-attention mechanism that, besides the question-guided visual attention, also incorporates a visually guided attention over the input question.
In all the previous cases, the attention mechanism is applied in an unsupervised scheme, where attention coefficients are considered as latent variables. Recently, there has also been interest in adding a supervised attention scheme to the VQA problem [5, 7, 18]. Das et al. compare the image areas selected by humans and by state-of-the-art VQA techniques to answer the same visual question. To achieve this, they collect the VQA human attention dataset (VQA-HAT), a large dataset of human attention maps built by asking humans to select the image areas relevant to answer questions from the VQA dataset. Interestingly, this study concludes that current machine-generated attention maps exhibit a poor correlation with their human counterparts, suggesting that humans use different visual cues to answer the questions. At a more fundamental level, this suggests that the discriminative nature of most current VQA systems does not effectively constrain the attention modules, leading to the encoding of discriminative cues instead of the underlying semantics that relate a given question-answer pair. Our findings in this work support this hypothesis.
Gan et al. apply a more structured approach to identify the image areas used by humans to answer visual questions. For VQA pairs associated with images in the COCO dataset, they ask humans to select the segmented areas in COCO images that are relevant to answer each question. Afterwards, they use these areas as labels to train a deep learning model that is able to identify attention features. By augmenting a standard VQA technique with these attention features, they are able to achieve a small boost in performance. Closely related to our approach, Qiao et al. use the attention labels in the VQA-HAT dataset to train an attention proposal network that is able to predict image areas relevant to answering a visual question. This network generates a set of attention proposals for each image in the VQA dataset, which are used as labels to supervise attention in the VQA model. This strategy results in a small boost in performance compared with a non-attentional strategy. In contrast to our approach, these previous works are based on a supervised attention scheme that does not consider an automatic mechanism to obtain the attention labels. Instead, they rely on human-annotated groundings as attention supervision. Furthermore, they differ from our work in the method used to integrate attention labels into a VQA model.
3 VQA Model Structure
Figure 2 shows the main pipeline of our VQA model. We mostly build upon the MCB model, which exemplifies current state-of-the-art techniques for this problem. Our main innovation to this model is the addition of an Attention Supervision Module that incorporates visual grounding as an auxiliary task. Next we describe the main modules of this model.
Question Attention Module: Questions are tokenized and passed through an embedding layer, followed by an LSTM layer that generates the question features $Q \in \mathbb{R}^{T \times D}$, where $T$ is the maximum number of words in the tokenized version of the question and $D$ is the dimensionality of the hidden state of the LSTM. Additionally, following prior work, a question attention mechanism is added that generates question attention coefficients $\alpha_q \in \mathbb{R}^{T \times G}$, where $G$ is the so-called number of "glimpses". The purpose of the glimpses is to allow the model to predict multiple attention maps so as to increase its expressiveness. Here, we use two glimpses. The weighted question features $\tilde{q}$ are then computed using a soft attention mechanism, which is essentially a weighted sum of the word features followed by a concatenation of the $G$ glimpse results.
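The soft question attention described above can be sketched at the shape level in NumPy. The weight matrix `w_attn`, the two-glimpse default, and all names here are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=0):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def question_attention(q_feats, w_attn, num_glimpses=2):
    """Soft question attention sketch.

    q_feats: (T, D) word features from the LSTM.
    w_attn:  (D, G) projection producing one attention logit per word
             and glimpse. Returns the (G*D,) concatenated weighted sum.
    """
    logits = q_feats @ w_attn                  # (T, G)
    alphas = softmax(logits, axis=0)           # normalize over the T words
    # One weighted sum of word features per glimpse, then concatenate.
    glimpses = [alphas[:, g] @ q_feats for g in range(num_glimpses)]
    return np.concatenate(glimpses)            # (G*D,)
```

With `T = 5` words, `D = 8` hidden units and two glimpses, the output is a single vector of length 16.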
Image Attention Module:
Images are passed through an embedding layer consisting of a pre-trained ConvNet model, such as a Resnet pretrained on the ImageNet dataset. This generates image features $V \in \mathbb{R}^{C \times H \times W}$, where $C$, $H$, and $W$ are the depth, height, and width of the extracted feature maps. Fusion Module I is then used to generate a set of image attention coefficients. First, the question features $\tilde{q}$ are tiled to match the spatial shape of $V$. Afterwards, the fusion module models the joint relationship between questions and images, mapping them to a common space. In the simplest case, one can implement the fusion module using either concatenation or a Hadamard product, but more effective pooling schemes can be applied [6, 11, 25, 26]. The design choice of the fusion module remains an on-going research topic: in general, it should effectively capture the latent relationship between the multi-modal features while remaining easy to optimize. The fusion results are then passed through an attention module that computes the visual attention coefficients $\alpha_v$, with which we obtain the attention-weighted visual features $\tilde{v}$. Again, $G$ is the number of "glimpses"; as before, we use two.
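The tile-fuse-attend pipeline can be sketched as follows, using plain concatenation as the fusion operator. This is a deliberate simplification (the paper's models use MCB/MFB pooling), and all weight shapes and names are hypothetical:

```python
import numpy as np

def image_attention(v_feats, q_vec, w_fuse, w_attn, num_glimpses=2):
    """Question-guided image attention sketch.

    v_feats: (C, H, W) image feature maps.
    q_vec:   (D,) question embedding, tiled over the grid.
    w_fuse:  (C+D, K) projection into a joint space (concatenation fusion,
             a simplification of MCB/MFB pooling).
    w_attn:  (K, G) projection producing G spatial attention maps.
    Returns the pooled (G*C,) visual descriptor and the (G, H, W) maps.
    """
    C, H, W = v_feats.shape
    v = v_feats.reshape(C, H * W).T                 # (HW, C)
    q = np.tile(q_vec, (H * W, 1))                  # (HW, D), tiled question
    fused = np.tanh(np.concatenate([v, q], axis=1) @ w_fuse)  # (HW, K)
    logits = fused @ w_attn                         # (HW, G)
    e = np.exp(logits - logits.max(axis=0, keepdims=True))
    alphas = e / e.sum(axis=0, keepdims=True)       # softmax over locations
    pooled = [alphas[:, g] @ v for g in range(num_glimpses)]
    return np.concatenate(pooled), alphas.T.reshape(num_glimpses, H, W)
```

Each glimpse yields one attention map over the $H \times W$ grid that sums to 1, and the pooled descriptors are concatenated.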
Classification Module: Using the compact representations of the question, $\tilde{q}$, and of the visual information, $\tilde{v}$, the classification module first applies Fusion Module II, which provides the feature representation of answers $h$ in a latent answer space. Afterwards, it computes the logits over a set of predefined candidate answers. Following previous work, we use as candidate outputs the top 3000 most frequent answers in the VQA dataset. At the end of this process, we obtain the highest scoring answer $\hat{a}$.
Attention Supervision Module: As the main novelty of our VQA model, we add an Image Attention Supervision Module as an auxiliary classification task, where ground-truth visual grounding labels are used to guide the model to focus on meaningful parts of the image when answering each question. To do so, we simply treat the generated attention coefficients $\alpha_v$ as a probability distribution, and compare them with the ground truth using the KL-divergence. Notably, we introduce two attention maps, corresponding to relevant region-level and object-level groundings, as shown in Figure 3. Sections 4 and 5 provide details about our proposed method to obtain the attention labels and to train the resulting model, respectively.
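Assuming both the predicted and the ground-truth maps are spatially L1-normalized, the comparison reduces to a standard KL-divergence. A minimal NumPy version (the `eps` smoothing term is an assumption for numerical safety, not something the paper specifies):

```python
import numpy as np

def attention_kl(gt_map, pred_map, eps=1e-8):
    """KL(gt || pred) between two attention maps of the same shape.

    Both maps are flattened and renormalized so each sums to 1; eps
    guards against log(0) for empty attention cells.
    """
    p = gt_map.ravel() / (gt_map.sum() + eps)
    q = pred_map.ravel() / (pred_map.sum() + eps)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))
```

The divergence is zero when the predicted attention matches the ground truth exactly, and grows as probability mass is placed on regions the label considers irrelevant.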
4 Mining Attention Supervision from Visual Genome
Visual Genome (VG) includes the largest VQA dataset currently available, which consists of 1.7M QA pairs. Furthermore, for each of its more than 100K images, VG also provides region and object annotations by means of bounding boxes. In terms of visual grounding, these region and object annotations provide complementary information. As an example, as shown in Figure 3, for questions related to the interaction between objects, region annotations are highly relevant. In contrast, for questions related to properties of specific objects, object annotations are more valuable. Consequently, in this section we present a method to automatically select region and object annotations from VG that can be used as labels to implement visual grounding as an auxiliary task for VQA.
Figure 3: Examples of mined grounding labels. (a) Region-level grounding. Q: What are the people doing? A: Talking. (b) Object-level grounding. Q: How many people are there? A: Two.
For region annotations, we propose a simple heuristic to mine visual groundings: for each question-answer pair $(Q, A)$ associated with an image $I$, we enumerate all the region descriptions of $I$ and pick the description that has the most (at least two) informative words overlapping with $Q$ and $A$. Informative words are all nouns and verbs, where two informative words are matched if at least one of the following conditions is met: (1) their raw text, as it appears in $Q$ or $A$, is the same; (2) their lemmatizations (using NLTK) are the same; (3) their synsets in WordNet are the same; (4) their aliases (provided by VG) are the same. We refer to the resulting labels as region-level groundings. Figure 3(a) illustrates an example of a region-level grounding.
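The matching heuristic above can be sketched in plain Python. The lemma, synset, and alias lookups are injected as functions so the sketch stays self-contained; the paper uses NLTK lemmatization, WordNet synsets, and VG aliases, which are stubbed here with simple lowercase comparisons:

```python
def words_match(w1, w2, lemma=str.lower,
                synsets=lambda w: {w.lower()},
                aliases=lambda w: {w.lower()}):
    """Match two informative words by raw text, lemma, synset, or alias.

    The default lookups are stand-ins; in the paper these come from
    NLTK (lemmas), WordNet (synsets) and Visual Genome (aliases).
    """
    return (w1 == w2
            or lemma(w1) == lemma(w2)
            or bool(synsets(w1) & synsets(w2))
            or bool(aliases(w1) & aliases(w2)))

def best_region(qa_words, regions, min_overlap=2):
    """Pick the region description sharing the most (at least
    min_overlap) informative words with the question/answer words.

    regions: list of dicts with a "words" entry; returns None when no
    region reaches the overlap threshold.
    """
    best, best_score = None, min_overlap - 1
    for region in regions:
        score = sum(any(words_match(q, r) for r in region["words"])
                    for q in qa_words)
        if score > best_score:
            best, best_score = region, score
    return best
```

For the example of Figure 3(a), a description like "people talking" would win for Q/A words {people, doing, talking}, while a region sharing only one word would be rejected.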
In terms of object annotations, for each image $I$ in a $(I, Q, A)$ triplet we select the bounding box of an object as a valid grounding label if the object name matches one of the informative nouns in $Q$ or $A$, scoring each match with the same criteria used for region-level groundings. Additionally, if a triplet has a valid region grounding, each corresponding object-level grounding must lie inside this region to be accepted as valid. As a further refinement, the selected object groundings are passed through an intersection-over-union filter to account for the fact that VG usually includes multiple labels for the same object instance. As a final consideration, for questions related to counting, region-level groundings are discarded after the corresponding object-level groundings are extracted. We refer to the resulting labels as object-level groundings. Figure 3(b) illustrates an example of an object-level grounding.
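The intersection-over-union filter for duplicate object boxes might look as follows; the 0.5 threshold is an assumed value, as the paper does not state one:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def dedupe_objects(boxes, thresh=0.5):
    """Greedily drop boxes that overlap an already-kept box too much
    (VG often labels the same object instance more than once)."""
    kept = []
    for b in boxes:
        if all(iou(b, k) < thresh for k in kept):
            kept.append(b)
    return kept
```

Two annotations of the same person with, say, 0.8 IoU collapse to one grounding, while spatially distinct objects survive the filter.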
As a result, combining both region-level and object-level groundings, about 700K out of 1M triplets in VG end up with valid grounding labels. We will make these labels publicly available.
5 Implementation Details
We build the attention supervision on top of the open-sourced implementations of MCB and MFB. As in those works, we extract the image features from the res5c layer of Resnet-152, resulting in a $14 \times 14$ spatial grid of 2048-dimensional features. We construct the ground-truth visual grounding labels as two glimpse maps per QA pair, where the first map is the object-level grounding and the second map is the region-level grounding, as discussed in Section 4. Let $\mathcal{B}$ be the set of selected object bounding boxes in the grounding labels, scaled down to the $14 \times 14$ grid; then the mined object-level attention map is
$$A^{obj}_{ij} = \mathbb{1}\left[(i, j) \in b \ \text{for some}\ b \in \mathcal{B}\right],$$
where $\mathbb{1}[\cdot]$ is the indicator function. Similarly, the region-level attention map $A^{reg}$ is obtained from the selected region bounding boxes.
Afterwards, $A^{obj}$ and $A^{reg}$ are spatially L1-normalized to represent probabilities and concatenated to form the ground-truth attention label $A^{gt}$.
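A plausible construction of the two glimpse maps from the mined bounding boxes, assuming image coordinates are scaled down to the 14 x 14 grid; the cell rounding convention and the `eps` guard for empty maps are implementation assumptions:

```python
import numpy as np

def box_to_attention(boxes, img_w, img_h, grid=14):
    """Binary grid map: cell (i, j) is 1 if it falls inside any selected
    bounding box after scaling image coordinates down to the grid."""
    a = np.zeros((grid, grid))
    for x1, y1, x2, y2 in boxes:
        j1, j2 = int(x1 / img_w * grid), int(np.ceil(x2 / img_w * grid))
        i1, i2 = int(y1 / img_h * grid), int(np.ceil(y2 / img_h * grid))
        a[i1:i2, j1:j2] = 1.0
    return a

def grounding_labels(obj_boxes, reg_boxes, img_w, img_h, grid=14, eps=1e-8):
    """Two-glimpse ground truth: object-level and region-level maps,
    each spatially L1-normalized, stacked as (2, grid, grid)."""
    maps = [box_to_attention(b, img_w, img_h, grid)
            for b in (obj_boxes, reg_boxes)]
    return np.stack([m / (m.sum() + eps) for m in maps])
```

For a 448 x 448 image, an object box covering the top-left quadrant activates the top-left 7 x 7 cells, which are then normalized to sum to one.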
The model is trained using a multi-task loss
$$L(\theta) = L_{CE}(a, \hat{a}; \theta) + \lambda(t)\, L_{KL}\!\left(A^{gt}, \alpha_v; \theta\right), \quad (3)$$
where $L_{CE}$ denotes the cross-entropy, $L_{KL}$ denotes the KL-divergence, and $\theta$ corresponds to the learned parameters. $\lambda(t)$ is a scalar that weights the two loss terms and decays as a function of the iteration number $t$. In particular, we choose a cosine-decay function:
$$\lambda(t) = \frac{\lambda_0}{2}\left(1 + \cos\frac{\pi t}{T}\right). \quad (4)$$
This is motivated by the fact that the visual grounding labels have some level of subjectivity. As an example, Figure 4 (second row) shows a case where the learned attention seems more accurate than the VQA-HAT ground truth. Hence, as the model learns suitable parameter values, we gradually loosen the penalty on the attention maps to give the model more freedom to selectively decide what attention to use. It is important to note that, for training samples in VQA-2.0 or VG that do not have region-level or object-level grounding labels, the attention term in Equation 3 is dropped, so the loss reduces to the classification term only. In our experiments, the decay schedule is calibrated for each tested model based on its number of training steps; in particular, we use one setting for all MCB models and another for the rest.
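One plausible realization of the cosine-decay weighting described above; the exact functional form, the initial weight `lam0`, and the step horizon are per-model hyperparameters that we treat as assumptions here:

```python
import math

def attention_weight(t, total_steps, lam0=1.0):
    """Cosine decay of the attention-loss weight from lam0 down to 0
    over the course of training; t is the current iteration."""
    t = min(t, total_steps)
    return lam0 * 0.5 * (1.0 + math.cos(math.pi * t / total_steps))
```

Early in training the attention supervision carries its full weight; by the final iterations the weight has decayed to zero, leaving only the classification term.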
VQA-2.0: The VQA-2.0 dataset consists of 204721 images, with a total of 1.1M questions and 10 crowd-sourced answers per question. There are more than 20 question types, covering a variety of topics and free-form answers. The dataset is split into training (82K images and 443K questions), validation (40K images and 214K questions), and testing (81K images and 448K questions) sets. The task is to predict a correct answer given a corresponding image-question pair. As a main advantage with respect to version 1.0, for every question VQA-2.0 includes complementary images that lead to different answers, reducing language bias by forcing the model to use the visual information.
Visual Genome: The Visual Genome (VG) dataset contains 108077 images, with an average of 17 QA pairs per image. We follow the processing scheme of prior work, where non-informative words in the questions and answers, such as "a" and "is", are removed. Afterwards, triplets whose answer is a single keyword that overlaps with the VQA-2.0 answer set are included in our training set. This adds 97697 images and about 1 million questions to the training set. Besides the VQA data, VG also provides on average 50 region descriptions and 30 object instances per image. Each region/object is annotated with one sentence/phrase description and bounding box coordinates.
VQA-HAT: The VQA-HAT dataset contains 58475 human attention (HAT) maps for triplets in the VQA-1.0 training set. Annotators were shown a blurred image together with a question-answer pair, and were asked to "scratch" the image until they believed someone else could answer the question by looking at the blurred image with the sharpened areas. The authors also collected HAT maps for the VQA-1.0 validation set, where each of 1374 triplets was labeled by three different annotators, so that one can compare the level of agreement among labels. We use VQA-HAT to evaluate visual grounding performance by comparing the rank-correlation between human attention and model attention, as in [5, 17].
VQA-X: The VQA-X dataset contains 2000 labeled attention maps in the VQA-2.0 validation set. In contrast to VQA-HAT, VQA-X attention maps take the form of instance segmentations, where annotators were asked to segment the objects and/or regions that most prominently justify the answer. Hence these attentions are more specific and localized. We use VQA-X to evaluate visual grounding performance by comparing the rank-correlation, as in [5, 17].
|Model|Rank-correlation (VQA-HAT)|Rank-correlation (VQA-X)|Accuracy (%)|
|Attn-MCB, $\lambda(t)=1$ (ours)|0.580|0.396|60.51|
Figure 4: Qualitative comparison of attention maps: VQA-HAT ground truth, MFH, and Attn-MFH (ours). Q: Is the computer on or off? A: on. Q: What color is the inside of the cat's ears? A: pink. Q: How many of these animals are there? A: 2.
We evaluate the performance of our proposed method using two criteria: (i) rank-correlation to evaluate visual grounding and (ii) accuracy to evaluate question answering. Intuitively, rank-correlation measures the similarity between human and model attention maps under a rank-based metric. A high rank-correlation means that the model is 'looking at' image areas that agree with the visual information used by a human to answer the same question. The accuracy of a predicted answer $a$ is evaluated by the standard VQA metric:
$$\mathrm{Acc}(a) = \min\left(\frac{\#\,\text{humans that provided answer } a}{3},\ 1\right).$$
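The standard VQA accuracy metric scores a prediction against the ten crowd-sourced answers; a minimal implementation:

```python
def vqa_accuracy(predicted, human_answers):
    """Standard VQA metric: an answer counts as fully correct when at
    least three of the (typically ten) annotators gave it; partial
    credit is given below that threshold."""
    matches = sum(a == predicted for a in human_answers)
    return min(matches / 3.0, 1.0)
```

So a prediction matching only one annotator scores 1/3, while matching three or more annotators scores a full 1.0.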
Table 1 reports our main results. Our models are built on top of prior works with the additional Attention Supervision Module described in Section 3. Specifically, we denote by Attn-* our adaptation of the respective model after including our Attention Supervision Module. We highlight that the MCB model is the winner of the VQA challenge 2016 and the MFH model is the best single model in the VQA challenge 2017. In Table 1, we observe that our proposed models achieve a significant boost in rank-correlation with respect to human attention. Furthermore, our models outperform alternative state-of-the-art techniques in terms of accuracy in answer prediction. Specifically, the rank-correlation for the MFH model increases by 36.4% when evaluated on the VQA-HAT dataset and by 7.7% when evaluated on VQA-X. This indicates that our proposed method enables VQA models to provide more meaningful and interpretable results by generating more accurate visual groundings.
Table 1 also reports the result of an experiment where the decaying factor in Equation 4 is fixed to a value of 1. In this case, the model is able to achieve higher rank-correlation, but accuracy drops by 2%. We observe that as training proceeds, attention loss becomes dominant in the final training steps, which affects the accuracy of the classification module.
Figure 4 shows qualitative results of the resulting visual grounding, including a comparison against a model trained without attention supervision.
In this work we have proposed a new method that is able to slightly outperform current state-of-the-art VQA systems, while also providing interpretable representations in the form of an explicitly trainable visual attention mechanism. Specifically, as a main result, our experiments provide evidence that the generated visual groundings achieve high correlation with respect to human-provided attention annotations, outperforming the correlation scores of previous works by a large margin.
As further contributions, we highlight two relevant insights of the proposed approach. On one side, by using attention labels as an auxiliary task, the proposed approach demonstrates that it is able to constrain the internal representation of the model in a way that fosters the encoding of interpretable representations of the underlying relations between the textual question and the input image. On the other side, the proposed approach demonstrates a method to leverage existing datasets with region descriptions and object labels to effectively supervise the attention mechanism in VQA applications, avoiding costly human labeling.
As future work, we believe that the superior visual grounding provided by the proposed method can play a relevant role to generate natural language explanations to justify the answer to a given visual question. This scenario will help to demonstrate the relevance of our technique as a tool to increase the capabilities of AI based technologies to explain their decisions.
Acknowledgements: This work was partially funded by Oppo, Panasonic and the Millennium Institute for Foundational Research on Data.
-  P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. Bottom-up and top-down attention for image captioning and VQA. CoRR, abs/1707.07998, 2017.
-  J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 39–48, 2016.
-  D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
-  S. Bird and E. Loper. NLTK: The Natural Language Toolkit. In Proceedings of the ACL 2004 on Interactive Poster and Demonstration Sessions, page 31. Association for Computational Linguistics, 2004.
-  A. Das, H. Agrawal, C. L. Zitnick, D. Parikh, and D. Batra. Human attention in visual question answering: Do humans and deep networks look at the same regions? CoRR, abs/1606.05589, 2016.
-  A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. CoRR, abs/1606.01847, 2016.
-  C. Gan, Y. Li, H. Li, C. Sun, and B. Gong. VQS: Linking segmentations to questions and answers for supervised attention in VQA and question-focused semantic segmentation. In Proc. IEEE Int. Conf. Comp. Vis., volume 3, 2017.
-  D. Geman, S. Geman, N. Hallonquist, and L. Younes. Visual turing test for computer vision systems. Proceedings of the National Academy of Sciences, 112(12):3618–3623, 2015.
-  Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
-  J. Kim, K. W. On, W. Lim, J. Kim, J. Ha, and B. Zhang. Hadamard product for low-rank bilinear pooling. CoRR, abs/1610.04325, 2016.
-  R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017.
-  J. Lu, J. Yang, D. Batra, and D. Parikh. Hierarchical question-image co-attention for visual question answering. In Advances In Neural Information Processing Systems, pages 289–297, 2016.
-  M. Malinowski and M. Fritz. A multi-world approach to question answering about real-world scenes based on uncertain input. In Advances in neural information processing systems, pages 1682–1690, 2014.
-  G. A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41, 1995.
-  H. Noh, P. Hongsuck Seo, and B. Han. Image question answering using convolutional neural network with dynamic parameter prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 30–38, 2016.
-  D. H. Park, L. A. Hendricks, Z. Akata, A. Rohrbach, B. Schiele, T. Darrell, and M. Rohrbach. Multimodal explanations: Justifying decisions and pointing to the evidence. CoRR, abs/1802.08129, 2018.
-  T. Qiao, J. Dong, and D. Xu. Exploring human-like attention supervision in visual question answering. In AAAI, 2018.
-  K. J. Shih, S. Singh, and D. Hoiem. Where to look: Focus regions for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4613–4621, 2016.
-  C. Spearman. The proof and measurement of association between two things. The American journal of psychology, 15(1):72–101, 1904.
-  D. Teney, P. Anderson, X. He, and A. v. d. Hengel. Tips and tricks for visual question answering: Learnings from the 2017 challenge. arXiv preprint arXiv:1708.02711, 2017.
-  Q. Wu, D. Teney, P. Wang, C. Shen, A. Dick, and A. van den Hengel. Visual question answering: A survey of methods and datasets. Computer Vision and Image Understanding, 163:21–40, 2017.
-  H. Xu and K. Saenko. Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. In European Conference on Computer Vision, pages 451–466. Springer, 2016.
-  Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. Stacked attention networks for image question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 21–29, 2016.
-  Z. Yu, J. Yu, J. Fan, and D. Tao. Multi-modal factorized bilinear pooling with co-attention learning for visual question answering. In ICCV, 2017.
-  Z. Yu, J. Yu, C. Xiang, J. Fan, and D. Tao. Beyond bilinear: Generalized multi-modal factorized high-order pooling for visual question answering. CoRR, abs/1708.03619, 2017.