Recently, Visual Question Answering (VQA) antol2015vqa has emerged as a challenging task that requires artificial intelligence (AI) systems to compute answers by jointly analyzing both natural language questions and visual content. State-of-the-art VQA systems fukui2016multimodal ; anderson2017bottom ; vqa-cp ; andreas2016neural ; hu2018explainable ; yang2016stacked ; wu2018joint ; selvaraju2019taking ; jiang2018pythia ; kim2018bilinear ; ramakrishnan2018overcoming achieve high performance when the training and test question-answer (QA) pairs are sampled from the same distribution. However, most of these systems fail to generalize to test data with a substantially different QA distribution. In particular, their performance drops catastrophically on the recently introduced Visual Question Answering under Changing Priors (VQA-CP) dataset vqa-cp . Strong language priors encourage systems to blindly capture superficial statistical correlations in the training QA pairs and simply output the most common answers, instead of reasoning about the relevant image regions on which a human would focus. For example, since about 40% of questions that begin with “what sport” have the answer “tennis”, systems tend to learn to output “tennis” for these questions regardless of image content.
A number of recent VQA systems trott2017interpretable ; zhang2019interpretable ; selvaraju2019taking ; qiao2018exploring learn to not only predict correct answers but also be “right for the right reasons” ross2017right ; selvaraju2019taking . These systems are trained to encourage the network to focus on regions in the image that humans have annotated as important (which we will refer to as “important regions”). However, the network often focuses on these important regions even when it produces a wrong answer. Previous approaches do nothing to actively discourage this phenomenon, which we have found occurs quite frequently. For example, as shown in Figure 1, we ask the VQA system “What is the man eating?”. The baseline system focuses on the banana but still predicts “hot dog”, because “hot dog” appears much more frequently in the training data. Worse, this error is hard to detect by analyzing only the correct answer “banana”, which has been successfully grounded in the image.
To address this issue, we present a “self-critical” approach that directly criticizes incorrect answers’ sensitivity to the important regions. First, for each QA, we determine the important region that most influences the network’s prediction of the correct answer. We then penalize the network for focusing on this region when its predicted answer for this question is wrong.
Our self-critical approach is end-to-end trainable and only requires that the base VQA system be differentiable with respect to the visual content, and thus can be applied to most current state-of-the-art systems. We investigated three approaches to determining important regions. First, like previous work trott2017interpretable ; zhang2019interpretable ; selvaraju2019taking ; qiao2018exploring , we used regions that humans have explicitly marked as important. However, providing such detailed annotations requires considerable extra human effort. We therefore also explored using human textual VQA explanations from the VQA-X park2018multimodal dataset to determine important objects, which are then grounded to important regions in the image. Finally, we tried determining important regions by only using objects mentioned in the question or answer and grounding them in the image, which requires no additional human annotation of the VQA training data.
We evaluate our approach using the UpDn VQA system anderson2017bottom on the VQA-CP dataset vqa-cp and achieve new state-of-the-art performance (the prior best is 47.7%): an overall score of 49.5% with VQA-X park2018multimodal textual explanations, 49.1% with VQA-HAT das2017human visual explanations, and 48.5% using just the objects mentioned in the questions and answers.
2 Related Work
2.1 Human Explanations for VQA
There are two main kinds of human explanations available for the most popular VQA dataset antol2015vqa : visual and textual explanations. The VQA-HAT dataset das2017human is a well-known visual explanation dataset that collects human attention maps by giving human experts blurred images and asking them to determine where to deblur in order to answer the given visual questions correctly. Alternatively, park2018multimodal present the VQA-X dataset, which associates each QA pair with a textual explanation that a human has provided to justify the answer to the given question. In this work, we utilize both of these kinds of explanations to provide the important regions.
2.2 Language Priors in VQA
Language priors vqa-cp ; goyal2017making in VQA refer to the fact that questions and answers are highly correlated. For instance, questions that begin with “How many” are usually answered by either “two” or “three”. These language priors allow VQA systems to take a shortcut when answering questions by only focusing on the questions without reasoning about the visual content. In order to prevent this, VQA v2 antol2015vqa balances the answer distribution so that there exist at least two similar images with different answers for each question. Recently, vqa-cp introduced a diagnostic reconfiguration of the VQA v2 dataset called VQA-CP, where the distribution of the QA pairs in the training set is significantly different from that in the test set. Most state-of-the-art VQA systems are found to rely heavily on language priors and experience a catastrophic performance drop on VQA-CP. We evaluate our approach on VQA-CP in order to demonstrate that it generalizes better and is less sensitive to distribution changes.
2.3 Improving VQA using Human Explanations
A desired property for VQA systems is to not only infer the correct answers to visual questions but also base the answers on image regions that a human believes are important, i.e., to be right for the right reasons. The VQA systems that address this issue can be classified into two categories. The first trend is to build a system whose model is inherently interpretable. For example, GVQA vqa-cp explicitly disentangles the vision and language components by introducing a separate visual concept verifier and answer cluster classifiers. The other trend is to align a system’s explanation to human experts’ explanations for the correct answers. zhang2019interpretable ; qiao2018exploring align the internal attention weights over the image to the human attention maps. The work most related to ours is HINT selvaraju2019taking , which enforces the system’s gradient-based importance scores for each detected object to have the same rankings as its human importance scores. In contrast to prior work, our approach not only encourages the systems to be sensitive to the important regions identified by humans, but also decreases the incorrect answers’ sensitivity to these regions.
In this section, we first introduce our base Bottom-up Top-down (UpDn) VQA system anderson2017bottom , a core building block of the VQA challenge winning entries in the last two years. Then, we describe the method to construct a proposal object set that covers the most influential objects on which a human would focus when answering the question.
3.1 Bottom-Up Top-Down VQA
A large number of previous VQA systems fukui2016multimodal ; ben2017mutan ; ramakrishnan2018overcoming utilize a trainable top-down attention mechanism over convolutional features to recognize relevant image regions. anderson2017bottom introduced a complementary bottom-up attention that first detects common objects and attributes so that the top-down attention can directly model the contribution of higher-level concepts. This UpDn approach is utilized in many recent works selvaraju2019taking ; wu2018faithful ; jiang2018pythia ; xie2019visual ; shah2019cycle and significantly improves VQA performance.
Technically, on the vision side, UpDn systems first extract a visual feature set $\mathcal{V} = \{v_i,\ i = 1, \dots, |\mathcal{V}|\}$ for each image, whose element $v_i$ is a feature vector for the $i$-th detected object. On the language side, UpDn systems sequentially encode each question $Q$ to produce a question vector $q$ using a standard single-layer GRU cho2014learning denoted by $\mathbf{h}$, i.e., $q = \mathbf{h}(Q)$. Let $f$ denote the answer prediction operator that takes both visual features and question features as input and predicts the confidence for each answer $a$ in the answer candidate set $\mathcal{A}$, i.e., $P(a \mid \mathcal{V}, Q) = f(\mathcal{V}, q)$. The VQA task is framed as a multi-label regression problem with the gold-standard soft scores as targets in order to be consistent with the evaluation metric. In particular, the standard binary cross entropy loss $\mathcal{L}_{vqa}$ is used to supervise the sigmoid-normalized outputs.
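As a rough illustration of the training objective described above, the following sketch computes a binary cross entropy loss between sigmoid-normalized answer scores and soft target scores. The function name `soft_bce_loss` and the reduction (sum over answers, mean over the batch) are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def soft_bce_loss(logits, soft_targets):
    """Binary cross entropy between sigmoid(logits) and soft targets in [0, 1].

    Assumed reduction: sum over answer candidates, mean over the batch.
    """
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(soft_targets, dtype=float)
    # Numerically stable form of -t*log(s) - (1-t)*log(1-s) with s = sigmoid(x):
    # max(x, 0) - x*t + log(1 + exp(-|x|))
    per_answer = (np.maximum(logits, 0.0)
                  - logits * targets
                  + np.log1p(np.exp(-np.abs(logits))))
    return per_answer.sum(axis=-1).mean()
```

With soft targets, an answer given by only some annotators contributes a fractional target rather than a hard 0 or 1, which is what makes the objective consistent with the soft VQA evaluation metric.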
3.2 Proposal Influential Object Set Construction
Our approach ideally requires identifying important regions that a human considers most critical in answering the question. However, directly obtaining such a clear set of influential objects from either visual or textual explanations is hard, as the visual explanations also highlight the neighboring objects around the most influential one, and grounding textual explanations in images is still an active research field. We relax this requirement by identifying a proposed set of influential objects for each QA pair. This set may be noisy and contain some irrelevant objects, but we assume that it at least includes the most relevant object. As previously mentioned, we explore three separate methods for constructing this proposal set, as described below:
Construction from Visual Explanations. Following HINT selvaraju2019taking , we use the VQA-HAT dataset das2017human as the visual explanation source. The HAT maps cover a set of image-question pairs corresponding to approximately 9% of the VQA-CP training and test sets. We also inherit HINT’s object scoring system, which is based on the normalized human attention map energy inside the proposal box relative to the normalized energy outside the box. We score each detected object from the bottom-up attention and build the potential object set by selecting the top-scoring objects.
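One way to picture the box scoring described above is the sketch below. It assumes the score takes the form E_in / (E_in + E_out), where E_in and E_out are the area-normalized attention energies inside and outside a box; this exact form, the helper names `box_importance` and `top_k_objects`, and the parameter `k` are assumptions for illustration only.

```python
import numpy as np

def box_importance(att_map, box):
    """Score a box by attention energy inside it relative to outside.

    att_map: 2-D array of non-negative human attention values.
    box: (x0, y0, x1, y1) pixel corners, with x1 and y1 exclusive.
    """
    x0, y0, x1, y1 = box
    inside = att_map[y0:y1, x0:x1]
    area_in = inside.size
    area_out = att_map.size - area_in
    e_in = inside.sum() / max(area_in, 1)                       # normalized energy inside
    e_out = (att_map.sum() - inside.sum()) / max(area_out, 1)   # normalized energy outside
    total = e_in + e_out
    return float(e_in / total) if total > 0 else 0.0

def top_k_objects(att_map, boxes, k):
    """Return indices of the k highest-scoring boxes and all scores."""
    scores = np.array([box_importance(att_map, b) for b in boxes])
    order = np.argsort(-scores, kind="stable")[:k]
    return order.tolist(), scores
```

Normalizing by area keeps large boxes from winning simply because they cover more pixels.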
Construction from Textual Explanations. Recently, park2018multimodal introduced a textual explanation dataset that annotates a set of image-question pairs corresponding to 5% of the entire VQA-CP dataset. To extract the potential object set, we first assign part-of-speech (POS) tags to each word in the explanation using the spaCy POS tagger spacy2 and extract the nouns in the sentence. Then, we select the detected objects for which the cosine similarity between the Glove embeddings pennington2014glove of their category names and those of any of the extracted nouns is greater than a fixed threshold. Finally, we select the objects with the highest similarity.
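The selection step above can be sketched as follows, assuming the nouns have already been extracted by a POS tagger and that word vectors are available in a plain dictionary. The names `proposal_from_nouns`, `embed`, `threshold`, and `k` are hypothetical, and the toy vectors in any usage stand in for real Glove embeddings.

```python
import numpy as np

def cosine_similarity(u, v):
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v) / denom if denom > 0 else 0.0

def proposal_from_nouns(object_names, nouns, embed, threshold, k):
    """Pick detected objects whose category name is close to any extracted noun.

    object_names: category name of each detected object, in detection order.
    nouns: nouns extracted from the explanation (or from the QA pair).
    embed: dict mapping a word to its embedding vector.
    Returns indices of at most k objects, most similar first.
    """
    scored = []
    for idx, name in enumerate(object_names):
        if name not in embed:
            continue
        sims = [cosine_similarity(embed[name], embed[n]) for n in nouns if n in embed]
        if sims and max(sims) > threshold:
            scored.append((max(sims), idx))
    # Highest similarity first; ties broken by detection order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]
```

The same routine applies unchanged when the nouns come from the question and answer instead of an explanation.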
Construction from Questions and Answers. Since the above explanations may not be available in other datasets, we also consider a simple way to extract the proposal object set from just the training QA pairs alone. The method is quite similar to the way we construct the potential set from textual explanations. The only difference is that instead of parsing the explanations, we parse the QA pairs and extract nouns from them.
In this section, we present our self-critical approach to prevent the most common answer from dominating the correct answer given the proposal sets of influential objects. Figure 2 shows the overview of our approach. Besides the UpDn VQA system (left top block), our approach contains two other components: we first recognize and strengthen the most influential objects (left bottom block), and then we criticize incorrect answers that are more highly ranked than the correct answer and try to make them less sensitive to these key objects (right block). As recent research suggests that gradient-based methods more faithfully represent the model’s decision-making process selvaraju2019taking ; zhang2018top ; wu2018dynamic ; jain2019attention , we use a modified GradCAM selvaraju2017grad to compute the sensitivity $S(a, v_i)$ of answer $a$ to the $i$-th object features $v_i$, as shown in Eq. 1:
$$S(a, v_i) := \left(\nabla_{v_i} P(a \mid \mathcal{V}, Q)\right)^{\top} \mathbf{1}, \tag{1}$$
where $P(a \mid \mathcal{V}, Q)$ is the predicted confidence of answer $a$ and $\mathbf{1}$ denotes a vector with all 1’s.
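To make the sensitivity score concrete, the sketch below estimates it for an arbitrary prediction function: it takes the gradient of one answer's confidence with respect to one object's feature vector and sums the gradient entries, which is the product with the all-ones vector. Central differences are used here only to keep the example dependency-free; an actual system would use automatic differentiation. `toy_predict`, `ALPHA`, and `W` are hypothetical stand-ins for a VQA model.

```python
import numpy as np

def sensitivities(predict, V, answer, eps=1e-5):
    """Estimate S(answer, v_i) for every object i with central differences.

    predict: maps an (objects x dims) feature array to per-answer confidences.
    """
    V = np.asarray(V, dtype=float)
    S = np.zeros(V.shape[0])
    for i in range(V.shape[0]):
        for d in range(V.shape[1]):
            step = np.zeros_like(V)
            step[i, d] = eps
            grad = (predict(V + step)[answer] - predict(V - step)[answer]) / (2 * eps)
            S[i] += grad  # summing over d equals the product with the all-ones vector
    return S

# Toy stand-in for a VQA model (hypothetical): fixed attention weights ALPHA
# pool the object features, and a linear layer W scores two answers.
ALPHA = np.array([0.7, 0.3])
W = np.array([[1.0, 2.0], [-1.0, 0.5]])

def toy_predict(V):
    pooled = ALPHA @ V
    return 1.0 / (1.0 + np.exp(-(W @ pooled)))
```

Note that the score is signed: an object can push an answer's confidence down as well as up.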
4.1 Recognizing and Strengthening Influential Objects
Given a proposal object set $\mathcal{I}$ and the entire detected object set $\mathcal{V}$, we identify the object that the correct answer is most sensitive to and further strengthen its sensitivity. We first introduce a sensitivity violation term $\mathcal{SV}(a, v_i, v_j)$ for answer $a$ and the $i$-th and $j$-th object features $v_i$ and $v_j$ as the amount by which the sensitivity to $v_j$ surpasses the sensitivity to $v_i$, as shown in Eq. 2:
$$\mathcal{SV}(a, v_i, v_j) = \max\left(S(a, v_j) - S(a, v_i),\ 0\right), \tag{2}$$
where $S(a, v)$ is the sensitivity defined in Eq. 1.
Based on the assumption that the proposal set contains at least one influential object that a human would use to infer the answer, we impose the constraint that the most sensitive object in the proposal set should not be less sensitive than any object outside the proposal set. Therefore, we introduce the influence strengthen loss $\mathcal{L}_{infl}$ in Eq. 3:
$$\mathcal{L}_{infl} = \min_{v_i \in \mathcal{I}} \sum_{v_j \in \mathcal{V} \setminus \mathcal{I}} \mathcal{SV}(a_{gt}, v_i, v_j), \tag{3}$$
where $a_{gt}$ denotes the ground truth answer. The key differences between our influence strengthen loss and the ranking-based HINT loss are that (1) we relax the unnecessary constraint that the objects should follow the exact human ranking, and (2) it is easier to adapt to different types of explanation (e.g., textual explanations) where such detailed rankings are not available.
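The constraint described above can be sketched numerically as follows, assuming the correct answer's sensitivity to every detected object is already available as a vector. For each proposal object, the sketch sums how much each outside object's sensitivity exceeds it, then takes the minimum over the proposal set. The function name and the choice to also return the selected object index are assumptions for this example.

```python
import numpy as np

def influence_strengthen_loss(sens_gt, proposal_idx):
    """Hinge-style loss over sensitivities of the correct answer.

    sens_gt: sensitivity of the correct answer to each detected object.
    proposal_idx: indices of the objects in the proposal set.
    Returns (loss, index of the proposal object achieving the minimum).
    """
    sens_gt = np.asarray(sens_gt, dtype=float)
    inside = list(proposal_idx)
    inside_set = set(inside)
    outside = [j for j in range(len(sens_gt)) if j not in inside_set]
    # violations[i, j] = max(S(outside_j) - S(inside_i), 0)
    violations = np.maximum(sens_gt[outside][None, :] - sens_gt[inside][:, None], 0.0)
    per_candidate = violations.sum(axis=1)
    best = int(np.argmin(per_candidate))
    return float(per_candidate[best]), inside[best]
```

The loss is zero as soon as a single proposal object is at least as influential as every object outside the set, so no full ranking of objects is required.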
4.2 Criticizing Incorrect Dominant Answers
Next, for the incorrect answers ranked higher than the correct answer, we attempt to decrease their sensitivity to the influential objects. For example, in VQA-CP, bedrooms are the most common room type. Therefore, during testing, systems frequently misclassify bathrooms (which are rare in the training data) as bedrooms. Since humans identify a sink as an influential object when identifying bathrooms, we want to decrease the influence of sinks on concluding “bedroom”.
In order to address this issue, we design a self-critical objective to criticize the VQA systems’ incorrect but competitive decisions based on the most influential object $v^*$ to which the correct answer is most sensitive, as defined in Eq. 4:
$$v^* = \arg\min_{v_i \in \mathcal{I}} \sum_{v_j \in \mathcal{V} \setminus \mathcal{I}} \mathcal{SV}(a_{gt}, v_i, v_j). \tag{4}$$
Specifically, we extract a bucket $\mathcal{B}$ of at most $B$ predictions with higher confidence than the correct answer and utilize the proposed self-critical loss $\mathcal{L}_{crit}$ to directly minimize the weighted sensitivities of the answers in the bucket to the selected most influential object, as shown in Eq. 5:
$$\mathcal{L}_{crit} = \sum_{a \in \mathcal{B}} w(a)\left(S(a, v^*) - S(a_{gt}, v^*)\right), \tag{5}$$
where $a_{gt}$ denotes the ground truth answer. Because several answer candidates could be similar, we weight the sensitivity gaps in Eq. 5 by the cosine distance between the answers’ Glove embeddings pennington2014glove , $w(a) = \mathrm{cosine\_dist}(\mathrm{Glove}(a_{gt}), \mathrm{Glove}(a))$. In the multi-word answer case, the Glove embeddings of these answers are computed as the sum of the individual words’ Glove embeddings.
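A minimal numerical sketch of the self-critical objective follows. It assumes per-answer confidences, each answer's sensitivity to the selected most influential object, and answer embeddings are already computed. The bucket holds the wrong answers that outrank the correct one, and each sensitivity gap is weighted by the cosine distance between the answer embeddings. All names here are hypothetical.

```python
import numpy as np

def cosine_distance(u, v):
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return 1.0 - float(u @ v) / denom if denom > 0 else 1.0

def self_critical_loss(probs, sens_vstar, gt, answer_vecs, bucket_size):
    """Penalize competitive wrong answers for relying on the key object.

    probs: confidence of each answer candidate.
    sens_vstar: sensitivity of each answer to the most influential object.
    gt: index of the correct answer.
    answer_vecs: embedding of each answer (summed word vectors for multi-word answers).
    """
    probs = np.asarray(probs, dtype=float)
    ranked = np.argsort(-probs, kind="stable")
    # Wrong answers with higher confidence than the correct one, capped at bucket_size.
    bucket = [int(a) for a in ranked if probs[a] > probs[gt]][:bucket_size]
    loss = 0.0
    for a in bucket:
        weight = cosine_distance(answer_vecs[gt], answer_vecs[a])
        loss += weight * (sens_vstar[a] - sens_vstar[gt])
    return loss, bucket
```

Near-synonyms of the correct answer receive a weight close to zero, so the system is not punished for grounding an equally acceptable answer in the same object.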
4.3 Implementation and Training Details
In this section, we describe the detailed implementation and training procedure of our self-critical approach to VQA.
Training Details. We first pre-train our base UpDn VQA system on the VQA-CP training set using standard VQA loss (binary cross entropy loss with soft scores as supervision) with the Adam optimizer kingma2014adam for 10 epochs. As suggested in teney2017tips , the learning rate is fixed to 10e-3 with a batch size of 512 during the pre-training process, and we use hidden units in the base UpDn VQA system. After that, we finetune the VQA system with the joint loss for another epochs using the Adam optimizer with a learning rate of 10e-4 and a batch size of 512.
In our experiments, we also vary the self-critical loss weight to analyze the impact of the self-critical loss. The bucket size of the competitive answers is set to 5 because we observed that the top-5 overall score of the pre-trained system on the VQA-CP dataset reaches 80.4%, and increasing the bucket size only marginally improves the score.
Implementation. We implemented our approach on top of the publicly available reimplementation (https://github.com/hengyuan-hu/bottom-up-attention-vqa) of the original UpDn system. The base system utilizes a Faster R-CNN head girshick2015fast in conjunction with a ResNet-101 base network he2016deep as the object detection module. The detection head is pre-trained on the Visual Genome dataset krishna2017visual and is capable of detecting object categories and attributes. UpDn takes the final detection outputs and performs non-maximum suppression (NMS) for each object category using an IoU threshold of . Then, the convolutional features for the top objects are extracted for each image as the visual features, a dimensional vector for each object. For question embedding, following anderson2017bottom , we perform standard text pre-processing and tokenization. In particular, questions are first converted to lower case and then trimmed to a maximum of words, and the words that appear less than times are replaced with an “unk” token. A single-layer GRU cho2014learning is used to sequentially process the word vectors and produce a sentential representation for the pre-processed question. We also use Glove vectors pennington2014glove to initialize the word embedding matrix when embedding the questions.
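The question pre-processing described above can be sketched as below. The tokenizer is a deliberate simplification, and `max_len` and `min_count` are left as parameters because their values are not given here; all function names are hypothetical.

```python
from collections import Counter

def tokenize(question):
    # Simplified tokenizer (assumption): lowercase, drop "?" and ",", split on whitespace.
    return question.lower().replace("?", " ").replace(",", " ").split()

def build_vocab(questions, min_count):
    """Keep the words that appear at least min_count times in the training questions."""
    counts = Counter(tok for q in questions for tok in tokenize(q))
    return {tok for tok, n in counts.items() if n >= min_count}

def preprocess(question, vocab, max_len):
    """Lowercase, trim to max_len tokens, and map rare words to "unk"."""
    tokens = tokenize(question)[:max_len]
    return [tok if tok in vocab else "unk" for tok in tokens]
```

For example, with a vocabulary built from training questions, an unseen word such as a rare object name is mapped to "unk" before the tokens are embedded and passed to the GRU.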
5 Experimental Results
We conduct experiments mainly on the VQA-CP (Visual Question Answering under Changing Priors) vqa-cp dataset, where the QA pairs in the training data and test data have significantly different distributions. We also present experimental results on the VQA v2 validation set for completeness. We compare our self-critical system’s VQA performance with state-of-the-art systems via the standard evaluation metric. After that, we perform ablation studies to verify the contribution of strengthening the influential objects and criticizing competitive answers. Finally, we show some qualitative examples to illustrate the effectiveness of criticizing the false answers’ sensitivity.
[Table 1. Columns: Expl. | VQA-CP v2 test | VQA v2 val]
5.1 VQA Performance on VQA-CP and VQA v2 dataset
We first report the quantitative results of the VQA generalization task and compare our results with the state-of-the-art methods in this section. We also report our system’s performance on the balanced VQA v2 validation set for completeness.
As demonstrated in Table 1, our system significantly outperforms other state-of-the-art systems (e.g., HINT selvaraju2019taking ) in overall score when using the same amount of human visual explanation (VQA-HAT), which indicates the effectiveness of directly criticizing the competitive answers’ sensitivity to the most influential objects. We also observe that using human textual explanations as supervision is somewhat more effective. With only about half the number of explanations compared to VQA-HAT, these textual explanations help the VQA system to perform better by 0.3% overall score, achieving the new state-of-the-art score of 49.5%.
In addition, we observe that our self-critical objective helps the VQA system especially in the “Yes/No” and “Other” question categories; however, it does not do well in the “Num” category. This is understandable because counting problems are generally harder than the other two types and require the VQA system to jointly consider all the objects. Therefore, criticizing only the most sensitive ones does not improve the performance.
On the VQA v2 validation set, the VQA systems trained using our self-critical loss perform competitively with previous approaches. This indicates that criticizing the wrong answers’ sensitivities does not hurt performance substantially even when the training and test data have the same distribution.
5.2 Ablation Study on the Self-Critical Loss
In this section, we evaluate the impact of varying the weight of the self-critical loss on the VQA-CP test data using VQA-HAT visual explanations. Results are in Table 2. We observe that even without the influence strengthen loss to strengthen the most influential objects for the correct answer, our self-critical approach helps the UpDn VQA system improve its overall score. Combined with the influence strengthen loss, our approach, which includes the self-critical loss, sets a new state-of-the-art score on the VQA-CP test set using visual explanations. We also notice that our approach is robust to changes in the weight of the self-critical loss and consistently improves VQA performance for a wide range of loss weights, as demonstrated in Table 2.
[Table 2. Columns: Expl. | VQA-CP v2 test]
5.3 Effectiveness of Criticizing False Sensitivity
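In this section, we quantitatively evaluate the effectiveness of the proposed self-critical objective. In particular, we evaluate the fraction of false sensitivity, where the predicted incorrect answer’s sensitivity to the influential object (to which the correct answer is most sensitive) is greater than the correct answer’s sensitivity.
We formally define the false sensitivity rate in Eq. 6:
$$\mathrm{FSR} = \frac{1}{N} \sum_{(Q, \mathcal{V})} \mathbb{1}\left[S(a_{pred}, v^*) - S(a_{gt}, v^*) > 0\right], \tag{6}$$
where the sum runs over the $N$ test QA pairs, $a_{pred}$ and $a_{gt}$ are the predicted and ground truth answers, $v^*$ is the most influential object for the correct answer, and $\mathbb{1}[\cdot]$ denotes the function that returns 1 if the condition is satisfied and returns 0 otherwise.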
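The false sensitivity rate described above can be computed with a short routine such as the following. It assumes that, for every test QA pair, the sensitivities of the predicted and the correct answer to the most influential object have already been computed; the function and argument names are hypothetical.

```python
import numpy as np

def false_sensitivity_rate(sens_pred, sens_gt, pred, gt):
    """Fraction of QA pairs with a wrong prediction that is more sensitive
    to the key object than the correct answer is.

    sens_pred[n]: sensitivity of the predicted answer to the key object.
    sens_gt[n]:   sensitivity of the correct answer to the same object.
    pred[n], gt[n]: predicted and correct answer indices.
    """
    sens_pred = np.asarray(sens_pred, dtype=float)
    sens_gt = np.asarray(sens_gt, dtype=float)
    wrong = np.asarray(pred) != np.asarray(gt)
    # When the prediction is correct the two sensitivities coincide,
    # so the explicit `wrong` mask only makes the intent visible.
    false_sensitive = wrong & (sens_pred > sens_gt)
    return float(false_sensitive.mean())
```

The rate is taken over all QA pairs, including correctly answered ones, which matches reporting it "among all the test QA pairs".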
For the original UpDn VQA system, we observe a false sensitivity rate of 35.5% among all the test QA pairs in the VQA-CP data. After the self-critical training, the false sensitivity rate drops to 20.4% using the VQA-HAT visual explanations and to 19.6% using VQA-X textual explanations. This indicates that false sensitivity is a common problem in VQA systems and demonstrates the utility of criticizing it.
Some sampled examples of how our self-critical approach mitigates false sensitivity are shown in Figure 3. We observe that the most influential object (for the correct answer) becomes more influential, which we attribute to the influence strengthening part. More importantly, we observe that this object’s influence on the incorrect answer decreases and sometimes falls below other objects.
[Figure 3. Panels: UpDn | UpDn + QA | UpDn + HAT | UpDn + VQA-X]
6 Conclusion and Future Work
In this work, we have explored how to improve VQA performance by criticizing the sensitivity of incorrect answers to the most influential object for the correct answer. Our “self-critical” approach helps VQA systems generalize to test data where the distribution of question-answer pairs is significantly different from the training data. The influential objects are selected from a proposal set extracted from human visual or textual explanations, or simply from the objects mentioned in the questions and answers. Our approach outperforms the state-of-the-art VQA systems on the VQA-CP dataset by a clear margin even without human explanations as additional supervision. In the future, we would like to combine the visual and the textual explanations to better train VQA systems. This is difficult because the proposal object sets of these two types of explanations contain different types of noise (i.e., question-irrelevant objects), and therefore different biases.
- (1) A. Agrawal, D. Batra, D. Parikh, and A. Kembhavi. Don’t just assume; look and answer: Overcoming priors for visual question answering. In , 2018.
- (2) P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. Bottom-Up and Top-Down Attention for Image Captioning and VQA. In CVPR, volume 3, page 6, 2018.
- (3) J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 39–48, 2016.
- (4) S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh. VQA: Visual Question Answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433, 2015.
- (5) H. Ben-Younes, R. Cadene, M. Cord, and N. Thome. Mutan: Multimodal tucker fusion for visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 2612–2620, 2017.
- (6) K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation. arXiv preprint arXiv:1406.1078, 2014.
- (7) A. Das, H. Agrawal, L. Zitnick, D. Parikh, and D. Batra. Human attention in visual question answering: Do humans and deep networks look at the same regions? Computer Vision and Image Understanding, 163:90–100, 2017.
- (8) A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding. EMNLP, 2016.
- (9) Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6904–6913, 2017.
- (10) K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- (11) M. Honnibal and I. Montani. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear, 2017.
- (12) R. Hu, J. Andreas, T. Darrell, and K. Saenko. Explainable neural computation via stack neural module networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 53–69, 2018.
- (13) S. Jain and B. C. Wallace. Attention is not explanation. arXiv preprint arXiv:1902.10186, 2019.
- (14) Y. Jiang, V. Natarajan, X. Chen, M. Rohrbach, D. Batra, and D. Parikh. Pythia v0.1: the winning entry to the VQA challenge 2018. arXiv preprint arXiv:1807.09956, 2018.
- (15) J.-H. Kim, J. Jun, and B.-T. Zhang. Bilinear attention networks. In Advances in Neural Information Processing Systems, pages 1564–1574, 2018.
- (16) D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. In ICLR, 2015.
- (17) R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. International Journal of Computer Vision, 123(1):32–73, 2017.
- (18) D. H. Park, L. A. Hendricks, Z. Akata, A. Rohrbach, B. Schiele, T. Darrell, and M. Rohrbach. Multimodal Explanations: Justifying Decisions and Pointing to the Evidence. In CVPR, 2018.
- (19) J. Pennington, R. Socher, and C. Manning. Glove: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
- (20) T. Qiao, J. Dong, and D. Xu. Exploring human-like attention supervision in visual question answering. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- (21) S. Ramakrishnan, A. Agrawal, and S. Lee. Overcoming language priors in visual question answering with adversarial regularization. In Advances in Neural Information Processing Systems, pages 1541–1551, 2018.
- (22) S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards Real-time Object Detection with Region Proposal Networks. In NIPS, pages 91–99, 2015.
- (23) A. S. Ross, M. C. Hughes, and F. Doshi-Velez. Right for the right reasons: Training differentiable models by constraining their explanations. arXiv preprint arXiv:1703.03717, 2017.
- (24) R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, et al. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In ICCV, pages 618–626, 2017.
- (25) R. R. Selvaraju, S. Lee, Y. Shen, H. Jin, D. Batra, and D. Parikh. Taking a hint: Leveraging explanations to make vision and language models more grounded. arXiv preprint arXiv:1902.03751, 2019.
- (26) M. Shah, X. Chen, M. Rohrbach, and D. Parikh. Cycle-consistency for robust visual question answering. arXiv preprint arXiv:1902.05660, 2019.
- (27) D. Teney, P. Anderson, X. He, and A. v. d. Hengel. Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge. arXiv preprint arXiv:1708.02711, 2017.
- (28) A. Trott, C. Xiong, and R. Socher. Interpretable counting for visual question answering. arXiv preprint arXiv:1712.08697, 2017.
- (29) J. Wu, Z. Hu, and R. J. Mooney. Joint image captioning and question answering. arXiv preprint arXiv:1805.08389, 2018.
- (30) J. Wu, D. Li, Y. Yang, C. Bajaj, and X. Ji. Dynamic Filtering with Large Sampling Field for Convnets. ECCV, 2018.
- (31) J. Wu and R. J. Mooney. Faithful multimodal explanation for visual question answering. arXiv preprint arXiv:1809.02805, 2018.
- (32) N. Xie, F. Lai, D. Doran, and A. Kadav. Visual entailment: A novel task for fine-grained image understanding. arXiv preprint arXiv:1901.06706, 2019.
- (33) Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. Stacked attention networks for image question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 21–29, 2016.
- (34) J. Zhang, S. A. Bargal, Z. Lin, J. Brandt, X. Shen, and S. Sclaroff. Top-down neural attention by excitation backprop. International Journal of Computer Vision, 126(10):1084–1102, 2018.
- (35) Y. Zhang, J. C. Niebles, and A. Soto. Interpretable visual question answering by visual grounding from attention supervision mining. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 349–357. IEEE, 2019.