Self-Critical Reasoning for Robust Visual Question Answering

05/24/2019 ∙ by Jialin Wu, et al. ∙ The University of Texas at Austin

Visual Question Answering (VQA) deep-learning systems tend to capture superficial statistical correlations in the training data because of strong language priors and fail to generalize to test data with a significantly different question-answer (QA) distribution. To address this issue, we introduce a self-critical training objective that ensures that visual explanations of correct answers match the most influential image regions more than those of other competitive answer candidates. The influential regions are either determined from human visual/textual explanations or automatically from just significant words in the question and answer. We evaluate our approach on the VQA generalization task using the VQA-CP dataset, achieving a new state of the art: 49.5% using textual explanations and 48.5% using automatically annotated regions.




1 Introduction

Recently, Visual Question Answering (VQA) antol2015vqa has emerged as a challenging task that requires artificial intelligence (AI) systems to compute answers by jointly analyzing both natural language questions and visual content. The state-of-the-art VQA systems fukui2016multimodal ; anderson2017bottom ; vqa-cp ; andreas2016neural ; hu2018explainable ; yang2016stacked ; wu2018joint ; selvaraju2019taking ; jiang2018pythia ; kim2018bilinear ; ramakrishnan2018overcoming achieve high performance when the training and test question-answer (QA) pairs are sampled from the same distribution. However, most of these systems fail to generalize to test data with a substantially different QA distribution. In particular, their performance drops catastrophically on the recently introduced Visual Question Answering under Changing Priors (VQA-CP) vqa-cp dataset. The strong language priors encourage systems to blindly capture superficial statistical correlations in the training QA pairs and simply output the most common answers, instead of reasoning about the relevant image regions on which a human would focus. For example, since about 40% of questions that begin with “what sport” have the answer “tennis”, systems tend to learn to output “tennis” for these questions regardless of image content.

A number of recent VQA systems trott2017interpretable ; zhang2019interpretable ; selvaraju2019taking ; qiao2018exploring learn to not only predict correct answers but also be “right for the right reasons” ross2017right ; selvaraju2019taking . These systems are trained to encourage the network to focus on regions in the image that humans have somehow annotated as important (which we will refer to as “important regions.”). However, many times the network also focuses on these important regions even when it produces a wrong answer. Previous approaches do nothing to actively discourage this phenomenon, which we have found occurs quite frequently. For example, as shown in Figure 1, we ask the VQA system “What is the man eating?”. The baseline system predicts “hot dog” but focuses on the banana because hot dog appears much more frequently in the training data. What’s worse, this error is hard to detect when only analyzing the correct answer “banana” that has been successfully grounded in the image.

Figure 1: Example of the common answer misleading the prediction even though the VQA system has the right reasons for the correct answer. Figure (a) shows the important regions extracted from human visual attention. Figure (b), (e) show the answers’ distribution of the question “What is the man eating?” in the training and test dataset. Figure (c), (d) show the most influential region for the prediction “hot dog” and “banana” using the baseline UpDn VQA system and Figure (f), (g) show the influential region for the prediction “hot dog” and “banana” using the VQA system after being trained with our self-critical objective. The number around the bounding box shows the answer’s sensitivity to the object.

To address this issue, we present a “self-critical” approach that directly criticizes incorrect answers’ sensitivity to the important regions. First, for each QA, we determine the important region that most influences the network’s prediction of the correct answer. We then penalize the network for focusing on this region when its predicted answer for this question is wrong.

Our self-critical approach is end-to-end trainable and only requires that the base VQA system be differentiable to the visual content, and thus can be applied to most current state-of-the-art systems. We investigated three approaches to determining important regions. First, like the previous work trott2017interpretable ; zhang2019interpretable ; selvaraju2019taking ; qiao2018exploring , we used regions that humans have explicitly marked as important. However, this requires a fair bit of extra human effort to provide such detailed annotations. So we also explored using human textual VQA explanations from the VQA-X park2018multimodal dataset to determine important objects which are then grounded to important regions in the image. Finally, we tried determining important regions by only using objects mentioned in the question or answer and grounding them in the image, which requires no additional human annotation of the VQA training data.

We evaluate our approach using the UpDn VQA system anderson2017bottom on the VQA-CP dataset vqa-cp and achieve a new state-of-the-art performance (the previous best being 47.7%): 49.5% overall score with VQA-X park2018multimodal textual explanations, 49.1% with VQA-HAT das2017human visual explanations and 48.5% using just the objects mentioned in the questions and answers.

2 Related Work

2.1 Human Explanations for VQA

There are two main kinds of human explanations available for the most popular VQA dataset antol2015vqa , visual and textual explanations. The VQA-HAT dataset das2017human is a well-known visual explanation dataset that collects human attention maps by giving human experts blurred images and asking them to determine where to deblur in order to answer the given visual questions correctly. Alternatively, park2018multimodal present the VQA-X dataset that associates a textual explanation with each QA pair which a human has provided to justify an answer to a given question. In this work, we utilize both of these kinds of explanations to provide the important regions.

2.2 Language Priors in VQA

Language priors vqa-cp ; goyal2017making in VQA refer to the fact that questions and answers are highly correlated. For instance, questions that begin with “How many” are usually answered by either “two” or “three”. These language priors allow VQA systems to take a shortcut when answering questions by only focusing on the questions without reasoning about the visual content. In order to prevent this, VQA v2 antol2015vqa balances the answer distribution so that there exist at least two similar images with different answers for each question. Recently, vqa-cp introduced a diagnostic reconfiguration of the VQA v2 dataset called VQA-CP, where the distribution of the QA pairs in the training set is significantly different from that in the test set. Most state-of-the-art VQA systems are found to rely heavily on language priors and experience a catastrophic performance drop on VQA-CP. We evaluate our approach on VQA-CP in order to demonstrate that it generalizes better and is less sensitive to distribution changes.

2.3 Improving VQA using Human Explanations

A desired property for VQA systems is to not only infer the correct answers to visual questions but also base the answer on image regions that a human believes are important, i.e. to be right for the right reasons. The VQA systems that address this issue can be classified into two categories. The first trend is to build a system whose model is inherently interpretable. For example, GVQA vqa-cp explicitly disentangles the vision and language components by introducing a separate visual concept verifier and answer cluster classifiers. The other trend is to align a system’s explanation to human experts’ explanations for the correct answers. zhang2019interpretable ; qiao2018exploring align the internal attention weights over the image to the human attention maps. The work most related to ours is HINT selvaraju2019taking , which enforces the system’s gradient-based importance scores for each detected object to have the same rankings as its human importance scores. In contrast to prior work, our approach not only encourages the systems to be sensitive to the important regions identified by humans, but also decreases the incorrect answers’ sensitivity to these regions.

3 Preliminaries

In this section, we first introduce our base Bottom-up Top-down (UpDn) VQA system anderson2017bottom , a core building block of the VQA challenge winning entries in the last two years. Then, we describe the method to construct a proposal object set that covers the most influential objects on which a human would focus when answering the question.

3.1 Bottom-Up Top-Down VQA

A large number of previous VQA systems fukui2016multimodal ; ben2017mutan ; ramakrishnan2018overcoming utilize a trainable Top-Down attention mechanism over convolutional features to recognize relevant image regions. anderson2017bottom introduced a complementary bottom-up attention that first detects common objects and attributes so that the top down attention can directly model the contribution of higher level concepts. This UpDn approach is utilized in lots of recent works selvaraju2019taking ; wu2018faithful ; jiang2018pythia ; xie2019visual ; shah2019cycle and significantly improves VQA performance.

Technically, on the vision side, UpDn systems first extract a visual feature set $\mathcal{V} = \{v_1, \dots, v_{|\mathcal{V}|}\}$ for each image, whose element $v_i$ is a feature vector for the $i$-th detected object. On the language side, UpDn systems sequentially encode each question $Q$ to produce a question vector $q$ using a standard single-layer GRU cho2014learning denoted by $h$, i.e. $q = h(Q)$. Let $f$ denote the answer prediction operator that takes both visual features and question features as input and predicts the confidence for each answer $a$ in the answer candidate set $\mathcal{A}$, i.e. $P(a \mid \mathcal{V}, Q) = f(\mathcal{V}, q)$. The VQA task is framed as a multi-label regression problem with the gold-standard soft scores as targets in order to be consistent with the evaluation metric. In particular, the standard binary cross entropy loss $\mathcal{L}_{vqa}$ is used to supervise the sigmoid-normalized outputs.
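The training target described above can be illustrated with a small sketch. It assumes the commonly used VQA soft score, min(number of annotators giving the answer / 3, 1), which the text does not spell out; the function names, the toy answers, and the logits are all hypothetical and are not taken from the UpDn code base.

```python
import math

def soft_score(answer, human_answers):
    """Assumed VQA-style soft score: min(#annotators giving this answer / 3, 1)."""
    return min(human_answers.count(answer) / 3.0, 1.0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_with_soft_targets(logits, targets, eps=1e-12):
    """Mean binary cross entropy between sigmoid(logits) and soft targets in [0, 1]."""
    total = 0.0
    for z, t in zip(logits, targets):
        p = sigmoid(z)
        total -= t * math.log(p + eps) + (1.0 - t) * math.log(1.0 - p + eps)
    return total / len(logits)

# Toy example: ten annotators, three answer candidates (hypothetical data).
candidates = ["banana", "hot dog", "apple"]
human_answers = ["banana"] * 8 + ["hot dog"] * 2
targets = [soft_score(a, human_answers) for a in candidates]

# Logits that favour "banana" should incur a lower loss than logits that favour "hot dog".
loss_good = bce_with_soft_targets([4.0, 0.5, -4.0], targets)
loss_bad = bce_with_soft_targets([-4.0, 4.0, -4.0], targets)
```

Because the targets are soft scores rather than one-hot labels, an answer given by only a few annotators still receives partial credit, which is what makes the loss consistent with the evaluation metric.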

3.2 Proposal Influential Object Set Construction

Our approach ideally requires identifying important regions that a human considers most critical in answering the question. However, directly obtaining such a clear set of influential objects from either visual or textual explanations is hard, as the visual explanations also highlight the neighbouring objects around the most influential one and grounding textual explanations in images is still an active research field. We relax this requirement by identifying a proposed set of influential objects for each QA pair. This set may be noisy and contain some irrelevant objects, but we assume that it at least includes the most relevant object. As previously mentioned, we explore three separate methods for constructing this proposal set, as described below:

Construction from Visual Explanations. Following HINT selvaraju2019taking , we use the VQA-HAT dataset das2017human as the visual explanation source. The HAT maps cover image-question pairs corresponding to approximately 9% of the VQA-CP training and test sets. We also inherit HINT’s object scoring system, which is based on the normalized human attention map energy inside the proposal box relative to the normalized energy outside the box. We score each detected object from the bottom-up attention and build the potential object set by selecting the top-scoring objects.

Construction from Textual Explanations. Recently, park2018multimodal introduced a textual explanation dataset that annotates image-question pairs corresponding to 5% of the entire VQA-CP dataset. To extract the potential object set, we first assign part-of-speech (POS) tags to each word in the explanation using the spaCy POS tagger spacy2 and extract the nouns in the sentence. Then, we select the detected objects for which the cosine similarity between the Glove embeddings pennington2014glove of their category names and those of any of the extracted nouns exceeds a similarity threshold. Finally, we select the objects with the highest similarity.

Construction from Questions and Answers. Since the above explanations may not be available in other datasets, we also consider a simple way to extract the proposal object set from just the training QA pairs alone. The method is quite similar to the way we construct the potential set from textual explanations. The only difference is that instead of parsing the explanations, we parse the QA pairs and extract nouns from them.
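The noun-matching step shared by the textual-explanation and QA-based constructions can be sketched as follows. This is a minimal illustration, not the authors' code: the three-dimensional embeddings stand in for Glove vectors, the nouns are assumed to have already been extracted by a POS tagger, and the `threshold` and `top_k` values are hypothetical placeholders.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def build_proposal_set(detected, nouns, embed, threshold=0.6, top_k=2):
    """Select detected objects whose category name is close to any extracted noun.

    detected: list of (object_index, category_name) pairs from the detector.
    nouns: nouns extracted from the explanation or the QA pair.
    embed: dict mapping a word to its embedding vector (toy stand-in for Glove).
    threshold, top_k: hypothetical values chosen only for this example.
    """
    scored = []
    for idx, category in detected:
        if category not in embed:
            continue
        best = max(
            (cosine_similarity(embed[category], embed[n]) for n in nouns if n in embed),
            default=0.0,
        )
        if best > threshold:
            scored.append((best, idx))
    # Highest similarity first; ties broken by object index for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:top_k]]

# Toy embeddings (hypothetical values).
embed = {
    "banana": [1.0, 0.0, 0.0],
    "fruit": [0.9, 0.1, 0.0],
    "man": [0.0, 1.0, 0.0],
    "person": [0.1, 0.9, 0.1],
    "table": [0.0, 0.0, 1.0],
}
detected = [(0, "person"), (1, "fruit"), (2, "table")]
nouns = ["man", "banana"]  # e.g. from "What is the man eating?" / "banana"

proposal = build_proposal_set(detected, nouns, embed)
top_one = build_proposal_set(detected, nouns, embed, top_k=1)
```

In this toy case the "table" detection is dropped because it is dissimilar to every noun, while "fruit" and "person" are kept as proposal objects.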

Figure 2: Model overview. In the left top block, the base UpDn VQA system first detects a set of objects and predicts an answer. We then analyze the correct answer’s sensitivity (Fork) to the detected objects via visual explanation and extract the most influential one in the proposal object set as the most influential object, which is further strengthened via the influence strengthen loss (left bottom block). Finally, we analyze the competitive incorrect answers’ sensitivities (Knife) to the most influential object and criticize the sensitivity until the VQA system answers the question correctly (right block). The number around the bounding box is the answer’s sensitivity to the object.

4 Approach

In this section, we present our self-critical approach to prevent the most common answer from dominating the correct answer given the proposal sets of influential objects. Figure 2 shows the overview of our approach. Besides the UpDn VQA system (left top block), our approach contains two other components: we first recognize and strengthen the most influential objects (left bottom block), and then we criticize incorrect answers that are more highly ranked than the correct answer and try to make them less sensitive to these key objects (right block). As recent research suggests that gradient-based methods more faithfully represent a model’s decision-making process selvaraju2019taking ; zhang2018top ; wu2018dynamic ; jain2019attention , we use a modified GradCAM selvaraju2017grad to compute the sensitivity $\mathcal{S}(a, v_i)$ of answer $a$ to the $i$-th object features $v_i$, as shown in Eq. 1:

$$\mathcal{S}(a, v_i) := \left( \nabla_{v_i} P(a \mid \mathcal{V}, Q) \right)^{T} \mathbf{1} \qquad (1)$$

where $P(a \mid \mathcal{V}, Q)$ is the predicted confidence of answer $a$ and $\mathbf{1}$ denotes a vector with all 1’s.
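To make the sensitivity concrete: it is the gradient of an answer's confidence with respect to one object's feature vector, summed over the feature dimensions. The sketch below approximates that quantity with central finite differences on a toy model. The model (`toy_predict`, `WEIGHTS`, `ATTENTION`) is entirely hypothetical; a real system would obtain the gradient through automatic differentiation instead.

```python
import math

def sensitivity(predict, features, answer, obj, h=1e-5):
    """Approximate sum_d dP(answer)/d features[obj][d] with central differences."""
    total = 0.0
    for d in range(len(features[obj])):
        plus = [list(v) for v in features]
        minus = [list(v) for v in features]
        plus[obj][d] += h
        minus[obj][d] -= h
        total += (predict(plus)[answer] - predict(minus)[answer]) / (2 * h)
    return total

# Hypothetical toy model: one weight vector per answer, fixed attention per object.
WEIGHTS = [[0.5, -0.2], [0.1, 0.3]]
ATTENTION = [0.7, 0.3]

def toy_predict(features):
    """Return one sigmoid confidence per answer for a list of object feature vectors."""
    probs = []
    for w in WEIGHTS:
        logit = sum(
            att * sum(wd * vd for wd, vd in zip(w, v))
            for att, v in zip(ATTENTION, features)
        )
        probs.append(1.0 / (1.0 + math.exp(-logit)))
    return probs

features = [[1.0, 2.0], [0.5, -1.0]]  # two detected objects, two feature dimensions
s_obj0 = sensitivity(toy_predict, features, answer=0, obj=0)
s_obj1 = sensitivity(toy_predict, features, answer=0, obj=1)
```

For this toy model the result can be checked by hand: the gradient with respect to object `i` is `p * (1 - p) * ATTENTION[i] * w`, so answer 0 is more sensitive to the object with the larger attention weight.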

4.1 Recognizing and Strengthening Influential Objects

Given a proposal object set $\mathcal{I}$ and the entire detected object set $\mathcal{V}$, we identify the object that the correct answer is most sensitive to and further strengthen its sensitivity. We first introduce a sensitivity violation term $\mathcal{SV}(a, v_i, v_j)$ for answer $a$ and the $i$-th and $j$-th object features $v_i$ and $v_j$ as the amount by which the sensitivity to $v_j$ surpasses the sensitivity to $v_i$, as shown in Eq. 2:

$$\mathcal{SV}(a, v_i, v_j) = \max\left( \mathcal{S}(a, v_j) - \mathcal{S}(a, v_i),\ 0 \right) \qquad (2)$$

where $\mathcal{S}(a, v)$ is the sensitivity of answer $a$ to object features $v$ from Eq. 1. Based on the assumption that the proposal set contains at least one influential object that a human would use to infer the answer, we impose the constraint that the most sensitive object in the proposal set should not be less sensitive than any object outside the proposal set. Therefore, we introduce the influence strengthen loss $\mathcal{L}_{infl}$ in Eq. 3:

$$\mathcal{L}_{infl} = \min_{v_i \in \mathcal{I}} \sum_{v_j \in \mathcal{V} \setminus \mathcal{I}} \mathcal{SV}(a_{gt}, v_i, v_j) \qquad (3)$$

where $a_{gt}$ denotes the ground truth answer. The key differences between our influence strengthen loss and the ranking-based HINT loss are that (1) we relax the unnecessary constraint that the objects should follow the exact human ranking, and (2) it is easier to adapt to different types of explanation (e.g. textual explanations) where such detailed rankings are not available.
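The constraint above (the most sensitive proposal object should not be less sensitive than any object outside the proposal set) can be sketched numerically. The snippet assumes the sensitivities of the ground-truth answer have already been computed, and it follows one reading of the loss: a hinge on each outside object, summed, then minimized over the proposal objects. The function names and toy numbers are hypothetical.

```python
def sensitivity_violation(s_i, s_j):
    """Amount by which the sensitivity to object j exceeds the sensitivity to object i."""
    return max(s_j - s_i, 0.0)

def influence_strengthen_loss(sens, proposal):
    """Sketch of the influence strengthen loss.

    sens: sensitivities of the ground-truth answer to every detected object.
    proposal: indices of the objects in the proposal set (must be non-empty).
    """
    outside = [j for j in range(len(sens)) if j not in proposal]
    return min(
        sum(sensitivity_violation(sens[i], sens[j]) for j in outside)
        for i in proposal
    )

# Toy case: objects 0 and 1 are proposed, but object 3 (outside) is the most sensitive.
loss_violated = influence_strengthen_loss([0.1, 0.5, 0.3, 0.7], proposal={0, 1})

# Toy case: the proposed object already dominates every outside object.
loss_satisfied = influence_strengthen_loss([0.9, 0.1, 0.3], proposal={0})
```

Taking the minimum over proposal objects means only the best candidate in the proposal set has to dominate the outside objects, which tolerates noisy proposal sets that contain some irrelevant objects.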

4.2 Criticizing Incorrect Dominant Answers

Next, for the incorrect answers ranked higher than the correct answer, we attempt to decrease their sensitivity to the influential objects. For example, in VQA-CP, bedrooms are the most common room type. Therefore, during testing, systems frequently misclassify bathrooms (which are rare in the training data) as bedrooms. Since humans identify a sink as an influential object when identifying bathrooms, we want to decrease the influence of sinks on concluding “bedroom”.

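In order to address this issue, we design a self-critical objective to criticize the VQA system’s incorrect but competitive decisions based on the most influential object $v^*$, to which the correct answer is most sensitive, as defined in Eq. 4:

$$v^* = \arg\max_{v_i \in \mathcal{I}} \mathcal{S}(a_{gt}, v_i) \qquad (4)$$

Specifically, we extract a bucket $\mathcal{B}$ of at most $B$ predictions with higher confidence than the correct answer, $\mathcal{B} = \{ a \mid P(a \mid \mathcal{V}, Q) > P(a_{gt} \mid \mathcal{V}, Q) \}$, and utilize the proposed self-critical loss $\mathcal{L}_{crit}$ to directly minimize the weighted sensitivities of the answers in the bucket to the selected most influential object, as shown in Eq. 5:

$$\mathcal{L}_{crit} = \sum_{a \in \mathcal{B}} w(a) \left( \mathcal{S}(a, v^*) - \mathcal{S}(a_{gt}, v^*) \right) \qquad (5)$$

where $a_{gt}$ denotes the ground truth answer, $\mathcal{I}$ is the proposal object set, and $\mathcal{S}$ is the sensitivity from Eq. 1. Because several answer candidates could be similar in meaning, we weight the sensitivity gaps in Eq. 5 by the cosine distance between the answers’ Glove embeddings pennington2014glove , i.e. $w(a) = \text{cosine\_dist}(\text{Glove}(a_{gt}), \text{Glove}(a))$. In the multi-word answer case, the Glove embeddings of these answers are computed as the sum of the individual words’ Glove embeddings.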
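A small numerical sketch of the self-critical step may help. It assumes precomputed confidences and sensitivities, picks the proposal object to which the ground-truth answer is most sensitive, gathers the higher-ranked wrong answers into a bucket, and sums their cosine-distance-weighted sensitivity gaps. All names, the two-dimensional embeddings, and the toy numbers are hypothetical.

```python
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - (dot / norm if norm > 0 else 0.0)

def self_critical_loss(probs, sens, proposal, gt, embed, bucket_size=5):
    """Sketch of the self-critical loss for one QA pair.

    probs: dict answer -> predicted confidence.
    sens: dict answer -> list of sensitivities, one per detected object.
    proposal: list of proposal object indices.
    gt: ground-truth answer.
    embed: dict answer -> embedding vector (toy stand-in for Glove).
    """
    # Most influential object: the proposal object the correct answer is most sensitive to.
    v_star = max(proposal, key=lambda i: sens[gt][i])
    # Bucket: wrong answers ranked above the correct one, most confident first.
    bucket = sorted(
        (a for a in probs if probs[a] > probs[gt]),
        key=lambda a: -probs[a],
    )[:bucket_size]
    loss = 0.0
    for a in bucket:
        weight = cosine_distance(embed[gt], embed[a])
        loss += weight * (sens[a][v_star] - sens[gt][v_star])
    return loss, v_star, bucket

probs = {"banana": 0.3, "hot dog": 0.6, "apple": 0.1}
sens = {"banana": [0.2, 0.5], "hot dog": [0.1, 0.9], "apple": [0.0, 0.0]}
embed = {"banana": [1.0, 0.0], "hot dog": [0.0, 1.0], "apple": [1.0, 0.1]}

loss, v_star, bucket = self_critical_loss(probs, sens, [0, 1], "banana", embed)
```

Here "hot dog" outranks the correct answer "banana" and is more sensitive to the key object, so the loss is positive; minimizing it pushes the wrong answer's sensitivity to that object down relative to the correct answer's.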

4.3 Implementation and Training Details

In this section, we describe the detailed implementation and training procedure of our self-critical approach to VQA.

Training Details. We first pre-train our base UpDn VQA system on the VQA-CP training set using the standard VQA loss (binary cross entropy loss with soft scores as supervision) with the Adam optimizer kingma2014adam for 10 epochs. As suggested in teney2017tips , the learning rate is fixed to 10e-3 with a batch size of 512 during the pre-training process, and we use hidden units in the base UpDn VQA system. After that, we fine-tune the VQA system with the joint loss for another epochs using the Adam optimizer with a learning rate of 10e-4 and a batch size of 512.

In our experiments, we also vary the self-critical loss weight to analyze the impact of the self-critical loss. The bucket size of the competitive answers is set to 5 because we observed that the top-5 overall score of the pre-trained system on the VQA-CP dataset reaches 80.4%, and increasing the bucket size only marginally improves the score.

Implementation. We implemented our approach on top of the publicly available reimplementation of the original UpDn system. The base system utilizes a Faster R-CNN head girshick2015fast in conjunction with a ResNet-101 base network he2016deep as the object detection module. The detection head is pre-trained on the Visual Genome dataset krishna2017visual and is capable of detecting object categories and attributes. UpDn takes the final detection outputs and performs non-maximum suppression (NMS) for each object category using an IoU threshold. Then, the convolutional features for the top objects are extracted for each image as the visual features, one feature vector for each object. For question embedding, following anderson2017bottom , we perform standard text pre-processing and tokenization. In particular, questions are first converted to lower case and then trimmed to a maximum number of words, and the words that appear fewer than a minimum number of times are replaced with an “unk” token. A single-layer GRU cho2014learning is used to sequentially process the word vectors and produce a sentential representation for the pre-processed question. We also use Glove vectors pennington2014glove to initialize the word embedding matrix when embedding the questions.
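The question pre-processing steps described above (lower-casing, trimming, and replacing rare words with “unk”) can be sketched as follows. The whitespace tokenizer and the `max_len` and `min_count` values are hypothetical choices for this example, not the settings used in the paper.

```python
from collections import Counter

def tokenize(question):
    """Lower-case and split on whitespace, treating '?' as a separator."""
    return question.lower().replace("?", " ").split()

def build_vocab(questions, min_count):
    """Keep only the words that appear at least min_count times in the training questions."""
    counts = Counter(tok for q in questions for tok in tokenize(q))
    return {tok for tok, c in counts.items() if c >= min_count}

def preprocess(question, vocab, max_len):
    """Trim to max_len tokens and map out-of-vocabulary words to 'unk'."""
    tokens = tokenize(question)[:max_len]
    return [tok if tok in vocab else "unk" for tok in tokens]

train_questions = [
    "What is the man eating?",
    "What sport is the man playing?",
    "What is the color of the cat?",
]
vocab = build_vocab(train_questions, min_count=2)   # hypothetical frequency cutoff
tokens = preprocess("What is the man eating today?", vocab, max_len=5)
```

The resulting token list is what would be mapped to word indices and fed to the GRU.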

5 Experimental Results

We conduct experiments mainly on the VQA-CP (Visual Question Answering under Changing Priors) vqa-cp dataset, where the QA pairs in the training data and test data have significantly different distributions. We also present experimental results on the VQA v2 validation set for completeness. We compare our self-critical system’s VQA performance with the state-of-the-art systems via the standard evaluation metric. After that, we perform ablation studies to verify the contribution of strengthening the influential objects and criticizing competitive answers. Finally, we show some qualitative examples to illustrate the effectiveness of criticizing the false answers’ sensitivity.

| System | Expl. | VQA-CP v2 test: All | Yes/No | Num | Other | VQA v2 val: All | Yes/No | Num | Other |
|---|---|---|---|---|---|---|---|---|---|
| GVQA vqa-cp | - | 31.3 | 58.0 | 13.7 | 22.1 | 48.2 | 72.0 | 31.2 | 34.7 |
| UpDn anderson2017bottom | - | 39.7 | 42.7 | 11.9 | 46.1 | 63.5 | 81.2 | 42.1 | 55.7 |
| UpDn+AttAlign selvaraju2019taking | - | 38.5 | 42.5 | 11.4 | 43.8 | 61.0 | 78.9 | 38.4 | 53.3 |
| UpDn+AdvReg. ramakrishnan2018overcoming | - | 41.2 | 65.5 | 15.5 | 35.5 | 62.8 | 79.8 | 42.4 | 55.2 |
| UpDn+SCR (ours) | QA | 48.5 | 70.5 | 10.4 | 47.3 | 62.3 | 77.4 | 40.9 | 56.5 |
| UpDn+HINT selvaraju2019taking | HAT | 47.7 | 70.0 | 10.7 | 46.3 | 62.5 | 80.5 | 41.8 | 54.0 |
| UpDn+SCR (ours) | HAT | 49.2 | 71.6 | 10.7 | 47.8 | 62.2 | 78.9 | 41.4 | 54.3 |
| UpDn+SCR (ours) | VQA-X | 49.5 | 71.6 | 11.3 | 48.4 | 62.2 | 78.8 | 41.6 | 54.5 |

Table 1: Comparison of the results on the VQA-CP test and VQA v2 validation datasets with the state-of-the-art systems. The first half includes VQA systems without human explanations during training, and the VQA systems in the second half use either visual or textual explanations. The “Expl.” column shows the source of explanations for training the VQA systems. Accuracies in percentages (%) are reported. SCR is shorthand for our self-critical reasoning approach.

5.1 VQA Performance on VQA-CP and VQA v2 dataset

We first report the quantitative results of the VQA generalization task and compare our results with the state-of-the-art methods in this section. We also report our system’s performance on the balanced VQA v2 validation set for completeness.

As demonstrated in Table 1, our system significantly outperforms other state-of-the-art systems (e.g. HINT selvaraju2019taking ) in overall score (49.2% vs. 47.7%) when using the same amount of human visual explanation (VQA-HAT), which indicates the effectiveness of directly criticizing the competitive answers’ sensitivity to the most influential objects. We also observe that using human textual explanations as supervision is somewhat more effective. With only about half the number of explanations compared to VQA-HAT, these textual explanations help the VQA system to perform better by 0.3% overall score, achieving the new state-of-the-art score of 49.5%.

In addition, we observe that our self-critical objective helps the VQA system especially in the “Yes/No” and “Other” question categories; however, it does not do well in the “Num” category. This is understandable because counting problems are generally harder than the other two types and require the VQA system to jointly consider all the objects. Therefore, criticizing only the most sensitive ones does not improve the performance.

On the VQA v2 validation set, the VQA systems trained using our self-critical loss perform competitively with previous approaches. This indicates that criticizing the wrong answers’ sensitivities does not hurt performance substantially even when the training and test data have the same distribution.

5.2 Ablation Study on the Self-Critical Loss

In this section, we evaluate the impact of varying the weight of the self-critical loss on the VQA-CP test data using VQA-HAT visual explanations. Results are in Table 2. We observe that even without the influence strengthen loss to strengthen the most influential objects for the correct answer, our self-critical approach helps the UpDn VQA system improve its overall score. Combined with the influence strengthen loss, our approach with the self-critical loss sets a new state-of-the-art score (49.2%) on the VQA-CP test set using visual explanations. We also notice that our approach is robust to changes in the weight of the self-critical loss and consistently improves VQA performance for a wide range of loss weights, as demonstrated in Table 2.

| System | Expl. | VQA-CP v2 test: All | Yes/No | Num | Other |
|---|---|---|---|---|---|
| GVQA vqa-cp | - | 31.3 | 58.0 | 13.7 | 22.1 |
| UpDn anderson2017bottom | - | 39.7 | 42.7 | 11.9 | 46.1 |
| UpDn+SCR (ours) | HAT | 46.0 | 64.4 | 11.1 | 42.3 |
| UpDn+SCR (ours) | HAT | 47.9 | 67.6 | 12.5 | 47.3 |
| UpDn+SCR (ours) | HAT | 48.8 | 71.0 | 11.3 | 47.5 |
| UpDn+SCR (ours) | HAT | 49.2 | 71.6 | 10.7 | 47.8 |
| UpDn+SCR (ours) | HAT | 48.7 | 71.1 | 11.5 | 47.2 |

Table 2: Ablation study on various self-critical loss weights on VQA-CP test data. The first half reports the VQA systems without human explanations during training, and the VQA systems in the second half use either visual or textual explanations. The “Expl.” column shows the source of explanations for training the VQA systems. Each UpDn+SCR row corresponds to a different self-critical loss weight. SCR is shorthand for our self-critical reasoning approach. Accuracies in percentage (%) are reported.

5.3 Effectiveness of Criticizing False Sensitivity

In this section, we quantitatively evaluate the effectiveness of the proposed self-critical objective. In particular, we evaluate the fraction of false sensitivity where the predicted incorrect answer’s sensitivity to the influential object (to which the correct answer is most sensitive) is greater than the correct answer’s sensitivity.

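We formally define the false sensitivity rate (FSR) in Eq. 6:

$$\text{FSR} = \frac{1}{N} \sum_{n=1}^{N} \mathbb{1}\left[ a_{pred}^{(n)} \neq a_{gt}^{(n)} \ \wedge\ \mathcal{S}\big(a_{pred}^{(n)}, v^{*(n)}\big) > \mathcal{S}\big(a_{gt}^{(n)}, v^{*(n)}\big) \right] \qquad (6)$$

where the sum runs over the $N$ test QA pairs, $a_{pred}$ is the predicted answer, $a_{gt}$ is the correct answer, $v^*$ is the influential object to which the correct answer is most sensitive, $\mathcal{S}$ is the answer’s sensitivity to an object, and $\mathbb{1}[\cdot]$ denotes the function that returns 1 if the condition is satisfied and returns 0 otherwise.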
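The false sensitivity rate described in this section can be computed as in the sketch below. It follows the prose definition (a wrong prediction whose sensitivity to the key object exceeds that of the correct answer) and assumes the rate is taken over all evaluated QA pairs; the record layout and the toy values are hypothetical.

```python
def false_sensitivity_rate(examples):
    """Fraction of QA pairs with a wrong prediction that is more sensitive to the
    most influential object than the correct answer is.

    Each example is a dict with keys:
      gt, pred             : ground-truth and predicted answers
      sens_gt, sens_pred   : their sensitivities to the most influential object
    """
    flagged = sum(
        1
        for e in examples
        if e["pred"] != e["gt"] and e["sens_pred"] > e["sens_gt"]
    )
    return flagged / len(examples)

examples = [
    {"gt": "banana", "pred": "hot dog", "sens_gt": 0.5, "sens_pred": 0.9},    # false sensitivity
    {"gt": "banana", "pred": "hot dog", "sens_gt": 0.5, "sens_pred": 0.2},    # wrong, but not sensitive
    {"gt": "tennis", "pred": "tennis", "sens_gt": 0.7, "sens_pred": 0.7},     # correct
    {"gt": "bathroom", "pred": "bathroom", "sens_gt": 0.4, "sens_pred": 0.4}, # correct
]
fsr = false_sensitivity_rate(examples)
```

Only the first record counts as false sensitivity, so one of the four toy examples is flagged.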

For the original UpDn VQA system, we observe a false sensitivity rate of 35.5% among all the test QA pairs in the VQA-CP data. After the self-critical training, the false sensitivity rate drops to 20.4% using the VQA-HAT visual explanations and to 19.6% using VQA-X textual explanations. This shows that false sensitivity is a common problem in VQA systems and demonstrates the utility of criticizing it.

Some sampled examples of how our self-critical approach mitigates false sensitivity are shown in Figure 3. We observe that the most influential object (for the correct answer) becomes more influential, which we attribute to the influence strengthening part. More importantly, we observe that this object’s influence on the incorrect answer decreases and sometimes falls below other objects.

Figure 3: Positive examples showing that our self-critical reasoning approach prevents the incorrectly predicted answer in the UpDn baseline system from being sensitive to the most influential object. For each example, the top two figures show the object, to which the ground truth (left) and incorrectly predicted (right) answers are sensitive. The bottom two figures show the corresponding most influential object after our self-critical training. Note that the attention for the incorrect answer shifts to a more relevant part of the image for that answer. The number around the bounding box is the answer’s sensitivity to the object.

| UpDn | UpDn + QA | UpDn + HAT | UpDn + VQA-X |
|---|---|---|---|
| 35.5% | 22.6% | 20.4% | 19.6% |

Table 3: False sensitivity rate comparison of using different types of human explanations.

6 Conclusion and Future Work

In this work, we have explored how to improve VQA performance by criticizing the sensitivity of incorrect answers to the most influential object for the correct answer. Our “self-critical” approach helps VQA systems generalize to test data where the distribution of question-answer pairs is significantly different from the training data. The influential objects are selected from a proposal set extracted from human visual or textual explanations, or simply from the objects mentioned in the questions and answers. Our approach outperforms the state-of-the-art VQA systems on the VQA-CP dataset by a clear margin even without human explanations as additional supervision. In the future, we would like to combine the visual and the textual explanations to better train VQA systems. This is difficult because the proposal object sets of these two types of explanations contain different types of noise (i.e. question-irrelevant objects), and therefore different biases.