SQuINTing at VQA Models: Interrogating VQA Models with Sub-Questions

Existing VQA datasets contain questions with varying levels of complexity. While the majority of questions in these datasets require perception for recognizing existence, properties, and spatial relationships of entities, a significant portion of questions pose challenges that correspond to reasoning tasks – tasks that can only be answered through a synthesis of perception and knowledge about the world, logic and / or reasoning. This distinction allows us to notice when existing VQA models have consistency issues – they answer the reasoning question correctly but fail on associated low-level perception questions. For example, models answer the complex reasoning question "Is the banana ripe enough to eat?" correctly, but fail on the associated perception question "Are the bananas mostly green or yellow?" indicating that the model likely answered the reasoning question correctly but for the wrong reason. We quantify the extent to which this phenomenon occurs by creating a new Reasoning split of the VQA dataset and collecting Sub-VQA, a new dataset consisting of 200K new perception questions which serve as sub questions corresponding to the set of perceptual tasks needed to effectively answer the complex reasoning questions in the Reasoning split. Additionally, we propose an approach called Sub-Question Importance-aware Network Tuning (SQuINT), which encourages the model to attend to the same parts of the image when answering the reasoning question and the perception sub questions. We show that SQuINT improves model consistency by about 7.8 percentage points on the Reasoning questions in VQA, while also displaying qualitatively better attention maps.



1 Introduction

Figure 1: A potential reasoning failure: Current models answer “Yes” correctly to the Reasoning question “Is the banana ripe enough to eat?”. We might assume that correctly answering the Reasoning question stems from perceiving relevant concepts correctly – perceiving yellow bananas in this example. But when asked “Are the bananas mostly green or yellow?”, it answers “Green” incorrectly – indicating that the model possibly answered the original for the wrong reasons even if the answer was right. We quantify the extent to which this phenomenon occurs in VQA and introduce a new dataset aimed at stimulating research on well grounded reasoning.

Human cognition is thought to be compositional in nature: the visual system recognizes multiple aspects of a scene which are combined into shapes [7] and understandings. Likewise, complex linguistic expressions are built from simpler ones [5]. Tasks like Visual Question Answering (VQA) require models to perform inference at multiple levels of abstraction. For example, to answer the question “Is the banana ripe enough to eat?” (Figure 1), a VQA model has to be able to detect the bananas and extract associated properties such as size and color (perception), understand what the question is asking, and reason about how these properties relate to known properties of edible bananas (ripeness) and how they manifest (yellow versus green in color). While “abstraction” is complex and spans distinctions at multiple levels of detail, we focus on separating questions into Perception and Reasoning questions. Perception questions only require visual perception to recognize existence, physical properties or spatial relationships among entities, such as “What color is the banana?”, or “What is to the left of the man?” while Reasoning questions require the composition of multiple perceptual tasks and knowledge that harnesses logic and prior knowledge about the world, such as “Is the banana ripe enough to eat?”.

Current VQA datasets [3, 6, 14] contain a mixture of Perception and Reasoning questions, which are considered equivalent for the purposes of evaluation and learning. Categorizing questions into Perception and Reasoning promises to promote a better assessment of visual perception and higher-level reasoning capabilities of models, rather than conflating these capabilities. Furthermore, we believe it is useful to identify the Perception questions that serve as subtasks in the compositional processes required to answer the Reasoning question. By elucidating such “sub-questions,” we can check whether the model is reasoning appropriately or if it is relying on spurious shortcuts and biases in datasets [1]. For example, we should be cautious about the model’s inferential ability if it simultaneously answers “no” to “Are the bananas edible?” and “yellow” to “What color are the bananas?”, even if the answer to the former question is correct. The inconsistency between the higher-level reasoning task and the lower-level perception task that it builds on suggests that the system has not effectively learned how to answer the Reasoning question and will not be able to generalize to the same or a closely related Reasoning question with another image. The fact that these sub-questions are in the same modality (i.e. questions with associated answers) allows for the evaluation of any VQA model, rather than only models that are trained to provide justifications, and it is this key observation that we use to develop an evaluation methodology for Reasoning questions.

The dominant learning paradigm assumes that models are given image, question, answer triplets, with no additional annotation on the relationship between the question and the compositional steps required to arrive at the answer. As reasoning questions become more complex, achieving good coverage and generalization with methods used to date will likely require a prohibitive amount of data. We employ a hierarchical decomposition strategy, where we identify and link Reasoning questions with sets of appropriate Perception sub-questions. Such an approach promises to enable new efficiencies via compositional modeling, as well as lead to improvements in the consistency of models for answering Reasoning questions. Explicitly representing dependencies between Reasoning tasks and the corresponding Perception tasks also provides language-based grounding for reasoning questions where visual grounding [13, 17] may be insufficient, e.g., highlighting that the banana is important for the question in Figure 1 does not tell the model how it is important (i.e. that color is an important property rather than size or shape). Again, the fact that such grounding is in question-answer form (which models already have to deal with) is an added benefit. Such annotations allow for attempts to enforce reasoning devoid of shortcuts that do not generalize, or are not in line with human values and business rules, even if accurate (e.g. racist behavior).

We propose a new split of the VQA dataset, containing only Reasoning questions (as defined previously). Furthermore, for questions in the split, we introduce Sub-VQA, a new dataset of 132k associated Perception  sub-questions which humans perceive as containing the components needed to answer the original questions. After validating the quality of the new dataset, we use it to perform fine-grained evaluation of state-of-the-art models, checking whether their reasoning is in line with their perception. We show that state-of-the-art VQA models have similar accuracy in answering perception and reasoning tasks but that they have problems with consistency; in 28.14% of the cases where models answer the reasoning question correctly, they fail to answer the corresponding perception sub-question, highlighting problems with consistency and the risk that models may be learning to answer reasoning questions through learning common answers and biases.

Finally, we introduce SQuINT – a generic modeling approach that is inspired by the compositional learning paradigm observed in humans. SQuINT uses the additional Sub-VQA annotations by encouraging image regions important for the sub-question to play a role in answering the main Reasoning question. We demonstrate that it results in models that are more consistent across Reasoning and associated Perception tasks with no loss of accuracy. We also find that SQuINT improves model attention maps for Reasoning questions, thus making models more trustworthy.

2 Related Work

Visual Question Answering [3], one of the most widely studied vision-and-language problems, requires associating image content with natural language questions and answers (thus combining perception, language understanding, background knowledge and reasoning). However, it is possible for models to do well on the task by exploiting language and dataset biases, e.g. answering “yellow” to “What color is the banana?” without regard for the image or answering “yes” to most yes-no questions [1, 17, 19, 2]. This motivates additional forms of evaluation, e.g. checking if the model can understand question rephrasings [18] or whether it exhibits logical consistency [15]. In this work, we present a novel evaluation of questions that require reasoning and background knowledge, where we distinguish the visual perception aspect from the reasoning aspect through associated sub questions as reasoning components.

A variety of datasets have been released with attention annotations on the image pointing to regions that are important to answer questions ([4, 10]), with corresponding work on enforcing such grounding [16, 13, 17]. Our work is complementary to these approaches, as we provide language-based grounding (rather than visual), and further evaluate the link between perception components and how they are composed by models (e.g. in Figure 1 visual grounding would point at the bananas, while we evaluate if the model understands how their color is what is important, and try to enforce it during learning). Closer to our work is the dataset of Lisa et al. [10], where natural language justifications are associated with (question, answer) pairs. However, most of the questions contemplated (like much of the VQA dataset) pertain to perception questions (e.g. for the question-answer “What is the person doing? Snowboarding”, the justification is “…they are on a snowboard …”). Furthermore, it is hard to use natural language justifications to evaluate models that do not generate similar rationales (i.e. most SOTA models), or even coming up with metrics for models that do. In contrast, our dataset and evaluation is in the same modality (QA) that models are already trained to handle.

3 Reasoning-VQA and Sub-VQA

In the first part of this section, we present an analysis of the common type of questions in the VQA dataset and highlight the need for classifying them into Perception and Reasoning questions. We then define Perception and Reasoning questions and describe our method for constructing the Reasoning split. In the second part, we describe how we collect sub-questions and answers for questions in our Reasoning split. Finally, we describe experiments conducted in order to validate the quality of our collected data.

3.1 Perception vs. Reasoning

A common technique for finer grained evaluation of VQA models is grouping instances by answer type (yes/no, number, other) or by the first words of the question (what color, how many, etc) [3]. While useful, such slices are coarse and do not evaluate the model’s capabilities at different points in the abstraction scale. For example, questions like “Is this a banana?” and “Is this a healthy food?” start with the same words and expect yes/no answers. While both test if the model can do object recognition, the latter requires additional prior knowledge about which food items are healthy and which are not. This is not to say that Reasoning questions are inherently harder, but that they require both visual understanding and an additional set of skills (logic, prior knowledge, etc) while Perception questions deal mostly with visual understanding. For example, the question “How many round yellow objects are to the right of the smallest square object in the image?” requires very complicated visual understanding, and is arguably harder than “Is the banana ripe enough to eat?”, which requires relatively simple visual understanding (color of the bananas) and knowledge about properties of ripe bananas. Regardless of difficulty, categorizing questions as Perception or Reasoning is useful for evaluation and learning, as we demonstrate in later sections. We now proceed to define these categories more formally.

Perception : We define Perception questions as those which can be answered by detecting and recognizing the existence, physical properties and / or spatial relationships between entities, recognizing text / symbols, simple activities and / or counting, and that do not require more than one hop of reasoning or general commonsense knowledge beyond what is visually present in the image. Some examples are: “Is that a cat? ” (existence), “Is the ball shiny?” (physical property), “What is next to the table?” (spatial relationship), “What does the sign say?” (text / symbol recognition), “Are the people looking at the camera?” (simple activity), etc. We note that spatial relationship questions have been considered reasoning tasks in previous work [9] as they require lower-level perception tasks in composition to be answered. For our purposes it is useful to separate visual understanding from other types of reasoning and knowledge, and thus we classify such spatial relationships as Perception.

Reasoning : We define Reasoning questions as non-Perception questions which require the synthesis of perception with prior knowledge and / or reasoning in order to be answered. For instance, “Is this room finished or being built?”, “At what time of the day would this meal be served?”, “Does this water look fresh enough to drink?”, “Is this a home or a hotel?”, “Are the giraffes in their natural habitat?” are all Reasoning questions.

Our analysis of the perception questions in the VQA dataset revealed that most perception questions have distinct patterns that can be identified with high-precision regex-based rules. By hand-crafting such rules (details in the Supplementary) and filtering out perception questions, we identify 18% of the VQA dataset as highly likely to be Reasoning. To check the accuracy of our rules and validate their coverage of Reasoning questions, we designed a crowdsourcing task on Mechanical Turk that instructed workers to identify a given VQA question as Perception or Reasoning, and to subsequently provide sub-questions for the Reasoning questions, as described next. Trained workers classified 94.7% of our resulting questions as Reasoning questions, demonstrating the high precision of the regex-based rules we created.

3.2 Sub-VQA

Given the complexity of distinguishing between Perception / Reasoning and providing sub-questions for Reasoning questions, we first train and filter workers on Amazon Mechanical Turk (AMT) via qualification rounds before we rely on them to generate high-quality sub-questions.

Worker Training - We manually annotate questions from the VQA dataset as Perception and as Reasoning questions, to serve as examples. We first teach workers the difference between Perception and Reasoning questions by defining them and showing several examples of each, along with explanations. Then, workers are shown (question, answer) pairs and are asked to identify if the given question is a Perception question or a Reasoning question (we also add an “Invalid” category to flag nonsensical questions or those which can be answered without looking at the image). Finally, for Reasoning questions, we ask workers to add all Perception questions and corresponding answers (in short) that would be necessary to answer the main question (details and interface in the Supplementary). In this qualification HIT, workers have to make 6 Perception and Reasoning judgments, and they qualify if they get 5 or more answers right.

We launched further pilot experiments for the workers who passed the first qualification round, where we manually evaluated the quality of their sub-questions based on whether they were Perception questions grounded in the image and sufficient to answer the main question. Among the 463 workers who passed the first qualification test, 91 were selected (via manual evaluation) as high-quality workers, who finally qualified for attempting our main task.

Main task - In the main data collection, all VQA questions that got identified as Reasoning by regex-rules and a random subset of the questions identified as Perception were further judged by workers (for validation purposes). We eliminated ambiguous questions by further filtering out questions where there is high worker disagreement about the answer. We require at least 8 out of 10 workers to agree with the majority answer for yes/no questions and 5 out of 10 for all other questions, which leaves us with a split that corresponds to 13% of the VQA dataset. (Until the time of submission, we have collected sub questions for VQAv1 train, which corresponds to 27441 Reasoning questions and 79905 sub questions for them. For VQAv2 val we have 15448 Reasoning questions and 52573 sub questions for them.) This Reasoning split is not exhaustive, but is high precision (as demonstrated below) and contains questions that are not ambiguous, and thus is useful for evaluation and learning.
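The agreement filter described above can be sketched in a few lines. This is a minimal illustration rather than the pipeline actually used: the function name `passes_agreement`, the lower-casing of answers before counting, and the example answers are all assumptions made here; only the 8-of-10 and 5-of-10 thresholds come from the text.

```python
from collections import Counter


def passes_agreement(answers, is_yes_no, yes_no_min=8, other_min=5):
    """Return True if enough workers agree with the majority answer.

    `answers` is the list of worker answers for one question (10 in the
    setting described in the text). Normalising by strip/lower is an
    assumption of this sketch, not something the paper specifies.
    """
    if not answers:
        return False
    normalised = [a.strip().lower() for a in answers]
    majority_count = Counter(normalised).most_common(1)[0][1]
    threshold = yes_no_min if is_yes_no else other_min
    return majority_count >= threshold


# Hypothetical answer lists, for illustration only.
yes_no_answers = ["yes"] * 7 + ["no"] * 3            # 7 of 10 agree
other_answers = ["beach"] * 5 + ["ocean"] * 3 + ["shore"] * 2  # 5 of 10 agree

keep_yes_no = passes_agreement(yes_no_answers, is_yes_no=True)
keep_other = passes_agreement(other_answers, is_yes_no=False)
```

Under these thresholds the yes/no example is dropped as ambiguous, while the open-ended example is kept.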

Each (question, image) pair labeled as Reasoning had sub questions generated by 3 unique workers (a small number of workers displayed degraded performance after the qualification round, and were manually filtered). On average we have 2.60 sub-questions per Reasoning question. Qualitative examples are shown in Fig. 2.

Figure 2: Qualitative examples of Perception sub-questions in our Sub-VQA dataset for main questions in the Reasoning split of VQA. Main questions are written in orange and sub questions are in blue. A single worker may have provided more than one sub question for the same (image, main question) pair.

3.3 Dataset Quality Validation

In order to confirm that the sub-questions in Sub-VQA are really Perception questions, we did a further round of evaluation with workers who passed the worker qualification task described in Section 3.2 but had not provided sub-questions for our main task. In this round, of sub-questions in Sub-VQA were judged to be Perception questions by at least 2 out of 3 workers.

It is crucial for the semantics of Sub-VQA that the sub-questions are tied to the original Reasoning question. While verifying that the sub-questions are necessary to answer the original question requires workers to think of all possible ways the original question could be answered (and is thus too hard), we devised an experiment to check if the sub-questions provide at least sufficient visual understanding to answer the Reasoning question. In this experiment, workers are shown the sub-questions with answers, and then asked to answer the Reasoning question without seeing the image, thus having to rely only on the visual knowledge conveyed by the sub-questions. At least 2 out of 3 workers were able to answer 89.3% of the Reasoning questions correctly in this regime (95.4% of binary Reasoning questions). For comparison, when we asked workers to answer Reasoning questions with no visual knowledge at all (no image and no sub-questions), this accuracy was 52% (58% for binary questions). These experiments give us confidence that the sub-questions in Sub-VQA are indeed Perception questions that convey components of visual knowledge which can be composed to answer the original Reasoning questions.

4 Dataset Analysis

Figure 3: Left: Distribution of questions by their first four words. The arc length is proportional to the number of questions containing the word. White areas are words with contributions too small to show, Right: Distribution of answers per question type
Figure 4: Percentage of questions with different word lengths for the train and val sub-questions of our Sub-VQA dataset.

The distribution of questions in our Sub-VQA dataset is shown in Figure 3. Comparing these plots with those for the VQA dataset [3] shows that the Sub-VQA dataset questions are more specific. For example, there are no “why” questions in the dataset, which tend to be reasoning questions. Also, for “where” questions, a very common answer in VQA was “outside”, but answers are more specific in our Sub-VQA dataset (e.g., “beach”, “street”). Figure 4 shows the distribution of question lengths in the Perception and Reasoning splits of VQA and in our Sub-VQA dataset. We see that most questions range from 4 to 10 words. Lengths of questions in the Perception and Reasoning splits are quite similar, although questions in Sub-VQA are slightly longer (the curve is slightly shifted to the right), possibly on account of the increase in specificity/detail of the questions.

One interesting question is whether the main question and the sub-questions deal with the same concepts. In order to explore this, we used noun chunks as surrogates for concepts (extracted with the Python spaCy library), and measured how often there was any overlap in concepts between the main question and the associated sub-question. Noun chunks are only a surrogate and may miss semantic overlap otherwise present (e.g. through verb-noun connections like “fenced” and “a fence” in Figure 2 (b), sub-questions). With this caveat, we observe that there is any overlap only 19.19% of the time, indicating that Reasoning questions in our split often require knowledge about concepts not explicitly mentioned. The lack of overlap indicates that the model has to perform language-based reasoning or use background knowledge in addition to visual perception. For example, in the question “Is the airplane taking off or landing?”, the concepts present are ‘airplane’ and ‘landing’, while for the associated sub-question “Are the wheels out?”, the concept is ‘wheels’. While wheels are not mentioned explicitly in the main question, the concept is important, such that providing this grounding might help the model explicitly make the connection between airplane ‘wheels’ and take-offs and landings.
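To make the overlap measurement concrete, the following sketch computes the fraction of (main question, sub-question) pairs that share at least one concept. It assumes the concepts have already been extracted (for instance as noun chunks) and are supplied as sets of strings; the function name `overlap_rate` and the second toy pair are hypothetical and not taken from the dataset.

```python
def overlap_rate(pairs):
    """Fraction of (main_concepts, sub_concepts) pairs sharing any concept.

    Each element of `pairs` is a tuple of two iterables of concept strings.
    Concept extraction itself is assumed to have happened upstream.
    """
    if not pairs:
        return 0.0
    hits = sum(1 for main, sub in pairs if set(main) & set(sub))
    return hits / len(pairs)


pairs = [
    # Example from the text: no shared concept.
    ({"airplane", "landing"}, {"wheels"}),
    # Hypothetical pair with a shared concept, for illustration only.
    ({"table", "food"}, {"table"}),
]

rate = overlap_rate(pairs)
```

An exact string match like this is a deliberately strict choice and, as noted above for noun chunks generally, will miss overlaps that are only semantic.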

5 Fine grained evaluation of VQA Reasoning

Sub-VQA enables a more detailed evaluation of the performance of current state-of-the-art models on Reasoning questions by checking whether correctness on these questions is consistent with correctness on the associated Perception sub-questions. It is important to notice that a Perception failure (an incorrect answer to a sub-question) may be due to a problem in the vision part of the model or a grounding problem – the model in Figure 1 may know that the banana is mostly yellow and use that information to answer the ripeness question, while, at the same time, fail to associate this knowledge with the word “yellow”, or fail to understand what the sub-question is asking. While grounding problems are not strictly visual perception failures, we still consider them Perception failures because the goal of VQA is to answer natural language questions about an image, and the sub-questions being considered pertain to Perception knowledge as defined previously. With this caveat, there are four possible outcomes when evaluating Reasoning questions with associated Perception sub-questions, which we arrange into four quadrants:

Q1: Both main & sub-questions correct (M✓ S✓): While we cannot claim that the model predicts the main question correctly because of the sub-questions (e.g. the bananas are ripe because they are mostly yellow), the fact that it answers both correctly is consistent with good reasoning, and should give us more confidence in the original prediction.

Q2: Main correct & sub-question incorrect (M✓ S✗): The Perception failure indicates that there might be a reasoning failure. While it is possible that the model is composing other perception knowledge that was not captured by the identified sub-questions (e.g. the bananas are ripe because they have black spots on them), it is also possible (and more likely) that the model is using a spurious shortcut or was correct by random chance.

Q3: Main incorrect & sub-question correct (M✗ S✓): The incorrect main answer despite correct perception indicates a clear reasoning failure, as we validated that the sub-questions are sufficient to answer the main question. In this case, the model knows that the bananas are mostly yellow and still thinks they are not ripe enough, and thus it failed to make the “yellow bananas are ripe” connection.

Q4: Both main & sub-question incorrect (M✗ S✗): While the model may not have the reasoning capabilities to answer questions in this quadrant, the Perception failure could explain the incorrect prediction.

In sum, Q2 and Q4 are definitely Perception failures, Q2 likely contains Reasoning failures, Q3 contains Reasoning failures, and we cannot judge Reasoning in Q4.
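The four quadrants can be tallied mechanically once each (main question, sub-question) pair has been scored. The sketch below is one straightforward way to do so; the function names and the tiny list of correctness flags are assumptions for illustration, not part of any released evaluation code.

```python
from collections import Counter


def quadrant(main_correct, sub_correct):
    """Map a pair of correctness flags to the quadrant label used in the text."""
    if main_correct and sub_correct:
        return "Q1"  # M correct, S correct
    if main_correct:
        return "Q2"  # M correct, S incorrect
    if sub_correct:
        return "Q3"  # M incorrect, S correct
    return "Q4"      # M incorrect, S incorrect


def quadrant_percentages(results):
    """Percentage of pairs falling into each quadrant.

    `results` is a list of (main_correct, sub_correct) boolean tuples.
    """
    labels = ("Q1", "Q2", "Q3", "Q4")
    if not results:
        return {q: 0.0 for q in labels}
    counts = Counter(quadrant(m, s) for m, s in results)
    return {q: 100.0 * counts[q] / len(results) for q in labels}


# Hypothetical scored pairs, one per quadrant.
toy_results = [(True, True), (True, False), (False, True), (False, False)]
toy_percentages = quadrant_percentages(toy_results)
```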

As an example, we evaluate the Pythia model [11] (SOTA as of 2018; source: https://visualqa.org/roe_2018.html) along these quadrants (Table 1) for the Reasoning split of the VQA dataset. The overall accuracy of the model is 60.26%, while accuracy on Reasoning questions is 65.99%. We note that for 28.14% of the cases, the model is inconsistent, i.e., it answered the main question correctly, but got the sub question wrong. Further, we observe that 14.92% of the time the Pythia model gets all the sub questions wrong when the main question is right – that is, it seems to be severely wrong on its perception.

6 Improving learned models with Sub-VQA

Figure 5: Our Sub-Question Importance-aware Network Tuning (SQuINT) approach: Given an image, a Reasoning question like “What season is it?” and an associated Perception sub-question like “Is there a Christmas tree pictured on a cell phone?”, we pass them through the Pythia architecture [11]. For the example shown, the model incorrectly answers “Fall” for the main Reasoning question “What season is it?” but correctly answers “Yes” for the Perception sub-question. By using an image embedding conditioned on sub-question and image features, we encourage the model to use the right region for reasoning. In addition we also add an auxiliary loss that encourages the attention for the main question to be similar to that for the sub question.

In this section, we consider how Sub-VQA can be used to improve models that were trained on VQA datasets. Our goal is to reduce the number of possible reasoning or perception failures (M✓ S✗ and M✗ S✓) without hurting the original accuracy of the model.

6.1 Finetuning

The simplest way to incorporate Sub-VQA into a learned model is to finetune the model on it. However, a few precautions are necessary: we make sure that sub-questions always appear on the same batch as the original question, and use the averaged binary cross entropy loss for the main question and the sub question as a loss function. Furthermore, to avoid catastrophic forgetting [12] of the original VQA data during finetuning, we augment every batch with randomly sampled data from the original VQA dataset. In the next section, we compare this approach with finetuning on the same amount of randomly sampled Perception questions from VQAv2.
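The batching precautions above might look roughly like the following. This is a simplified sketch under several assumptions: one Reasoning question per batch, a hypothetical `build_batches` helper, plain strings standing in for training examples, and a fixed number of replayed VQA examples per batch. None of these details are specified in the text.

```python
import random


def build_batches(reasoning_items, vqa_pool, num_random, seed=0):
    """Group each main question with its sub-questions and add replay data.

    `reasoning_items` is a list of dicts with keys "main" and "subs".
    `vqa_pool` is a list of examples from the original VQA data, from
    which `num_random` are sampled per batch to limit forgetting.
    """
    rng = random.Random(seed)
    batches = []
    for item in reasoning_items:
        # The main question and all of its sub-questions stay together.
        batch = [item["main"]] + list(item["subs"])
        # Augment with randomly sampled original VQA examples.
        batch += rng.sample(vqa_pool, min(num_random, len(vqa_pool)))
        batches.append(batch)
    return batches


# Hypothetical placeholders for real (image, question, answer) examples.
items = [{"main": "m1", "subs": ["s1", "s2"]}]
pool = ["v1", "v2", "v3"]
batches = build_batches(items, pool, num_random=2)
```

In practice several Reasoning questions would share a batch; the property that matters is that a main question is never separated from its own sub-questions.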

6.2 Sub-Question Importance-aware Network Tuning (SQuINT)

The intuition behind Sub-Question Importance-aware Network Tuning (SQuINT) is that a model should attend to the same regions in the image when answering the Reasoning questions as it attends to when answering the associated Perception sub-questions, since they capture the visual components required to answer the main question. SQuINT does this by learning how to attend to sub-question regions of interest and reasoning over them to answer the main question. We now describe the losses associated with these steps in detail.

Attention loss - As described in Section 3, the sub-questions in our dataset are simple perception questions asking about well-grounded objects/entities in the image. Current well-performing models based on attention are generally good at visually grounding regions in the image when asked simple Perception questions, given that they are trained on VQA datasets which contain large amounts of Perception questions. In order to make the model look at the associated sub-question regions while answering the main question, we apply a Mean Squared Error (MSE) loss over the spatial and bounding box attention weights.

Cross Entropy loss - While the attention loss encourages the model to look at the right regions given a complex Reasoning question, we need a loss that helps the model learn to reason given the right regions. Hence we apply the regular Binary Cross Entropy loss on top of the answer predicted for the Reasoning question given the sub-question attention. In addition we also use the Binary Cross Entropy loss between the predicted and GT answer for the sub-question.

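Total SQuINT loss - We jointly train with the attention and cross entropy losses. Let $A_{\text{reas}}$ and $A_{\text{sub}}$ be the model attention for the main reasoning question and the associated sub-question, and $gt_{\text{reas}}$ and $gt_{\text{sub}}$ be the ground-truth answers for the main and sub-question respectively. Let $o_{\text{reas} \mid A_{\text{sub}}}$ be the predicted answer for the reasoning question given the attention for the sub-question, and $o_{\text{sub}}$ the predicted answer for the sub-question. The SQuINT loss is formally defined as:

$$L_{\text{SQuINT}} = \text{MSE}(A_{\text{reas}}, A_{\text{sub}}) + \text{BCE}(o_{\text{reas} \mid A_{\text{sub}}}, gt_{\text{reas}}) + \text{BCE}(o_{\text{sub}}, gt_{\text{sub}})$$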

The first term encourages the network to look at the same regions for reasoning and associated perception questions, while the second and third terms encourage it to give the right answers to the questions given the attention regions. The loss is simple and can be applied as a modification to any model that uses attention.
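As a rough illustration of the three terms just described, the sketch below computes an attention MSE term plus two binary cross entropy terms on plain Python lists. It is not the training code: the unweighted sum, the function and argument names, and the clamping constant `eps` are assumptions of this sketch, and a real implementation would operate on tensors inside an autodiff framework.

```python
import math


def mse(a, b):
    """Mean squared error between two equal-length attention vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def bce(pred, target, eps=1e-7):
    """Mean binary cross entropy between predicted scores and 0/1 targets."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(pred)


def squint_loss(att_main, att_sub, pred_main_given_sub_att, gt_main,
                pred_sub, gt_sub):
    """Attention term plus the two answer terms (unweighted; an assumption)."""
    attention_term = mse(att_main, att_sub)
    main_term = bce(pred_main_given_sub_att, gt_main)
    sub_term = bce(pred_sub, gt_sub)
    return attention_term + main_term + sub_term


# Hypothetical two-region attention and two-answer vocabulary.
aligned = squint_loss([0.5, 0.5], [0.5, 0.5], [0.9, 0.1], [1.0, 0.0],
                      [0.8, 0.2], [1.0, 0.0])
misaligned = squint_loss([0.9, 0.1], [0.1, 0.9], [0.9, 0.1], [1.0, 0.0],
                         [0.8, 0.2], [1.0, 0.0])
```

With the answer predictions held fixed, the loss is larger when the main-question attention diverges from the sub-question attention, which is the behaviour the first term is meant to penalise.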

7 Experiments

Column groups: Consistency Metric (M✓ S✓ through Attn Corr); VQA Accuracy (Overall, Reasoning).

| Method | M✓ S✓ | M✓ S✗ | M✗ S✓ | M✗ S✗ | Consistency% | Consistency% (balanced) | Attn Corr | Overall | Reasoning (M✓ S✓ + M✓ S✗) |
| Pythia | 47.42 | 18.57 | 20.70 | 13.31 | 71.86 | 69.57 | 0.71 | 60.26 | 65.99 |
| Data Aug: Pythia + random data | 47.25 | 18.80 | 20.47 | 13.48 | 71.54 | 71.46 | 0.70 | 60.66 | 66.05 |
| Data Aug: Pythia + Sub-VQA data | 52.54 | 13.55 | 22.50 | 11.41 | 79.50 | 75.44 | 0.71 | 60.20 | 66.09 |
| Pythia + SQuINT | 52.96 | 13.55 | 22.04 | 11.45 | 79.63 | 75.80 | 0.78 | 60.12 | 66.51 |


Table 1: Results on held out VQAv2 validation set for (1) Consistency metrics along the four quadrants described in Section 5 and Consistency and Attention Correlation metrics as described in Section 5 (metrics), and (2) Overall and Reasoning accuracy. The Reasoning accuracy is obtained by only looking at the number of times the main question is correct (M✓ S✓ + M✓ S✗).
Figure 6: Qualitative examples showing the model attention before and after applying SQuINT. (a) shows an image along with the reasoning question, ‘Did the giraffe escape from the zoo?’, for which the Pythia model looks at somewhat irrelevant regions and answers “Yes” incorrectly. Note how the same model correctly looks at the fence to answer the easier sub-question, ‘Is the giraffe fenced in?’. After applying SQuINT, which encourages the model to use the perception based sub question attention while answering the reasoning question, the model now looks at the fence and correctly answers the main reasoning question.

In this section, we perform a fine-grained evaluation of VQA reasoning as detailed in Section 5, using the SOTA model Pythia [11] as the base model (although any model that uses visual attention would suffice). We trained the base model on VQAv1 and evaluated the baseline and all variants on the Reasoning split and the corresponding Sub-VQA sub-questions of VQAv2. This allowed us to compare the impact of Sub-VQA Perception questions with the alternative of simply adding more Perception questions from a balanced dataset. Our baseline (Pythia + random data) is therefore the base model finetuned on Perception questions of VQAv2. As detailed in Section 6, Pythia + Sub-VQA data corresponds to finetuning the base model on Sub-VQA sub-questions of VQAv1, while Pythia + SQuINT is trained so that the model attends to the same regions on main questions and their associated sub-questions (again, of VQAv1).

In Table 1, we report the reasoning breakdown detailed in Section 5, along with a few additional metrics. Consistency refers to how often the model answers the sub-question correctly given that it answered the main question correctly, while Consistency (balanced) reports the same metric on a balanced version of the sub-questions (to make sure models are not exploiting biases to gain consistency). Attention Correlation refers to the correlation between the attention embeddings of the main question and the sub-question. Finally, we report Overall accuracy (on the whole evaluation dataset) and accuracy on the Reasoning split (Reasoning accuracy).
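The quadrant breakdown and the Consistency metric can be sketched as follows. This is a minimal illustration, assuming each evaluation item is a main question paired with one sub-question and that per-pair correctness is already known; the function names are hypothetical and not taken from any released code.

```python
def quadrant_breakdown(pairs):
    """Percentage of (main, sub) question pairs falling in each quadrant.

    `pairs` is an iterable of (main_correct, sub_correct) booleans.
    Keys of the result are (main_correct, sub_correct) tuples.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one main/sub-question pair")
    counts = {(True, True): 0, (True, False): 0,
              (False, True): 0, (False, False): 0}
    for main_correct, sub_correct in pairs:
        counts[(bool(main_correct), bool(sub_correct))] += 1
    return {key: 100.0 * value / len(pairs) for key, value in counts.items()}


def consistency(both_right, main_only_right):
    """Share (in %) of correctly answered main questions whose sub-question is also right."""
    main_right = both_right + main_only_right
    if main_right == 0:
        return 0.0
    return 100.0 * both_right / main_right


def reasoning_accuracy(both_right, main_only_right):
    """Main-question accuracy: M✓ S✓ + M✓ S✗."""
    return both_right + main_only_right


# Toy evaluation set with four main/sub-question pairs.
toy = quadrant_breakdown([(True, True), (True, True), (True, False), (False, True)])
toy_consistency = consistency(toy[(True, True)], toy[(True, False)])

# Recomputing the Pythia row from its rounded quadrant values.
pythia_consistency = consistency(47.42, 18.57)
pythia_reasoning = reasoning_accuracy(47.42, 18.57)
```

Applied to the rounded quadrant values of the base Pythia model (47.42 and 18.57), this gives a Consistency of roughly 71.86% and a Reasoning accuracy of 65.99, which agrees with the values reported for that model.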

The results in Table 1 indicate that adding more data (even VQAv2 data) does not improve the base model on our fine-grained evaluation: Consistency stays at 71.54% against 71.86% for the base model. Fine-tuning on Sub-VQA (using data augmentation or SQuINT), on the other hand, raises Consistency to 79.50% and 79.63% respectively, without hurting Overall accuracy or Reasoning accuracy. Correspondingly, our confidence that the model actually learned the necessary concepts when it answers Reasoning questions correctly should increase.

The Attention Correlation numbers (0.78 for SQuINT against 0.71 for the base model) indicate that SQuINT helps the model use the appropriate visual grounding (the same for the main question as for its sub-questions) at test time, even though the model was trained on VQAv1 and evaluated on VQAv2. This effect does not appear when additional VQAv2 data is added (0.70) or with naive finetuning on Sub-VQA (0.71). We present qualitative validation examples in Figure 6, where the base model attends to irrelevant regions when answering the main question (even though it answers correctly), while attending to relevant regions when asked the sub-question. The model finetuned with SQuINT, on the other hand, attends to regions that are actually informative for both main and sub-questions (note that this is at evaluation time, so the model is not aware of the sub-question when answering the main question and vice versa). This is a further indication that SQuINT helps the model reason in ways that will generalize when it answers Reasoning questions correctly, rather than rely on shortcuts.
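One way to compute a correlation between two attention maps is sketched below. The text does not state which correlation measure is used, so this sketch assumes a Pearson correlation over flattened attention weights; the function name and the toy vectors are hypothetical.

```python
import math


def pearson_correlation(attn_a, attn_b):
    """Pearson correlation between two flattened attention vectors (assumed measure)."""
    if len(attn_a) != len(attn_b) or len(attn_a) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_a = sum(attn_a) / len(attn_a)
    mean_b = sum(attn_b) / len(attn_b)
    covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(attn_a, attn_b))
    var_a = sum((x - mean_a) ** 2 for x in attn_a)
    var_b = sum((y - mean_b) ** 2 for y in attn_b)
    if var_a == 0 or var_b == 0:
        # A perfectly uniform attention map has no variance; report no correlation.
        return 0.0
    return covariance / math.sqrt(var_a * var_b)


# Toy attention over four image regions for a main question and its sub-question.
main_attention = [0.1, 0.6, 0.2, 0.1]
sub_attention = [0.1, 0.7, 0.1, 0.1]
score = pearson_correlation(main_attention, sub_attention)
```

A score near 1 means both questions put their attention mass on the same regions. A dataset-level number could then be obtained by averaging this score over all main/sub-question pairs, which is again an assumption rather than a documented procedure.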

8 Discussion and Future Work

The VQA task requires multiple capabilities in different modalities and at different levels of abstraction. We introduced a hard distinction between Perception and Reasoning, which we acknowledge to be a simplification of a continuous and complex reality, albeit a useful one. In particular, linking the perception components that are needed (in addition to other forms of reasoning) to answer reasoning questions opens up an array of possibilities for future work, in addition to improving the evaluation of current work. We proposed preliminary approaches that seem promising: finetuning on Sub-VQA and SQuINT both improve the consistency of the SOTA model with no discernible loss in accuracy, and SQuINT results in qualitatively better attention maps. We expect future work to use Sub-VQA even more explicitly in the modeling approach, similar to current work on explicitly composing visual knowledge to improve visual reasoning [8]. In addition, efforts similar to ours could be employed at different points on the abstraction scale, e.g. further dividing complex Perception questions into simpler components, or further dividing the Reasoning part into different forms of background knowledge, logic, etc. We consider such efforts crucial in the quest to evaluate and train models that truly generalize, and hope Sub-VQA spurs more research in that direction.


References

  • [1] A. Agrawal, D. Batra, D. Parikh, and A. Kembhavi (2018) Don’t just assume; look and answer: overcoming priors for visual question answering. In IEEE Conference on Computer Vision and Pattern Recognition.
  • [2] L. Anne Hendricks, K. Burns, K. Saenko, T. Darrell, and A. Rohrbach (2018) Women also snowboard: overcoming bias in captioning models. In ECCV.
  • [3] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh (2015) VQA: visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2425–2433.
  • [4] A. Das, H. Agrawal, C. L. Zitnick, D. Parikh, and D. Batra (2016) Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions?. In EMNLP.
  • [5] J. A. Fodor and Z. W. Pylyshyn (1988) Connectionism and cognitive architecture: a critical analysis. Cognition 28 (1-2), pp. 3–71.
  • [6] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh (2017) Making the V in VQA matter: elevating the role of image understanding in visual question answering. In CVPR.
  • [7] D. D. Hoffman and W. A. Richards (1984) Parts of recognition. Cognition 18 (1-3), pp. 65–96.
  • [8] D. A. Hudson and C. D. Manning (2018) Compositional attention networks for machine reasoning. arXiv preprint arXiv:1803.03067.
  • [9] D. A. Hudson and C. D. Manning (2019) GQA: a new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6700–6709.
  • [10] D. Huk Park, L. Anne Hendricks, Z. Akata, A. Rohrbach, B. Schiele, T. Darrell, and M. Rohrbach (2018) Multimodal explanations: justifying decisions and pointing to the evidence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8779–8788.
  • [11] Y. Jiang, V. Natarajan, X. Chen, M. Rohrbach, D. Batra, and D. Parikh (2018) Pythia v0.1: the winning entry to the VQA challenge 2018. arXiv preprint arXiv:1807.09956.
  • [12] M. McCloskey and N. J. Cohen (1989) Catastrophic interference in connectionist networks: the sequential learning problem. In Psychology of learning and motivation, Vol. 24, pp. 109–165.
  • [13] T. Qiao, J. Dong, and D. Xu (2018) Exploring human-like attention supervision in visual question answering. In AAAI.
  • [14] M. Ren, R. Kiros, and R. Zemel (2015) Exploring models and data for image question answering. In Advances in Neural Information Processing Systems, pp. 2953–2961.
  • [15] M. T. Ribeiro, C. Guestrin, and S. Singh (2019) Are red roses red? evaluating consistency of question-answering models. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 6174–6184.
  • [16] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, et al. (2017) Grad-CAM: visual explanations from deep networks via gradient-based localization. In ICCV.
  • [17] R. R. Selvaraju, S. Lee, Y. Shen, H. Jin, S. Ghosh, L. Heck, D. Batra, and D. Parikh (2019) Taking a hint: leveraging explanations to make vision and language models more grounded. In The IEEE International Conference on Computer Vision (ICCV).
  • [18] M. Shah, X. Chen, M. Rohrbach, and D. Parikh (2019) Cycle-consistency for robust visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6649–6658.
  • [19] P. Zhang, Y. Goyal, D. Summers-Stay, D. Batra, and D. Parikh (2016) Yin and Yang: balancing and answering binary visual questions. In CVPR.