It is well known that Visual Question Answering (VQA) is challenging: it requires not only accurately extracting semantic knowledge from an image, such as objects, attributes, and relationships, but also a comprehensive ability to reason over the extracted knowledge. Since VQA is thus a two-step task, a natural question is which step matters more. The answer has become clearer as more recent datasets have come out, followed by methods that perform reasonably well on them. Among the earliest, the VQA  and VQA 2.0  datasets directly target question answering on natural images; the latter is more balanced than the former, so its questions carry less statistical bias that could be exploited to produce answers without even looking at the images. The CLEVR dataset  goes a step further by asking complex logical reasoning questions about images of simple shapes, and it is a well-known benchmark for testing a model's ability to reason from questions to answers.
While we have witnessed tremendous success on the CLEVR dataset, where recent methods achieve close-to-perfect performance (98.9% in  and 99.8% in ), the results of these methods on natural images are notably lower. What is the main cause: the intrinsic difficulty of perception on natural images, or the potentially more complex logical processes these images require? This paper provides some observations and insights on this question.
As shown in Figure 1, our model consists of two modules: feature extraction and knowledge reasoning. The latter takes the output of the former, i.e., the extracted knowledge of the given image, as input and directly outputs the answer.
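The two-module pipeline can be summarized by a minimal interface sketch. All function names and the toy knowledge set below are illustrative assumptions, not the paper's implementation:

```python
# Illustrative sketch of the two-module pipeline in Figure 1.
# All names and the toy "detections" are hypothetical, not from the paper.

def extract_knowledge(image):
    """Feature extraction module: detect objects, attributes, relationships."""
    # A real implementation would run trained detectors on the image;
    # here we return a fixed toy knowledge set.
    return {
        "objects": ["man", "horse"],
        "attributes": {"horse": ["brown"]},
        "relationships": [("man", "riding", "horse")],
    }

def reason(knowledge, question):
    """Knowledge reasoning module: answer directly from the knowledge set."""
    # Toy rule: verify whether a detected relationship is mentioned verbatim.
    for subj, pred, obj in knowledge["relationships"]:
        if subj in question and pred in question and obj in question:
            return "yes"
    return "unknown"

answer = reason(extract_knowledge(image=None), "is the man riding the horse")
```

The point of the sketch is only the interface: the reasoning module consumes the extracted knowledge, never the raw image.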
For the feature extraction module, we train separate detectors for object detection, attribute detection, and relationship detection. We use Detectron  for objects and build a similar system with  for attributes and relationships. We use several techniques to alleviate the issues caused by large category spaces: 1) We remove plural forms from the object categories, which increases mAP@50 from 8.42 to 10.08. The intuition is that turning a plural into its singular does not change the semantic meaning but only removes a fine-grained visual distinction, which is of little importance in the GQA dataset. 2) For each output category, if another category is its hypernym, we output that category as well, since a true category always entails that its hypernyms are true. We achieve this by locating the category in the WordNet tree and collecting its ancestor nodes along the upward path to the root, for both object and attribute categories. Note that this step is applied only during testing; during training we still treat hypernyms as distinct categories. 3) For attributes, we separate them into two disjoint groups, adjectives and non-adjectives, because learning the two requires a model to focus on different properties of objects: adjectives usually describe colors, textures, or sizes, while non-adjectives are often about materials or components. The separation is done by checking whether a word has non-empty adjective synsets. 4) For relationship predicates, we separate them into three groups: spatial, interaction, and others. Interaction predicates are filtered out by checking whether a word has non-empty verb synsets, while the other two groups are selected manually. Once the objects, attributes, and relationships are detected, their labels are embedded into 300-D vectors by GloVe. Including the object CNN features provided by the GQA dataset, we have in total four types of features that capture different aspects of the knowledge about an image. For the visual reasoning module, we use Compositional Attention Networks for Machine Reasoning (MAC) 
as the backbone of our reasoning module. We adopt a late fusion strategy: each of the four features is fed into an independent branch with the MAC structure, the output logits of the four branches are summed, and a sigmoid is applied to the sum to obtain probabilities over the dictionary of all answers, which is a common strategy adopted by many VQA systems.
Our ablation compares two reasoning models, MAC and the multi-stream cross-attentional model (MS-CA) from TVQA. The detected features are extracted by our trained object, attribute, and relationship detectors, while the statistical features are obtained by first counting the frequency of each word over all the questions about one image, then keeping only the words whose frequency exceeds a threshold (set to 10 in this paper). The MS-CA model is a strong baseline provided by  for video question answering, and we modify it for our visual-reasoning-based question answering. The purpose of this ablation is to compare the gaps between different features when the model is fixed, and the gaps between different models when the features are fixed. We can see that statistical features lead to results close to those from ground-truth features, which we believe is because some questions provide useful information for other questions, i.e., valuable facts are hidden in many questions. The gap between detected features and statistical features is clearly larger than the gap between statistical features and ground-truth features, indicating the intrinsic difficulty of learning to accurately recognize useful visual knowledge in natural images. Another comparison can be made between the two models when the features are fixed: for each of the three feature types, the gap between the two models is smaller than the gaps between different features under a fixed model, suggesting that the choice of reasoning model may matter less than the choice of features. This supports our claim that the bottleneck for success in visual question answering lies more in inaccurate feature extraction than in a lack of reasoning ability.
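The statistical-feature construction described above can be sketched as follows; the helper name and the whitespace tokenization are our assumptions:

```python
from collections import Counter

def statistical_features(questions, threshold=10):
    """Sketch: count word frequencies over all questions about one image and
    keep only the words whose frequency exceeds the threshold (10 in the paper)."""
    counts = Counter()
    for question in questions:
        counts.update(question.lower().split())
    return {word for word, count in counts.items() if count > threshold}
```

The surviving high-frequency words act as a crude proxy for the image's ground-truth knowledge, which is why these features perform surprisingly close to ground-truth features in the ablation.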
| Features | Val Acc. | Test Acc. |
|---|---|---|
-  S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh. VQA: Visual question answering. In ICCV, 2015.
-  R. Girshick, I. Radosavovic, G. Gkioxari, P. Dollár, and K. He. Detectron. https://github.com/facebookresearch/detectron, 2018.
-  Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In CVPR, 2017.
-  D. A. Hudson and C. D. Manning. Compositional attention networks for machine reasoning. In ICLR, 2018.
-  J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. Lawrence Zitnick, and R. Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In CVPR, 2017.
-  J. Lei, L. Yu, M. Bansal, and T. L. Berg. TVQA: localized, compositional video question answering. In EMNLP, 2018.
-  J. Pennington, R. Socher, and C. Manning. GloVe: Global vectors for word representation. In EMNLP, 2014.
-  K. Yi, J. Wu, C. Gan, A. Torralba, P. Kohli, and J. Tenenbaum. Neural-symbolic VQA: Disentangling reasoning from vision and language understanding. In NIPS, 2018.
-  J. Zhang, K. J. Shih, A. Elgammal, A. Tao, and B. Catanzaro. Graphical contrastive losses for scene graph generation. In CVPR, 2019.