Deep Reason: A Strong Baseline for Real-World Visual Reasoning

05/24/2019 ∙ by Chenfei Wu, et al. ∙ Microsoft Peking University

This paper presents a strong baseline for real-world visual reasoning (GQA), which achieves 60.93% accuracy on the GQA test split. GQA is a large dataset with 22M questions involving spatial understanding and multi-step inference. To help further research in this area, we identify three crucial components that improve performance: multi-source features, a fine-grained encoder, and a score-weighted ensemble. We provide a series of analyses of their impact on performance.




1 Introduction

Visual Question Answering (VQA) aims to select an answer given an image and a related question [9]. It requires both scene understanding in computer vision and semantic understanding in natural language processing. However, previous VQA datasets [10, 3, 4] are often severely biased and lack semantic compositionality, which makes it hard to diagnose model performance. To address this, the GQA dataset [5] was proposed. It is more balanced and contains 22M questions that require a diverse set of reasoning skills to answer.

In the last few years, several novel approaches to the VQA task have been published, for example relation-based methods [11, 13], attention-based methods [7, 14], and module-based methods [6, 2]. In this work, we use a relatively simple architecture as our baseline with three parts: multi-source features, a fine-grained encoder, and a weighted ensemble. Each part significantly improves performance. Fig. 1 provides an overview of our architecture.

First, we use multi-source features. For images, we use three kinds of features: spatial features, detection features, and bounding box features. For questions, we use both question strings and programs. Second, we use a Bayesian GRU to encode the question instead of a traditional GRU encoder. Third, we use a score-weighted ensemble to combine several models. We perform detailed ablation studies of each component, which shed light on how to build a strong baseline for real-world visual question answering. Our model won sixth place in the 2019 GQA Challenge.

Figure 1: The overall structure of the proposed model for solving the GQA task. We use multi-source features, a fine-grained encoder, and a weighted ensemble to improve performance.

2 Multi-source features

To capture different aspects of the image, we use three types of features: detection features, spatial features, and bounding box features. They are introduced one by one below.

2.1 Multi-source image features

2.1.1 Incorporating better detection features

Better detection features often lead to a better understanding of images. Here we try three kinds of detection features: object features, bottom-up-attention features, and Pythia features. All of them have the same size of 100×2048. The official GQA dataset [5] provides the object features. Bottom-up-attention features were proposed by [1], who won first place in the 2017 VQA Challenge. Pythia features are provided by [12], the winner of the 2018 VQA Challenge.

Models                                        Validation accuracy (%)
Baseline with object features                 62.64
Baseline with bottom-up-attention features    65.44
Baseline with Pythia features                 65.99
Table 1: Comparison of detection features.

As Tab. 1 shows, Pythia features perform slightly better than bottom-up-attention features (65.99 vs. 65.44), and both improve substantially over the object features, with Pythia features gaining about 3% (62.64 to 65.99).

2.1.2 Adding spatial features

Some previous work suggests that spatial features and detection features provide complementary information [8]. In this work, we feed the two kinds of features into two separate pipelines and combine their outputs at the end. As shown in Tab. 2, using both detection and spatial features improves performance by 1%.

Models                                          Validation accuracy (%)
Baseline with detection features                62.64
Baseline with detection and spatial features    63.64
Table 2: Effect of adding spatial features.

2.1.3 Adding bounding box features

One drawback of the attention mechanism is that it ignores the position information of objects. Here, we normalize the coordinates of the center point of each bounding box and use them as positional information. Similarly, we normalize the length and width of the bounding box and use them as size information. As shown in Tab. 3, adding positional information improves performance by about 1.1% (from 62.64 to 63.71). However, the additional gain from size information is small (63.71 to 63.82).

Models                                      Validation accuracy (%)
Baseline                                    62.64
Baseline with position features             63.71
Baseline with position and size features    63.82
Table 3: Effect of adding bounding box features.
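The bounding box features described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `(x_min, y_min, x_max, y_max)` pixel layout of a box, the normalization by image width and height, and the choice to concatenate the four values onto each object's detection feature are all assumptions made for the example.

```python
def box_features(box, image_width, image_height):
    """Return [center_x, center_y, width, height], each scaled to [0, 1].

    Assumes `box` is (x_min, y_min, x_max, y_max) in pixels; this layout and
    the division by the image size are assumptions, not taken from the paper.
    """
    x_min, y_min, x_max, y_max = box
    center_x = (x_min + x_max) / 2 / image_width    # positional information
    center_y = (y_min + y_max) / 2 / image_height
    width = (x_max - x_min) / image_width           # size information
    height = (y_max - y_min) / image_height
    return [center_x, center_y, width, height]


def append_box_features(detection_feature, box, image_width, image_height):
    """Concatenate the four box values onto one object's detection feature."""
    return list(detection_feature) + box_features(box, image_width, image_height)


# One object with a 2048-dimensional detection feature (zeros as a stand-in)
# and a hypothetical box inside a 400x500 image.
feature = append_box_features([0.0] * 2048, (100, 50, 300, 250), 400, 500)
print(len(feature), feature[-4:])
```

Under these assumptions, each of the 100 objects would carry a 2052-dimensional vector, so the attention module can see where an object is and how large it is in addition to what it looks like.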

2.2 Multi-source question features

We represent the semantic meaning of the question in two ways. First, we directly use a GRU to encode the question by feeding in the word embeddings, which yields a question feature. Second, we develop a VQA domain-specific grammar and train a structure prediction model to translate each natural language question into a semantic program that specifies the steps needed to derive the answer. We then use a GRU to encode the program as a program feature. Finally, we concatenate the question feature and the program feature to form the final representation of the question. Adding the program representation yields about a 1% accuracy improvement on the GQA test split.
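To make the two-stream question representation concrete, the sketch below encodes a question string and a program with two separate GRUs and concatenates the final hidden states. It is only an illustration under stated assumptions: `TinyGRU` is a hypothetical, untrained pure-Python GRU with random weights and no bias terms, the program tokens are invented for the example, and the dimensions (8 and 16) are arbitrary. The paper does not specify any of these details.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyGRU:
    """Hypothetical minimal GRU encoder with random, untrained weights."""

    def __init__(self, vocab, embed_dim, hidden_dim, seed):
        rng = random.Random(seed)

        def mat(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.hidden_dim = hidden_dim
        self.embed = {tok: [rng.uniform(-0.1, 0.1) for _ in range(embed_dim)]
                      for tok in sorted(vocab)}
        # One input matrix W and one recurrent matrix U per gate:
        # update gate (z), reset gate (r), candidate state (c).
        self.W = {g: mat(hidden_dim, embed_dim) for g in "zrc"}
        self.U = {g: mat(hidden_dim, hidden_dim) for g in "zrc"}

    def step(self, x, h):
        z = [sigmoid(a + b) for a, b in zip(matvec(self.W["z"], x), matvec(self.U["z"], h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.W["r"], x), matvec(self.U["r"], h))]
        reset_h = [ri * hi for ri, hi in zip(r, h)]
        c = [math.tanh(a + b)
             for a, b in zip(matvec(self.W["c"], x), matvec(self.U["c"], reset_h))]
        # Interpolate between the old state and the candidate state.
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, c)]

    def encode(self, tokens):
        h = [0.0] * self.hidden_dim
        for tok in tokens:
            h = self.step(self.embed[tok], h)
        return h  # final hidden state used as the sequence feature


question = "what color is the cup".split()
program = ["select:cup", "query:color"]  # invented program tokens

question_encoder = TinyGRU(set(question), embed_dim=8, hidden_dim=16, seed=0)
program_encoder = TinyGRU(set(program), embed_dim=8, hidden_dim=16, seed=1)

question_feature = question_encoder.encode(question)
program_feature = program_encoder.encode(program)
final_feature = question_feature + program_feature  # list concatenation
print(len(final_feature))
```

Keeping the two encoders separate lets each one specialize in its own vocabulary (natural-language words versus program operators) before the features are joined.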

3 Fine-grained encoder

Here, we use a Bayesian GRU to encode the question instead of a traditional GRU encoder. Tab. 4 shows the performance of different question encoders. As we can see, using the Bayesian GRU improves performance by about 0.93% over the plain GRU (58.63 to 59.56).

Models                        Validation accuracy (%)
Baseline with GRU             58.63
Baseline with ELMo            59.05
Baseline with Bayesian GRU    59.56
Table 4: Comparison of question encoders.

4 Weighted ensemble

Following earlier work such as [1, 12], we adopt the common practice of ensembling several models to obtain better performance. We choose the best models from all settings above and try different weights when summing their prediction scores. Tab. 5 shows both the average ensembling and the best weighted ensembling results on the test-dev and test splits. This weighted ensembling strategy improves performance by 1.30% over the best single model.

Models               Test-dev (%)    Test (%)
Average Ensemble     80.08           60.73
Weighted Ensemble    81.39           60.93
Table 5: Comparison of ensemble strategies.
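One plausible reading of "trying different weights when summing the prediction scores" is a small grid search over per-model weights on held-out data. The sketch below illustrates that idea only. The function names, the weight grid, and the toy scores are hypothetical, and the paper does not state how its weights were actually chosen.

```python
from itertools import product


def ensemble_scores(model_scores, weights):
    """Weighted sum of per-answer scores from several models for one question."""
    n_answers = len(model_scores[0])
    return [sum(w * scores[a] for w, scores in zip(weights, model_scores))
            for a in range(n_answers)]


def accuracy(all_scores, labels, weights):
    """Fraction of questions whose highest combined score is the true answer."""
    correct = 0
    for model_scores, label in zip(all_scores, labels):
        combined = ensemble_scores(model_scores, weights)
        prediction = max(range(len(combined)), key=combined.__getitem__)
        correct += prediction == label
    return correct / len(labels)


def search_weights(all_scores, labels, grid=(0.25, 0.5, 0.75, 1.0)):
    """Try every weight combination on the grid and keep the most accurate one."""
    n_models = len(all_scores[0])
    best_accuracy, best_weights = -1.0, None
    for weights in product(grid, repeat=n_models):
        current = accuracy(all_scores, labels, weights)
        if current > best_accuracy:
            best_accuracy, best_weights = current, weights
    return best_accuracy, best_weights


# Toy held-out set: 3 questions, 2 models, 3 candidate answers each.
toy_scores = [
    [[0.7, 0.2, 0.1], [0.3, 0.4, 0.3]],
    [[0.2, 0.45, 0.35], [0.1, 0.2, 0.7]],
    [[0.5, 0.3, 0.2], [0.1, 0.6, 0.3]],
]
toy_labels = [0, 2, 0]

average_accuracy = accuracy(toy_scores, toy_labels, (1.0, 1.0))
best_accuracy, best_weights = search_weights(toy_scores, toy_labels)
print(average_accuracy, best_accuracy, best_weights)
```

On this toy data, unequal weights recover a question that the plain average gets wrong, which is the effect Tab. 5 reports at a larger scale. In practice the weights would need to be tuned on a split that is not used for the final evaluation.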


  • [1] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, 2018.
  • [2] Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Learning to compose neural networks for question answering. In NAACL, pages 1545–1554, 2016.
  • [3] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In ICCV, pages 2425–2433, 2015.
  • [4] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In CVPR, 2017.
  • [5] Drew A. Hudson and Christopher D. Manning. GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering. arXiv:1902.09506 [cs], Feb. 2019.
  • [6] J. Johnson, B. Hariharan, L. van der Maaten, J. Hoffman, L. Fei-Fei, C. L. Zitnick, and R. Girshick. Inferring and Executing Programs for Visual Reasoning. In ICCV, pages 3008–3017, Oct. 2017.
  • [7] Jin-Hwa Kim, Kyoung-Woon On, Jeonghee Kim, Jung-Woo Ha, and Byoung-Tak Zhang. Hadamard product for low-rank bilinear pooling. In ICLR, 2017.
  • [8] Pan Lu, Hongsheng Li, Wei Zhang, Jianyong Wang, and Xiaogang Wang. Co-attending Free-form Regions and Detections with Multi-modal Multiplicative Feature Embedding for Visual Question Answering. In AAAI, 2018.
  • [9] Mateusz Malinowski and Mario Fritz. A multi-world approach to question answering about real-world scenes based on uncertain input. In NIPS, pages 1682–1690, 2014.
  • [10] Mengye Ren, Ryan Kiros, and Richard Zemel. Exploring models and data for image question answering. In NIPS, pages 2953–2961, 2015.
  • [11] Adam Santoro, David Raposo, David GT Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. A simple neural network module for relational reasoning. In NIPS, 2017.
  • [12] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards VQA Models That Can Read. arXiv:1904.08920 [cs], Apr. 2019.
  • [13] Chenfei Wu, Jinlai Liu, Xiaojie Wang, and Xuan Dong. Chain of Reasoning for Visual Question Answering. In NIPS, pages 273–283, 2018.
  • [14] Chenfei Wu, Jinlai Liu, Xiaojie Wang, and Xuan Dong. Object-Difference Attention: A Simple Relational Attention for Visual Question Answering. In ACMMM, pages 519–527. ACM, 2018.