1 Introduction

Visual Question Answering (VQA) aims to select an answer given an image and a related question. Previous VQA datasets [10, 3, 4] are often severely biased and lack semantic compositionality, which makes it hard to diagnose model performance. To address this, the GQA dataset [5] was proposed; it is more balanced and contains 22M questions that require a diverse set of reasoning skills to answer. In recent years, several novel and interesting approaches have been published for the VQA task, for example relation-based methods [11, 13], attention-based methods [7, 14], and module-based methods [6, 2].

In this work, we use a relatively simple architecture as our baseline, with three parts: multi-source features, a fine-grained encoder, and a weighted ensemble. Each part significantly improves performance. Fig. 1 provides an overview of our architecture. First, we use multi-source features. For images, we use three kinds of features: spatial features, detection features, and bounding box features. For questions, we use both question strings and programs. Second, we use a Bayesian GRU instead of a traditional GRU to encode the question. Third, we use a score-weighted ensemble to combine several models. We perform detailed ablation studies of each component, which shed light on how to develop a strong baseline for real-world visual question answering. Our model placed sixth in the 2019 GQA Challenge.
2 Multi-source features
To extract features of various aspects of the image, we use three types of features: detection features, spatial features, and bounding box features. The three features are introduced one-by-one below.
2.1 Multi-source image features
2.1.1 Incorporating better detection features
Better detection features often lead to better image understanding. Here we try three kinds of detection features: object features, bottom-up-attention features, and Pythia features. All of them have the same size of 100 × 2048. The object features are provided by the official GQA dataset [5]. The bottom-up-attention features are proposed by [1], who won first place in the 2017 VQA Challenge. The Pythia features are provided by [12], the winner of the 2018 VQA Challenge.
Table 1: Accuracy with different detection features.

| Model | Accuracy (%) |
|---|---|
| Baseline with object features | 62.64 |
| Baseline with bottom-up-attention features | 65.44 |
| Baseline with Pythia features | 65.99 |
As shown in Tab. 1, Pythia features perform better than bottom-up-attention features, and both yield a significant gain of about 3% over object features.
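All three detection-feature sets are pre-extracted tensors of size 100 × 2048 per image, so switching between them only changes which files the model reads. Below is a minimal PyTorch sketch of how such region features might be loaded and projected into the model's hidden size; the HDF5 layout (a `features` dataset) and the hidden dimension are illustrative assumptions, not the authors' actual data format.

```python
import h5py
import torch
import torch.nn as nn

# Assumed file layout: an HDF5 dataset "features" of shape
# (num_images, 100, 2048) holding pre-extracted region features
# (object, bottom-up-attention, or Pythia -- the interface is the same).
def load_detection_features(path, image_index):
    with h5py.File(path, "r") as f:
        feats = f["features"][image_index]          # (100, 2048) numpy array
    return torch.from_numpy(feats).float()

class DetectionEncoder(nn.Module):
    """Projects the 100 x 2048 region features into the model's hidden size."""
    def __init__(self, in_dim=2048, hidden_dim=512):
        super().__init__()
        self.proj = nn.Linear(in_dim, hidden_dim)

    def forward(self, region_feats):                 # (batch, 100, 2048)
        return torch.relu(self.proj(region_feats))   # (batch, 100, hidden_dim)
```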
2.1.2 Adding spatial features
Some previous work suggests that spatial features and detection features may provide complementary information [8]. In this work, we feed the two kinds of features into two separate pipelines and combine their outputs at the end. As shown in Tab. 2, using both detection and spatial features improves performance by 1%.
Table 2: Effect of adding spatial features.

| Model | Accuracy (%) |
|---|---|
| Baseline with detection features | 62.64 |
| Baseline with detection and spatial features | 63.64 |
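A minimal sketch of this two-pipeline design is given below: detection features and spatial grid features are attended separately with the question representation, and the two attended vectors are concatenated before classification. The dimensions, the attention form, and the classifier head are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class TwoStreamFusion(nn.Module):
    """Separate pipelines for detection features (e.g. 100 x 2048) and spatial
    grid features (e.g. a 7 x 7 grid flattened to 49 x 2048), combined at the
    end. Shapes and the attention mechanism are illustrative only."""
    def __init__(self, num_answers, img_dim=2048, q_dim=512, hidden=512):
        super().__init__()
        # num_answers: size of the answer vocabulary.
        self.det_att = nn.Linear(img_dim + q_dim, 1)
        self.spa_att = nn.Linear(img_dim + q_dim, 1)
        self.classifier = nn.Sequential(
            nn.Linear(2 * img_dim + q_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_answers))

    def attend(self, att, feats, q):
        # feats: (B, N, img_dim); q: (B, q_dim)
        q_exp = q.unsqueeze(1).expand(-1, feats.size(1), -1)
        w = torch.softmax(att(torch.cat([feats, q_exp], dim=-1)), dim=1)
        return (w * feats).sum(dim=1)                 # (B, img_dim)

    def forward(self, det_feats, spa_feats, q):
        v_det = self.attend(self.det_att, det_feats, q)
        v_spa = self.attend(self.spa_att, spa_feats, q)
        return self.classifier(torch.cat([v_det, v_spa, q], dim=-1))
```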
2.1.3 Adding bounding box features
One drawback of the attention method is that it ignores the position information of the objects. Here, we normalize the coordinates of the center point of each bounding box as positional information. Similarly, we normalize the width and height of each bounding box as size information. As shown in Tab. 3, using positional information improves performance by 1.13%. However, the gain from size information is not significant.
Table 3: Effect of bounding box features.

| Model | Accuracy (%) |
|---|---|
| Baseline with position features | 63.71 |
| Baseline with position and size features | 63.82 |
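A sketch of how these bounding box features could be computed is shown below; the input box format (x1, y1, x2, y2 in pixels) is an assumption. The resulting 2-d or 4-d vectors can then be concatenated with the per-region detection features before attention.

```python
import torch

def box_position_features(boxes, image_w, image_h, use_size=True):
    """boxes: (N, 4) tensor of (x1, y1, x2, y2) pixel coordinates.
    Returns normalized center coordinates, optionally with normalized
    width and height, as extra per-region features."""
    x1, y1, x2, y2 = boxes.unbind(dim=-1)
    cx = (x1 + x2) / 2.0 / image_w        # normalized center x in [0, 1]
    cy = (y1 + y2) / 2.0 / image_h        # normalized center y in [0, 1]
    feats = [cx, cy]
    if use_size:
        feats += [(x2 - x1) / image_w,    # normalized width
                  (y2 - y1) / image_h]    # normalized height
    return torch.stack(feats, dim=-1)     # (N, 2) or (N, 4)
```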
2.2 Multi-source question features
We use two different ways to represent the semantic meaning of the question. First, we directly use a GRU to encode the question by feeding in the embeddings of its words. Second, we develop a VQA domain-specific grammar and train a structure prediction model that translates each natural language question into a semantic program describing the steps needed to derive the answer; we then use another GRU to encode this program. Finally, we concatenate the two encodings to form the final representation of the question. Adding the program representation yields about 1% accuracy improvement on the GQA test split.
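The sketch below shows one way the two question representations could be produced and combined: one GRU over the word embeddings, a second GRU over the tokens of the predicted program, and a concatenation of the final hidden states. Vocabulary sizes and dimensions are placeholders, and the program predictor itself is not shown.

```python
import torch
import torch.nn as nn

class QuestionProgramEncoder(nn.Module):
    """Encodes the question words and the predicted program tokens with two
    GRUs and concatenates the final hidden states."""
    def __init__(self, q_vocab=20000, p_vocab=2000, emb=300, hidden=512):
        super().__init__()
        self.q_emb = nn.Embedding(q_vocab, emb)
        self.p_emb = nn.Embedding(p_vocab, emb)
        self.q_gru = nn.GRU(emb, hidden, batch_first=True)
        self.p_gru = nn.GRU(emb, hidden, batch_first=True)

    def forward(self, q_tokens, p_tokens):
        # q_tokens: (B, Lq) word ids; p_tokens: (B, Lp) program-token ids
        _, hq = self.q_gru(self.q_emb(q_tokens))    # hq: (1, B, hidden)
        _, hp = self.p_gru(self.p_emb(p_tokens))    # hp: (1, B, hidden)
        return torch.cat([hq.squeeze(0), hp.squeeze(0)], dim=-1)  # (B, 2*hidden)
```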
3 Fine-grained encoder
Here, we use a Bayesian GRU to encode the question instead of a traditional GRU encoder. Tab. 4 shows the performance of different question encoders. As we can see, using the Bayesian GRU improves performance by about 0.93%; a sketch of this encoder follows Tab. 4.
Table 4: Comparison of question encoders.

| Model | Accuracy (%) |
|---|---|
| Baseline with GRU | 58.63 |
| Baseline with ELMo | 59.05 |
| Baseline with Bayesian GRU | 59.56 |
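The text does not detail the Bayesian GRU; one common construction is a GRU cell unrolled with variational dropout, i.e. a single dropout mask sampled per sequence and reused at every timestep on the inputs and hidden state (in the style of Gal and Ghahramani). The sketch below follows that reading and is not necessarily the authors' exact formulation.

```python
import torch
import torch.nn as nn

class VariationalGRU(nn.Module):
    """GRUCell unrolled with the *same* dropout masks reused at every
    timestep (variational RNN dropout); an assumed stand-in for the
    Bayesian GRU described in the text."""
    def __init__(self, input_size, hidden_size, dropout=0.25):
        super().__init__()
        self.cell = nn.GRUCell(input_size, hidden_size)
        self.hidden_size = hidden_size
        self.dropout = dropout

    def forward(self, x):                   # x: (B, T, input_size)
        B, T, _ = x.shape
        h = x.new_zeros(B, self.hidden_size)
        keep = 1.0 - self.dropout
        if self.training:
            # One mask per sequence, shared across all timesteps.
            mask_x = x.new_empty(B, x.size(-1)).bernoulli_(keep) / keep
            mask_h = x.new_empty(B, self.hidden_size).bernoulli_(keep) / keep
        else:
            mask_x = mask_h = 1.0
        for t in range(T):
            h = self.cell(x[:, t] * mask_x, h * mask_h)
        return h                            # final hidden state (B, hidden_size)
```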
4 Weighted ensemble
Following early works such as [1, 12], we use the common practice of ensembling several models to obtain better performance. We choose the best models from all the settings above and try different weights when summing their prediction scores. Both the average-ensemble and the best weighted-ensemble results on the test-dev and test splits are shown in Tab. 5. This weighted ensembling strategy improves performance by 1.30% over the best single model.
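A minimal sketch of score-weighted ensembling is given below: the prediction scores of each model are scaled by a per-model weight, summed, and the argmax is taken as the answer; the weights are chosen on a held-out split. The grid search shown here is a simple stand-in for whatever weight tuning was actually used.

```python
import itertools
import torch

def weighted_ensemble(score_list, weights):
    """score_list: list of (N, num_answers) prediction-score tensors,
    one per model; weights: one scalar per model."""
    total = sum(w * s for w, s in zip(weights, score_list))
    return total.argmax(dim=-1)             # predicted answer indices

def search_weights(score_list, labels, grid=(0.5, 1.0, 1.5, 2.0)):
    """Brute-force weight search on a validation split."""
    best_acc, best_w = -1.0, None
    for w in itertools.product(grid, repeat=len(score_list)):
        acc = (weighted_ensemble(score_list, w) == labels).float().mean().item()
        if acc > best_acc:
            best_acc, best_w = acc, w
    return best_w, best_acc
```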
References

-  Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, volume 3, page 6, 2018.
-  Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Learning to compose neural networks for question answering. In NAACL, pages 1545–1554, 2016.
-  Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In ICCV, pages 2425–2433, 2015.
-  Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In CVPR, volume 1, page 9, 2017.
-  Drew A. Hudson and Christopher D. Manning. GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering. arXiv:1902.09506 [cs], Feb. 2019.
-  J. Johnson, B. Hariharan, L. van der Maaten, J. Hoffman, L. Fei-Fei, C. L. Zitnick, and R. Girshick. Inferring and Executing Programs for Visual Reasoning. In ICCV, pages 3008–3017, Oct. 2017.
-  Jin-Hwa Kim, Kyoung-Woon On, Jeonghee Kim, Jung-Woo Ha, and Byoung-Tak Zhang. Hadamard product for low-rank bilinear pooling. In ICLR, 2017.
-  Pan Lu, Hongsheng Li, Wei Zhang, Jianyong Wang, and Xiaogang Wang. Co-attending Free-form Regions and Detections with Multi-modal Multiplicative Feature Embedding for Visual Question Answering. In AAAI, 2018.
-  Mateusz Malinowski and Mario Fritz. A multi-world approach to question answering about real-world scenes based on uncertain input. In NIPS, pages 1682–1690, 2014.
-  Mengye Ren, Ryan Kiros, and Richard Zemel. Exploring models and data for image question answering. In NIPS, pages 2953–2961, 2015.
-  Adam Santoro, David Raposo, David GT Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. A simple neural network module for relational reasoning. In NIPS, 2017.
-  Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards VQA Models That Can Read. arXiv:1904.08920 [cs], Apr. 2019.
-  Chenfei Wu, Jinlai Liu, Xiaojie Wang, and Xuan Dong. Chain of Reasoning for Visual Question Answering. In NIPS, pages 273–283, 2018.
-  Chenfei Wu, Jinlai Liu, Xiaojie Wang, and Xuan Dong. Object-Difference Attention: A Simple Relational Attention for Visual Question Answering. In ACMMM, pages 519–527. ACM, 2018.