Learning Rich Image Region Representation for Visual Question Answering

10/29/2019 ∙ by Bei Liu, et al.

We propose to boost VQA by leveraging more powerful feature extractors, improving the representation ability of both visual and text features, and ensembling models. For visual features, several detection techniques are used to improve the detector. For text features, we adopt BERT as the language model and find that it significantly improves VQA performance. Our solution won second place in the VQA Challenge 2019.




1 Introduction

The task of Visual Question Answering (VQA) requires a model to answer a text question based on an input image. Most works [6, 9, 16] leverage visual features extracted from the image and text features extracted from the question to perform classification to obtain answers. Thus, visual and textual features serve as basic components that directly impact the final performance. In this paper, we propose to improve the performance of VQA by extracting more powerful visual and text features.

For visual features, most existing works [6, 9, 16] adopt bottom-up-attention features released by [1], whose feature extractor is a Faster R-CNN object detector built upon a ResNet-101 backbone. We adopt more powerful backbones (i.e. ResNeXt-101, ResNeXt-152) to train stronger detectors. Some techniques (i.e. FPN, multi-scale training) that are useful to improve the accuracy of detectors can also help to boost the performance of VQA.

For text features, we build upon recent state-of-the-art techniques in the NLP community. Large-scale language models such as ELMo [10], GPT [11] and BERT [4] have shown excellent results on various NLP tasks at both the token and sentence level. BERT uses a masked language model objective to pre-train deep bidirectional representations, allowing each representation to fuse left and right context. In a VQA model, answering the question requires token-level features that carry the question's contextual information, so that they can be fused with the visual tokens for reasoning. We therefore adopt BERT as our language model.

Experiments on the VQA 2.0 dataset show the effectiveness of each component of our solution. Our final model achieves 74.89 accuracy on the test-standard split (Table 1), which won second place in the VQA Challenge 2019.

2 Feature Representation

| Split | Backbone | FPN dim | Attribute | Language | Yes/No | Num | Others | Score |
|---|---|---|---|---|---|---|---|---|
| test-dev | Bottom-up-attention (ResNet-101) | - | | Glove | 85.42 | 54.04 | 60.52 | 70.04 |
| test-dev | FaceBook pythia (ResNeXt-101) | 512 | | Glove | 85.56 | 52.68 | 60.87 | 70.11 |
| test-dev | ResNeXt-101 | 256 | | Glove | 83.1 | 53.0 | 55.62 | 66.64 |
| test-dev | ResNeXt-101 | 256 | | Glove | 85.44 | 54.2 | 60.87 | 70.23 |
| test-dev | ResNeXt-152 | 256 | | Glove | 86.42 | 55.11 | 61.88 | 71.22 |
| test-dev | ResNeXt-152 | 512 | | Glove | 86.59 | 56.44 | 62.06 | 71.53 |
| test-dev | ResNeXt-152 (ms-train) | 256 | | Glove | 86.46 | 56.37 | 62.24 | 71.55 |
| test-dev | ResNeXt-152 (ms-train) | 512 | | Glove | 86.54 | 56.90 | 62.31 | 71.68 |
| test-dev | ResNeXt-152 | 256 | | BERT | 88.00 | 56.28 | 62.90 | 72.48 |
| test-dev | ResNeXt-152 | 512 | | BERT | 88.15 | 56.79 | 62.98 | 72.64 |
| test-dev | ResNeXt-152 (ms-train) | 256 | | BERT | 88.14 | 56.74 | 63.3 | 72.79 |
| test-dev | ResNeXt-152 (ms-train) | 512 | | BERT | 88.18 | 55.35 | 63.16 | 72.58 |
| test-dev | Ensemble (5 models) | - | - | - | 89.65 | 58.53 | 65.27 | 74.55 |
| test-dev | Ensemble (20 models) | - | - | - | 89.81 | 58.89 | 65.39 | 74.71 |
| test-std | Ensemble (20 models) | - | - | - | 89.81 | 58.36 | 65.69 | 74.89 |

Table 1: Experiment results on VQA 2.0 test-dev and test-std splits. We adopt BAN as the VQA model in all settings. The first two rows indicate the results of models we train on released features. “ms-train” means using a multi-scale strategy in detector training.

2.1 Visual Feature

Existing works [2, 13] show that detection features are more powerful than classification features on the VQA task. We therefore train object detectors on a large-scale dataset for feature extraction.

Dataset. We adopt Visual Genome 1.2 [7] as the object detection dataset. Following the setting in [2], we adopt object classes and attribute classes as training categories. The dataset is divided into train, val, and test splits, which contain , and images, respectively. We train detectors on the train split, and use the val and test splits as a validation set to tune parameters.

Detector. We follow the pipeline of Faster R-CNN [12] to build our detectors. We adopt ResNeXt [15] with FPN [8] as the backbone, and use parameters pretrained on ImageNet [3] for initialization. We use RoIAlign [5] to warp region features into a fixed size, and embed them into -dim via two fully connected layers. Similar to [2], we add a classification branch on top of the region feature and use attribute annotations as additional supervision. This attribute branch is used only to enhance the feature representation ability in the training stage, and is discarded in the feature extraction stage.

Feature. Given an image, we first feed it into the trained detector, and apply non-maximum suppression (NMS) on each category to remove duplicate bounding boxes. Then we seek boxes with the highest object confidence, and extract their -dim FC features. These boxes, together with their features, are considered the representation of the given image.
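The selection step described above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the detector code used in our experiments: the function names (`iou`, `nms`, `select_regions`), the IoU threshold of 0.5, and the `top_k` argument are all hypothetical, and greedy NMS is assumed as the suppression rule.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Box], scores: Sequence[float], iou_thresh: float) -> List[int]:
    """Greedy NMS: visit boxes by descending score, keep a box only if it
    does not overlap an already kept box by more than iou_thresh."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep


def select_regions(boxes, scores, labels, top_k, iou_thresh=0.5):
    """Per-category NMS, then keep the top_k surviving boxes by confidence.

    Returns indices into the input lists; the region features at these
    indices would form the image representation.
    """
    kept: List[int] = []
    for c in set(labels):
        idx = [i for i, lab in enumerate(labels) if lab == c]
        kept_c = nms([boxes[i] for i in idx], [scores[i] for i in idx], iou_thresh)
        kept.extend(idx[k] for k in kept_c)
    kept.sort(key=lambda i: scores[i], reverse=True)
    return kept[:top_k]


# Toy example: boxes 0 and 1 share a category and overlap heavily,
# so box 1 is suppressed; boxes 2 and 3 share a category but do not overlap.
boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30), (0, 0, 10, 10)]
scores = [0.9, 0.8, 0.7, 0.6]
labels = [0, 0, 1, 1]
selected = select_regions(boxes, scores, labels, top_k=2)
```

Note that suppression is applied within each category only, so two boxes at the same location can both survive when they carry different labels, as boxes 0 and 3 do before the top-k cut.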

2.2 Language Feature

The BERT model, introduced in [4], can be seen as a multi-layer bidirectional Transformer based on [14]. The model consists of embedding layers, Transformer blocks and self-attention heads, and comes in two sizes. The base model has 12 Transformer blocks, a hidden size of 768, and 12 self-attention heads, for a total of 110M parameters. The large model has 24 Transformer blocks, a hidden size of 1024, and 16 self-attention heads, for a total of 340M parameters. The model can process a single text sentence or a pair of text sentences (i.e., [Question, Answer]) in one token sequence. To separate a pair of sentences, a special token ([SEP]) is added between them, a learned sentence A embedding is added to every token of the first sentence, and a sentence B embedding is added to every token of the second sentence. For the VQA task there is only one sentence, so we use only the sentence A embedding.

Considering that the total number of parameters of the VQA model is less than M, we use the base BERT as our language model to extract question features. To get each token’s representation, we use only the hidden states of the last Transformer block over the full sequence. Since pre-trained BERT models have been shown to be effective for boosting many natural language processing tasks, we adopt the pre-trained base BERT uncased weights as our initial parameters.
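To make the single-sentence input format concrete, the following sketch builds the token sequence, segment ids and attention mask for one question. It is a simplified illustration only: the whitespace split stands in for BERT's WordPiece tokenizer, and the function name `build_bert_inputs` and the `max_len` argument are hypothetical. The leading [CLS] token follows the input convention of [4].

```python
from typing import List, Tuple


def build_bert_inputs(question: str, max_len: int = 16) -> Tuple[List[str], List[int], List[int]]:
    """Build BERT-style inputs for a single sentence (sentence A only).

    Assumption: lower-cased whitespace tokens stand in for WordPiece tokens.
    """
    words = question.lower().split()
    # Reserve two positions for the special [CLS] and [SEP] tokens.
    tokens = ["[CLS]"] + words[: max_len - 2] + ["[SEP]"]
    n_pad = max_len - len(tokens)
    attention_mask = [1] * len(tokens) + [0] * n_pad  # 0 marks padding
    tokens = tokens + ["[PAD]"] * n_pad
    # Only one sentence is present, so every position gets the
    # sentence A segment id (0); the sentence B id (1) is never used.
    segment_ids = [0] * max_len
    return tokens, segment_ids, attention_mask


tokens, segment_ids, attention_mask = build_bert_inputs("What color is the cat", max_len=8)
```

In a full pipeline, the last-block hidden state at each non-padding position would then serve as the token-level question feature passed to the VQA model.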

3 VQA Model

In recent years, many VQA models have achieved surprising results. We adopt Bilinear Attention Networks (BAN) [6] as our base model. The single model with eight glimpses can get on the VQA 2.0 test-dev subset. The BAN model uses Glove and a GRU as the language model, and the language feature is a vector . To improve the VQA model performance, we replace the language model with base BERT and modify the BAN language input feature dimension accordingly. To train BAN with BERT, we use all settings from BAN, but set the max epoch to with a cosine learning rate scheduler. To make use of the pre-trained base BERT parameters, we set the learning rate of the BERT module to .
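The schedule described above, a cosine decay combined with a separate learning rate for the BERT module, can be sketched as follows. The numeric values here (`BAN_LR`, `BERT_LR`, `MAX_EPOCH`) are hypothetical placeholders chosen only for illustration and are not the settings used in our experiments; the helper name `cosine_lr` is likewise an assumption.

```python
import math


def cosine_lr(base_lr: float, epoch: float, max_epoch: float, min_lr: float = 0.0) -> float:
    """Cosine decay from base_lr at epoch 0 to min_lr at max_epoch."""
    progress = min(max(epoch / max_epoch, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical placeholder values, not the paper's settings.
BAN_LR = 1e-3    # learning rate for the BAN modules
BERT_LR = 1e-5   # smaller rate so the pre-trained BERT weights change slowly
MAX_EPOCH = 10

# Both parameter groups follow the same cosine shape, scaled by their own base rate.
schedule = [
    {"epoch": e,
     "ban_lr": cosine_lr(BAN_LR, e, MAX_EPOCH),
     "bert_lr": cosine_lr(BERT_LR, e, MAX_EPOCH)}
    for e in range(MAX_EPOCH + 1)
]
```

Using a much smaller rate for the pre-trained module than for the randomly initialized ones is a common fine-tuning choice, since large updates early in training can erase what pre-training learned.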

4 Experiments

4.1 Ablation Experiments

Table 1 shows our ablation study on each component, including the attribute head, FPN dimension, and language model. Comparing the two ResNeXt-101 rows with 256-dim FPN and Glove, we find that the attribute head raises the overall score from 66.64 to 70.23, which shows the effectiveness of this module. Comparing the ResNeXt-152 rows with Glove against the matching rows with BERT, we find that BERT consistently boosts the score by roughly 1 point (e.g., from 71.22 to 72.48 with 256-dim FPN). Besides, increasing the FPN dimension and adopting multi-scale training can both slightly improve the VQA accuracy.
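The per-setting gain from replacing Glove with BERT can be read off Table 1 directly; the short script below only redoes that subtraction on the ResNeXt-152 test-dev scores. The dictionary layout and variable names are illustrative choices; the numbers are copied from the table.

```python
# Overall test-dev scores from Table 1, keyed by (FPN dim, multi-scale training).
glove = {(256, False): 71.22, (512, False): 71.53, (256, True): 71.55, (512, True): 71.68}
bert = {(256, False): 72.48, (512, False): 72.64, (256, True): 72.79, (512, True): 72.58}

# Gain of BERT over Glove for each otherwise identical detector setting.
gains = {key: round(bert[key] - glove[key], 2) for key in glove}

for (fpn_dim, ms_train), gain in sorted(gains.items()):
    print(f"FPN dim {fpn_dim}, ms-train={ms_train}: +{gain:.2f}")
```

The gains range from 0.90 to 1.26 points, with the smallest gain in the 512-dim multi-scale setting.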

4.2 Comparison with Others

We select BAN trained on Bottom-up-attention and Facebook features as baselines. Our best single model achieves 72.79 accuracy on the test-dev split, which significantly outperforms existing state-of-the-art methods. We also ensemble several of the models we trained by averaging their output probabilities. The ensemble of 20 models achieves 74.71 and 74.89 accuracy on the VQA test-dev and test-std splits, respectively. This result won second place in the VQA Challenge 2019.
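Averaging output probabilities across models can be sketched as below. This is a minimal illustration with made-up probabilities over three candidate answers; the function name `ensemble_average` is hypothetical, and equal weighting of all models is assumed.

```python
from typing import List, Sequence, Tuple


def ensemble_average(prob_lists: Sequence[Sequence[float]]) -> Tuple[List[float], int]:
    """Average per-answer probabilities over models and pick the best answer.

    prob_lists[m][a] is the probability model m assigns to answer a.
    Returns the averaged distribution and the index of its maximum.
    """
    n_models = len(prob_lists)
    n_answers = len(prob_lists[0])
    avg = [sum(p[a] for p in prob_lists) / n_models for a in range(n_answers)]
    best = max(range(n_answers), key=lambda a: avg[a])
    return avg, best


# Made-up outputs of three models over three candidate answers.
probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.6, 0.3],
]
avg, best = ensemble_average(probs)
```

In this toy case the first model alone would choose answer 0, while the averaged distribution favors answer 1, which is how an ensemble can correct the mistake of a single member.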

5 Conclusion

We have shown that, for the VQA task, the representation capacity of both visual and textual features is critical for the final performance.


References

  • [1] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang (2018) Bottom-up and top-down attention for image captioning and visual question answering. In CVPR. Cited by: §1.
  • [2] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang (2018) Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6077–6086. Cited by: §2.1, §2.1, §2.1.
  • [3] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §2.1.
  • [4] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1, §2.2.
  • [5] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §2.1.
  • [6] J. Kim, J. Jun, and B. Zhang (2018) Bilinear Attention Networks. In Advances in Neural Information Processing Systems 31, pp. 1571–1581. Cited by: §1, §1, §3.
  • [7] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. Li, D. A. Shamma, et al. (2017) Visual genome: connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123 (1), pp. 32–73. Cited by: §2.1.
  • [8] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125. Cited by: §2.1.
  • [9] G. Peng, H. Li, H. You, Z. Jiang, P. Lu, S. Hoi, and X. Wang (2018) Dynamic fusion with intra-and inter-modality attention flow for visual question answering. arXiv preprint arXiv:1812.05252. Cited by: §1, §1.
  • [10] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In Proc. of NAACL, Cited by: §1.
  • [11] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding with unsupervised learning. Technical report, OpenAI. Cited by: §1.
  • [12] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §2.1.
  • [13] D. Teney, P. Anderson, X. He, and A. van den Hengel (2018) Tips and tricks for visual question answering: learnings from the 2017 challenge. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4223–4232. Cited by: §2.1.
  • [14] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §2.2.
  • [15] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He (2017) Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1492–1500. Cited by: §2.1.
  • [16] Z. Yu, J. Yu, C. Xiang, J. Fan, and D. Tao (2018) Beyond bilinear: generalized multimodal factorized high-order pooling for visual question answering. IEEE Transactions on Neural Networks and Learning Systems 29 (12), pp. 5947–5959. Cited by: §1, §1.