leverage visual features extracted from images and text features extracted from questions, and perform classification over these features to obtain answers. Visual and textual features thus serve as the basic components that directly impact the final performance. In this paper, we propose to improve VQA performance by extracting more powerful visual and text features.
For visual features, most existing works [6, 9, 16] adopt the released bottom-up-attention features, whose feature extractor is a Faster R-CNN object detector built upon a ResNet-101 backbone. We instead adopt more powerful backbones (i.e., ResNeXt-101 and ResNeXt-152) to train stronger detectors. Techniques that improve detector accuracy (e.g., FPN and multi-scale training) also help to boost VQA performance.
For text features, we build upon recent state-of-the-art techniques from the NLP community. Large-scale language models such as ELMo, GPT, and BERT have shown excellent results on various NLP tasks at both the token and sentence level. BERT uses a masked language modeling objective to pre-train deep bidirectional representations, allowing each token's representation to fuse both left and right context. For a VQA model to answer a question, the token-level features must carry the question's contextual information so that they can be fused with the visual tokens for reasoning. We therefore adopt BERT as our language model.
Experiments on the VQA 2.0 dataset show the effectiveness of each component of our solution. Our final model achieves 74.89 accuracy on the test-standard split, which won second place in the VQA Challenge 2019.
2 Feature Representation
| Split | Model | FPN dim | Attr | Language | Yes/No | Num | Other | Overall |
|---|---|---|---|---|---|---|---|---|
| test-dev | FaceBook Pythia (ResNeXt-101) | 512 | ✓ | GloVe | 85.56 | 52.68 | 60.87 | 70.11 |
| test-dev | Ensemble (5 models) | - | - | - | 89.65 | 58.53 | 65.27 | 74.55 |
| test-dev | Ensemble (20 models) | - | - | - | 89.81 | 58.89 | 65.39 | 74.71 |
| test-std | Ensemble (20 models) | - | - | - | 89.81 | 58.36 | 65.69 | 74.89 |
2.1 Visual Feature
Dataset. We adopt Visual Genome 1.2 as the object detection dataset. Following the setting in the bottom-up-attention work, we use its object classes and attribute classes as training categories. The dataset is divided into train, val, and test splits; we train detectors on the train split and use the val and test splits as validation sets for parameter tuning.
Detector. We build our detectors on the ResNeXt backbones described above, initialized with parameters pretrained on ImageNet. We use RoIAlign to warp region features into a fixed size and embed them through two fully connected layers. Similar to the bottom-up-attention detector, we attach a classification branch on top of the region features and use attribute annotations as additional supervision. This attribute branch is used only to enhance the representation ability of the features during training, and is discarded in the feature extraction stage.
Feature. Given an image, we first feed it into the trained detector and apply non-maximum suppression (NMS) within each category to remove duplicate boxes. We then keep the boxes with the highest object confidence and extract their FC features. These boxes, together with their features, serve as the representation of the given image.
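The per-class NMS and top-confidence selection above can be sketched as follows. This is a minimal pure-Python sketch; the box format, IoU threshold, and number of kept regions are illustrative, not the paper's exact settings:

```python
def iou(a, b):
    # boxes are (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def per_class_nms(dets, iou_thr=0.5):
    """dets: list of (box, score, class_id). Suppress duplicates per class."""
    kept = []
    for c in {d[2] for d in dets}:
        cls = sorted((d for d in dets if d[2] == c), key=lambda d: -d[1])
        while cls:
            best = cls.pop(0)
            kept.append(best)
            cls = [d for d in cls if iou(best[0], d[0]) < iou_thr]
    return kept

def top_k_regions(dets, k):
    """Keep the k highest-confidence boxes overall as the image representation."""
    return sorted(dets, key=lambda d: -d[1])[:k]
```

In the real pipeline, each kept box would carry its detector FC feature rather than just a score and class id.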
2.2 Language Feature
The BERT model, introduced by Devlin et al., can be seen as a multi-layer bidirectional Transformer encoder. The model consists of embedding layers, Transformer blocks, and self-attention heads, and comes in two sizes. The base model has 12 Transformer blocks, a hidden size of 768, and 12 self-attention heads, for about 110M parameters in total. The large model has 24 Transformer blocks, a hidden size of 1024, and 16 self-attention heads, for about 340M parameters. The model can process a single sentence or a pair of sentences (e.g., [Question, Answer]) as one token sequence. To separate a sentence pair, a special token ([SEP]) is inserted between the two sentences, a learned sentence-A embedding is added to every token of the first sentence, and a sentence-B embedding is added to every token of the second. Since the VQA task has only one sentence, we use only the sentence-A embedding.
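The single-sentence vs. sentence-pair packing described above can be sketched as follows. This is a schematic that follows BERT's standard convention of a leading [CLS] token; the token values are illustrative:

```python
def build_bert_input(tokens_a, tokens_b=None):
    """Assemble a BERT-style token sequence with segment (sentence A/B) ids.
    For VQA there is no second sentence, so every token gets segment id 0."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if tokens_b:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)
    return tokens, segment_ids
```

For a question such as "what color", this yields `["[CLS]", "what", "color", "[SEP]"]` with all-zero segment ids.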
Considering the limited parameter budget of our VQA model, we use base BERT as our language model to extract question features. To get each token's representation, we use only the hidden states of the last attention block over the full sequence. Since pre-trained BERT has proven effective for boosting many natural language processing tasks, we adopt the uncased base BERT pre-trained weights as our initial parameters.
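Selecting token-level question features from the last block's hidden states might look like the following sketch. It uses stand-in 3-dimensional hidden states (real base BERT states are 768-dimensional) and drops the special tokens so that only question tokens are fused with the visual regions:

```python
def token_features(hidden_states, tokens):
    """Keep per-token features from the last Transformer layer,
    dropping BERT's special [CLS]/[SEP] tokens."""
    assert len(hidden_states) == len(tokens)
    return [h for h, t in zip(hidden_states, tokens)
            if t not in ("[CLS]", "[SEP]")]

# stand-in example: 4 tokens with toy 3-dim hidden states
tokens = ["[CLS]", "what", "color", "[SEP]"]
hidden = [[0.1] * 3, [0.2] * 3, [0.3] * 3, [0.4] * 3]
feats = token_features(hidden, tokens)  # features for "what" and "color"
```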
3 VQA Model
In recent years, many VQA models have achieved impressive results. We adopt Bilinear Attention Networks (BAN) as our base model; its eight-glimpse single model attains strong accuracy on the VQA 2.0 test-dev split. BAN uses GloVe embeddings and a GRU as its language model, producing a question feature vector. To improve performance, we replace this language model with base BERT and modify the dimension of BAN's language input features. To train BAN with BERT, we keep all of BAN's settings but adjust the maximum number of epochs and use a cosine learning-rate scheduler. To preserve the pre-trained BERT parameters, we set a smaller learning rate for the BERT module than for the rest of the network.
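A minimal sketch of this two-rate schedule idea follows: the pretrained BERT module gets a smaller base learning rate than the rest of BAN, and both decay with a cosine curve. The base learning rates below are illustrative placeholders, not the values used in the paper:

```python
import math

def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine-decayed learning rate from base_lr down to min_lr."""
    t = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t))

# separate base rates for the pretrained BERT module and the rest of BAN
# (hypothetical values for illustration)
param_groups = [
    {"name": "bert", "base_lr": 1e-5},
    {"name": "ban",  "base_lr": 1e-3},
]

def lrs_at(step, total_steps):
    """Per-group learning rate at a given training step."""
    return {g["name"]: cosine_lr(step, total_steps, g["base_lr"])
            for g in param_groups}
```

In a PyTorch-style trainer, each entry of `param_groups` would map to an optimizer parameter group over the corresponding module's parameters.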
4.1 Ablation Experiments
Table 1 shows our ablation study of each component, including the attribute head, FPN dimension, and language model. Comparing the rows with and without the attribute head, we find that it brings a clear improvement to the final performance, which shows the effectiveness of this module. The rows using BERT show that it stably boosts performance by more than one point. Besides, increasing the FPN dimension and adopting multi-scale training both slightly improve VQA accuracy.
4.2 Comparison with Others
We select BAN trained on the bottom-up-attention features and on the Facebook Pythia features as baselines. Our best single model significantly outperforms all existing state-of-the-art results on the test-dev split. We also ensemble the models we trained by averaging their output probabilities. The 20-model ensemble achieves 74.71 and 74.89 accuracy on the VQA test-dev and test-std splits, respectively. This result won second place in the VQA Challenge 2019.
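Averaging the models' probability outputs can be sketched as follows. This is a toy example with three candidate answers; real models output a distribution over the full answer vocabulary:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def ensemble_predict(model_logits):
    """Average per-model answer probabilities, then take the argmax answer.
    model_logits: one logit vector per model, all over the same answer set."""
    probs = [softmax(l) for l in model_logits]
    n, k = len(probs), len(probs[0])
    avg = [sum(p[i] for p in probs) / n for i in range(k)]
    best = max(range(k), key=avg.__getitem__)
    return best, avg
```

Averaging probabilities (rather than logits) keeps each model's contribution on the same scale regardless of its logit magnitudes.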
We have shown that, for the VQA task, the representation capacity of both visual and textual features is critical to the final performance.
- Anderson, P. et al. (2018) Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, pp. 6077–6086.
- Deng, J. et al. (2009) ImageNet: a large-scale hierarchical image database. In CVPR, pp. 248–255.
- Devlin, J. et al. (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- He, K. et al. (2017) Mask R-CNN. In ICCV, pp. 2961–2969.
- Kim, J.-H. et al. (2018) Bilinear attention networks. In Advances in Neural Information Processing Systems 31, pp. 1571–1581.
- Krishna, R. et al. (2017) Visual Genome: connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123(1), pp. 32–73.
- Lin, T.-Y. et al. (2017) Feature pyramid networks for object detection. In CVPR, pp. 2117–2125.
- Gao, P. et al. (2018) Dynamic fusion with intra- and inter-modality attention flow for visual question answering. arXiv preprint arXiv:1812.05252.
- Peters, M. et al. (2018) Deep contextualized word representations. In NAACL.
- Radford, A. et al. (2018) Improving language understanding with unsupervised learning. Technical report, OpenAI.
- Ren, S. et al. (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99.
- Teney, D. et al. (2018) Tips and tricks for visual question answering: learnings from the 2017 challenge. In CVPR, pp. 4223–4232.
- Vaswani, A. et al. (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
- Xie, S. et al. (2017) Aggregated residual transformations for deep neural networks. In CVPR, pp. 1492–1500.
- Yu, Z. et al. (2018) Beyond bilinear: generalized multimodal factorized high-order pooling for visual question answering. IEEE Transactions on Neural Networks and Learning Systems 29(12), pp. 5947–5959.