Top 5 predictions from our model and their probabilities for an example image/question pair. On the right we visualize the corresponding attention distribution produced by the model.
Deep neural networks have made a dramatic impact on computer vision and natural language processing in the last few years. We are now able to build models that recognize objects in images with high accuracy [15, 26, 9]. But we are still far from human-level understanding of images. When we as humans look at images, we don’t just see objects; we also understand how objects interact, and we can tell their state and properties. Visual question answering (VQA) is particularly interesting because it allows us to understand what our models truly see. We present the model with an image and a question in natural language, and the model generates an answer, again in natural language.
A related and more thoroughly researched task is image caption generation [31, 28], where the goal is to generate a representative description of an image in natural language. A clear advantage of VQA over caption generation is that evaluating a VQA model is much easier. There is no unique caption that describes an image. Moreover, it is rather easy to come up with a single caption that more or less holds for a large collection of images, so a generic caption gives no way to tell what the model actually understands from the image. Some previous work has tried to mitigate this problem by providing dense or unambiguous captions, but the problem is inherently less severe for the VQA task: it is always possible to ask very narrow questions that force the model to give a specific answer. For these reasons we believe VQA is a good proxy task for creating rich representations for modeling language and vision.
Some novel and interesting approaches to visual question answering [6, 22] have been published in the last few years and have shown promising results. However, in this work we show that a relatively simple architecture (compared to recent work), when trained carefully, beats the state of the art. Figure 2 provides a high-level overview of our model. To summarize, our proposed model uses long short-term memory (LSTM) units to encode the question and a deep residual network to compute the image features. A soft attention mechanism similar to prior work is used to compute multiple glimpses of image features based on the state of the LSTM. A classifier then takes the image feature glimpses and the final state of the LSTM as input to produce probabilities over a fixed set of most frequent answers. On the VQA 1.0 open-ended challenge, our model achieves 64.6% accuracy on the test-standard set without using additional data, an improvement of 0.4% over the state of the art, and on the newly released VQA 2.0, our model scores 59.7% on the validation set, outperforming the best reported results by 0.5%.
This paper proves once again that when it comes to training neural networks, the devil is in the details.
2 Related work
In this section we provide an overview of related work.
Krizhevsky et al. showed for the first time great success in applying a deep CNN to the large-scale ImageNet dataset, achieving a dramatic improvement over state-of-the-art methods that used hand-designed features. In recent years researchers have been hard at work training deeper, very deep, and even deeper neural networks. While the success of neural networks is commonly attributed to larger datasets and more compute power, there are many details that we know and consider now that were not known just a few years ago. These include the choice of activation function, initialization, optimizer, and regularization. As we show in this paper, at times getting the details right is more important than the actual architecture.
When it comes to the design of deep neural networks, very few ideas have been consistently found advantageous across different domains. One of these ideas is the notion of attention [20, 28], which enables deep neural networks to extract localized features from input data.
Another neural network model that we take advantage of in this work is long short-term memory (LSTM). LSTMs have been widely adopted by machine learning researchers in recent years and have shown outstanding results on a wide range of problems, from machine translation to speech recognition.
All of these ideas have already been applied to the visual question answering task. In fact, the model that we describe in this work is very similar to stacked attention networks; nevertheless, we show a significant improvement over their result on the VQA 1.0 dataset. While much more complex and expensive attention models have been explored more recently [6, 22, 18], their advantage is unclear in light of the results reported in this paper.
Figure 2 shows an overview of our model. In this section we formalize the problem and explain our approach in more detail.
We treat the visual question answering task as a classification problem. Given an image $I$ and a question $q$ in the form of natural language, we want to estimate the most likely answer $\hat{a}$ from a fixed set of answers based on the content of the image:
$$\hat{a} = \arg\max_{a} P(a \mid I, q)$$
where $a \in \{a_1, a_2, \ldots, a_M\}$. The answers are chosen to be the most frequent answers from the training set.
3.1 Image embedding
We use a pretrained convolutional neural network (CNN) model based on the residual network architecture to compute a high-level representation of the input image.
This representation is a three-dimensional tensor taken from the last layer of the residual network before the final pooling layer. We furthermore normalize the image features along the depth (last) dimension, which improves learning dynamics.
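As a rough illustration of normalizing features along the depth dimension, the sketch below rescales the feature vector of each spatial cell to unit Euclidean length. The use of the L2 norm, the function name `normalize_depth`, and the `eps` guard are assumptions made for this example, not details confirmed by the text above.

```python
import math

def normalize_depth(features, eps=1e-12):
    """Scale every depth vector of an H x W x D nested list to unit L2 norm.

    Assumption: the L2 norm is used; `eps` guards against division by zero.
    """
    normalized = []
    for row in features:
        new_row = []
        for cell in row:
            norm = math.sqrt(sum(v * v for v in cell))
            new_row.append([v / max(norm, eps) for v in cell])
        normalized.append(new_row)
    return normalized

# Toy 1 x 2 x 2 feature map standing in for the CNN output.
toy_features = [[[3.0, 4.0], [0.0, 2.0]]]
toy_normalized = normalize_depth(toy_features)
```

After this step every spatial cell has the same overall magnitude, so only the direction of the feature vector carries information into the attention and classifier layers.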
3.2 Question embedding
We tokenize a given question and encode it as a sequence of word embeddings, one fixed-length distributed representation per word in the question. The embeddings are then fed to a long short-term memory (LSTM) network.
We use the final state of the LSTM to represent the question.
3.3 Stacked attention
Similar to prior work, we compute multiple attention distributions over the spatial dimensions of the image features.
Each image feature glimpse is the weighted average of the image features over all spatial locations. The attention weights are normalized separately for each glimpse.
In practice the attention function is modeled with two layers of convolution. Consequently, the glimpses share parameters in the first layer. We rely solely on different initializations to produce diverse attention distributions.
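The glimpse computation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the attention scores would come from the two convolution layers, but here they are passed in directly, and the names `softmax` and `attention_glimpses` are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def attention_glimpses(features, logits_per_glimpse):
    """Return one weighted-average feature vector per glimpse.

    features: list of S feature vectors (spatial grid flattened), each of length D.
    logits_per_glimpse: list of G lists, each holding S unnormalized scores.
    """
    depth = len(features[0])
    glimpses = []
    for logits in logits_per_glimpse:
        weights = softmax(logits)  # normalized separately for each glimpse
        glimpse = [
            sum(w * f[d] for w, f in zip(weights, features))
            for d in range(depth)
        ]
        glimpses.append(glimpse)
    return glimpses

# Three spatial locations with two-dimensional features, two glimpses.
toy_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_logits = [[0.0, 0.0, 0.0], [10.0, -10.0, -10.0]]
toy_glimpses = attention_glimpses(toy_features, toy_logits)
```

In the toy input the first glimpse has uniform scores and therefore returns the plain spatial average, while the second concentrates almost all of its weight on the first location.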
Finally, we concatenate the image glimpses with the LSTM state and apply nonlinearities to produce probabilities over answer classes.
In practice this classifier is modeled with two fully connected layers.
Our final loss is the negative log-likelihood of the correct answers.
Note that we average the log-likelihoods over all the correct answers for a question.
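A minimal sketch of such a loss for a single question follows, assuming the classifier output is available as a mapping from answer strings to probabilities. The function name `vqa_loss` and the dictionary representation are hypothetical. Answers outside the classifier's vocabulary are skipped, in line with the statement elsewhere in the paper that such answers do not contribute to the loss.

```python
import math

def vqa_loss(probs, correct_answers):
    """Average negative log-likelihood over all correct answers of one question.

    probs: dict mapping answer string -> predicted probability.
    correct_answers: list of annotator answers (repeats allowed).
    Answers missing from `probs` are ignored.
    """
    terms = [-math.log(probs[a]) for a in correct_answers if a in probs]
    if not terms:
        return 0.0  # no in-vocabulary answer, so nothing to learn from
    return sum(terms) / len(terms)

toy_probs = {"yes": 0.5, "no": 0.25, "2": 0.25}
toy_loss = vqa_loss(toy_probs, ["yes", "yes", "no", "maybe"])
```

Because repeated answers appear several times in `correct_answers`, an answer given by many annotators is weighted more heavily than one given by a single annotator.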
4.1.1 VQA 1.0
We evaluate our model on both the balanced and unbalanced versions of the VQA dataset. VQA 1.0 consists of 204,721 images from the MS COCO dataset. We evaluate our models on the real open-ended challenge, which consists of 614,163 questions and 6,141,630 answers. The dataset comes with predefined train, validation, and test splits. There is also a 25% subset of the test set, referred to as the test-dev split. For most experiments we used the train set as training data and reported results on the validation set. To be comparable to prior work, we additionally train our default model on the train and validation sets and report results on the test set.
4.1.2 VQA 2.0
We also evaluate our model on the more recent VQA 2.0, which consists of 658,111 questions and 6,581,110 answers. This version of the dataset is more balanced than VQA 1.0. Specifically, for every question there are two images in the dataset that result in two different answers to the question. At this point only the train and validation sets are available. We report results on the validation set after training on the train set.
4.2 Evaluation metric
We evaluate our models on the open-ended task of the VQA challenge with the provided accuracy metric:
$$\mathrm{Acc}(a) = \frac{1}{K} \sum_{k=1}^{K} \min\left(\frac{\sum_{1 \le j \le K,\, j \ne k} \mathbb{1}(a = a_j)}{3},\, 1\right)$$
where $a_1, a_2, \ldots, a_K$ are the correct answers provided by the annotators and $K = 10$. Intuitively, we consider an answer correct if at least three annotators agree on it. To get some level of robustness, we compute the accuracy over all 10-choose-9 subsets of the ground-truth answers and average the results.
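The metric can be sketched as follows for a single question. The function name `vqa_accuracy` is hypothetical, and the answers are assumed to be already normalized strings; any answer preprocessing done by the official evaluation script is outside this sketch.

```python
from itertools import combinations

def vqa_accuracy(predicted, ground_truth):
    """Average min(matches / 3, 1) over all leave-one-out subsets.

    ground_truth: list of annotator answers (10 per question in VQA).
    """
    subset_size = len(ground_truth) - 1
    scores = [
        min(subset.count(predicted) / 3.0, 1.0)
        for subset in combinations(ground_truth, subset_size)
    ]
    return sum(scores) / len(scores)

# Two annotators said "yes", eight said "no".
toy_answers = ["yes"] * 2 + ["no"] * 8
acc_yes = vqa_accuracy("yes", toy_answers)
acc_no = vqa_accuracy("no", toy_answers)
```

One consequence of the subset averaging is that an answer given by exactly three annotators does not receive full credit: in the three subsets that leave out one of those annotators only two matches remain, so the score is slightly below 1.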
|No dropout on FC/Conv layers||32.78||38.19||49.02||58.63||57.90||57.42||57.30||56.98|
|No dropout on LSTM layers||45.85||51.55||55.75||57.63||59.60||59.79||59.95||59.80|
|With positional features||33.26||41.37||55.36||57.95||59.75||60.44||61.02||61.09|
|Word embedding size: 100||39.53||50.24||53.94||56.74||58.92||59.96||60.75||60.90|
|Word embedding size: 300 (default)||37.16||46.96||55.07||58.12||59.76||60.65||60.94||60.95|
|Word embedding size: 500||37.21||47.15||55.44||58.43||59.98||60.60||61.01||61.04|
|LSTM state size: 512||46.59||51.20||55.33||57.96||59.46||60.31||60.79||61.09|
|LSTM state size: 1024 (default)||37.16||46.96||55.07||58.12||59.76||60.65||60.94||60.95|
|LSTM state size: 2048||33.24||39.11||50.86||57.48||59.75||60.65||60.93||60.80|
|LSTM state size: 1024 1024||37.78||48.19||54.28||57.20||59.34||60.22||60.62||60.75|
|Attention size: 512 1||36.54||45.74||54.23||57.42||59.46||60.22||60.85||60.96|
|Attention size: 512 2 (default)||37.16||46.96||55.07||58.12||59.76||60.65||60.94||60.95|
|Attention size: 512 3||36.26||45.16||55.22||57.96||59.77||60.60||60.87||61.12|
|Attention size: 1024 1||45.60||50.72||54.61||57.57||59.52||60.46||60.92||60.92|
|Attention size: 1024 2||35.04||42.72||55.56||58.03||59.66||60.54||61.14||61.10|
|Classifier size: 3000||30.19||43.12||53.38||56.18||57.82||58.25||58.24||58.12|
|Classifier size: 1024 3000 (default)||37.16||46.96||55.07||58.12||59.76||60.65||60.94||60.95|
|Classifier size: 2048 3000||48.28||52.57||56.02||58.51||59.96||60.46||60.84||60.95|
|Classifier size: 1024 1024 3000||44.51||49.53||53.25||55.95||57.59||58.83||60.05||60.66|
In this section we describe the details of our default baseline as well as its mutations.
In all of the baselines, input images are scaled while preserving aspect ratio and then center cropped to a fixed size. We found that stretching the image harms the performance of the model. Image features are extracted from a pretrained 152-layer ResNet model. We take the last layer before the average pooling layer and normalize it along the depth dimension.
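The geometry of this preprocessing step can be sketched as below. The sketch assumes that the shorter image side is scaled to the target size before cropping, which is one common way to preserve aspect ratio; the function name `resize_and_crop_box` and the target size of 224 in the example are purely illustrative and not taken from the paper.

```python
def resize_and_crop_box(width, height, target):
    """Return the resized dimensions and the center-crop box.

    The shorter side is scaled to `target` (an assumption), so the aspect
    ratio is preserved and nothing is stretched. The crop box is given as
    (left, top, right, bottom) in resized-image coordinates.
    """
    scale = target / min(width, height)
    new_w = max(target, round(width * scale))
    new_h = max(target, round(height * scale))
    left = (new_w - target) // 2
    top = (new_h - target) // 2
    return (new_w, new_h), (left, top, left + target, top + target)

# Illustrative values only: a 640 x 480 image and a 224 x 224 crop.
toy_size, toy_box = resize_and_crop_box(640, 480, 224)
```

The resulting sizes and box can be handed to any image library for the actual resampling and cropping.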
The input question is tokenized and embedded to a 300-dimensional vector. The embeddings are passed through a nonlinearity before being fed to the LSTM. The state size of the LSTM layer is set to 1024. Per-example dynamic unrolling is used to allow for questions of different lengths, although we cap the maximum length of the questions.
To compute attention over image features, we concatenate the tiled LSTM state with the image features along the depth dimension and pass the result through a convolution layer of depth 512 followed by a ReLU nonlinearity. The output feature is passed through another convolution of depth 2 followed by a softmax over the spatial dimensions to compute the attention distributions. We use these distributions to compute two image glimpses as weighted averages of the image features.
We further concatenate the image glimpses with the state of the LSTM and pass the result through a fully connected layer of size 1024 with ReLU nonlinearity. The output is fed to a linear layer of size 3000 followed by a softmax to produce probabilities over the most frequent classes.
We only consider top most frequent answers in our classifier. Other answers are ignored and do not contribute to the loss during training. This covers of the answers in the validation set in VQA dataset .
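Building such an answer vocabulary and measuring its coverage can be sketched with the standard library. The function names `build_answer_vocab` and `coverage`, and the toy data, are hypothetical; the real pipeline would pass the training-set answers and the chosen vocabulary size.

```python
from collections import Counter

def build_answer_vocab(train_answers, top_k):
    """Map the `top_k` most frequent training answers to class indices."""
    counts = Counter(train_answers)
    return {
        answer: index
        for index, (answer, _) in enumerate(counts.most_common(top_k))
    }

def coverage(vocab, answers):
    """Fraction of `answers` that fall inside the vocabulary."""
    return sum(a in vocab for a in answers) / len(answers)

toy_train = ["yes"] * 5 + ["no"] * 3 + ["2"] * 2 + ["red"]
toy_vocab = build_answer_vocab(toy_train, top_k=2)
toy_coverage = coverage(toy_vocab, ["yes", "no", "red", "yes"])
```

Answers that are missing from the vocabulary have no class index, which is why they cannot contribute to the classification loss.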
We apply dropout to the input features of all layers, including the LSTM, convolutions, and fully connected layers.
We optimize this model with the Adam optimizer for a fixed number of steps and a fixed batch size. The learning rate is decreased gradually with exponential decay, governed by an initial learning rate and a decay-steps parameter.
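One common form of exponential learning-rate decay is sketched below. The exact schedule and hyperparameter values used in the paper are not recoverable from the text here, so the functional form, the default `decay_rate` of 0.5, and the numbers in the example are all assumptions chosen for illustration.

```python
def decayed_learning_rate(initial_lr, step, decay_steps, decay_rate=0.5):
    """Continuous exponential decay of the learning rate.

    The rate is multiplied by `decay_rate` once every `decay_steps` steps.
    All default and example values are illustrative assumptions.
    """
    return initial_lr * decay_rate ** (step / decay_steps)

# Illustrative schedule: start at 0.1 and halve every 1000 steps.
toy_schedule = [decayed_learning_rate(0.1, s, 1000) for s in (0, 1000, 2000)]
```

With these example settings the learning rate shrinks smoothly rather than in discrete jumps, because the exponent is a real-valued fraction of the decay period.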
During training the CNN parameters are kept fixed. The rest of the parameters are initialized as suggested by Glorot et al.
No norm: ResNet features are not normalized.
No dropout on FC/Conv: Dropout is not applied to the inputs of fully connected and convolution layers.
No dropout on LSTM: Dropout is not applied to the inputs of LSTM layers.
No attention: Instead of using soft-attention we perform average spatial pooling before feeding image features to the classifier.
Sampled loss: Instead of averaging the log-likelihood of correct answers we sample one answer at a time.
With positional features: Image features are augmented with the x and y coordinates of each cell along the depth dimension.
Bidirectional LSTM: We use a bidirectional LSTM to encode the question.
Word embedding size: We try word embeddings of different sizes, including 100, 300 (default), and 500.
LSTM state size: We explore different configurations of LSTM state sizes. These include a one-layer LSTM of size 512, 1024 (default), or 2048, and a stacked two-layer LSTM of size 1024.
Attention size: Different attention configurations are explored. The first number indicates the size of the first convolution layer and the second number indicates the number of attention glimpses.
Classifier size: By default the classifier consists of a fully connected layer of size 1024 with ReLU nonlinearity followed by a 3000-dimensional linear layer followed by softmax. We explore shallower, deeper, and wider alternatives.
Normalization of image features improved learning dynamics, leading to significantly better accuracy while reducing the training time.
We observed that applying dropout on multiple layers (including fully connected layers, convolutions, and LSTMs) is crucial to avoid over-fitting on this dataset.
As widely reported we confirm that using soft-attention significantly improves the accuracy of the model.
Different word embedding sizes and LSTM configurations were explored, but we found them not to be a major factor. A larger embedding size with a smaller LSTM seemed to work best.
Some previous works used the sampling loss, which we found leads to significantly worse results and longer training time.
Contrary to previously reported results, we found that using stacked attentions only marginally improves the result.
We found a two-layer classifier to be significantly better than a single layer. Adding more layers or increasing the width did not seem to improve the results.
|VQA team ||80.5||36.8||43.1||57.8||80.6||36.5||43.7||58.2|
|SAN (VGG) ||79.3||36.6||46.1||58.7||-||-||-||58.9|
|NMN (VGG) ||81.2||38.0||44.0||58.6||-||-||-||58.7|
|ACK (VGG) ||81.0||38.4||45.2||59.2||81.1||37.1||45.8||59.4|
|DMN+ (VGG) ||80.5||36.8||48.3||60.3||-||-||-||60.4|
|MRN (ResNet) ||82.3||38.8||49.3||61.7||82.4||38.2||49.4||61.8|
|HieCoAtt (ResNet) ||79.7||38.7||51.7||61.8||-||-||-||62.1|
|DAN (VGG) ||82.1||38.2||50.2||62.0||-||-||-||-|
|RAU (ResNet) ||81.9||39.0||53.0||63.3||81.7||38.2||52.8||63.2|
|MCB (ResNet) ||82.2||37.7||54.8||64.2||-||-||-||-|
|DAN (ResNet) ||83.0||39.1||53.9||64.3||82.8||38.1||54.0||64.2|
5.2 Comparison to state of the art
Table 2 shows the performance of our model on the VQA 1.0 dataset. We trained our model on the train and validation sets and tested the performance on the test-standard set. Our model achieves an overall accuracy of 64.6% on the test-standard set, outperforming the best previously reported results by 0.4%. All the parameters here are the same as in the default model.
While architecturally our default model is almost identical to stacked attention networks (SAN), some details are different. For example, they use the VGG model, while we use ResNet to compute image features. They do not mention normalization of image features, which we found to be crucial to reducing training time. They use the SGD optimizer with momentum, while we found that Adam generally leads to faster convergence.
We also report our results on the VQA 2.0 dataset (Table 3). At this point we only have access to the train and validation splits for this dataset, so we trained the same model on the training set and evaluated it on the validation set. Overall our model achieves 59.7% accuracy on the validation set, which is about 0.5% higher than the best previously reported results.
In this paper we presented a new baseline for the visual question answering task that outperforms previously reported results on the VQA 1.0 and VQA 2.0 datasets. Our model is architecturally very simple and in essence very similar to models that were tried before; nevertheless, we show that once the details are done right, this model outperforms all previously reported results.
-  J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Neural module networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
-  S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. Vqa: Visual question answering. In International Journal of Computer Vision, 2015.
-  D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.
-  K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In BMVC, 2014.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009.
-  A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. In EMNLP, 2016.
-  X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
-  Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. CoRR, abs/1612.00837, 2016.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
-  G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012.
-  S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9:1735–1780, 1997.
-  J. Johnson, A. Karpathy, and L. Fei-Fei. Densecap: Fully convolutional localization networks for dense captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4565–4574, 2016.
-  J.-H. Kim, S.-W. Lee, D. Kwak, M.-O. Heo, J. Kim, J.-W. Ha, and B.-T. Zhang. Multimodal residual learning for visual qa. In Advances in Neural Information Processing Systems, pages 361–369, 2016.
-  D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
-  Y. LeCun, K. Kavukcuoglu, and C. Farabet. Convolutional networks and applications in vision. In ISCAS, 2010.
-  T.-Y. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
-  J. Lu, J. Yang, D. Batra, and D. Parikh. Hierarchical question-image co-attention for visual question answering. In Advances In Neural Information Processing Systems, pages 289–297, 2016.
-  J. Mao, J. Huang, A. Toshev, O. Camburu, A. L. Yuille, and K. Murphy. Generation and comprehension of unambiguous object descriptions. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
-  V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu. Recurrent models of visual attention. In NIPS, 2014.
-  V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In ICML, 2010.
-  H. Nam, J.-W. Ha, and J. Kim. Dual attention networks for multimodal reasoning and matching. CoRR, abs/1611.00471, 2016.
-  H. Noh and B. Han. Training recurrent answering units with joint loss minimization for vqa. CoRR, abs/1606.03647, 2016.
-  H. Sak, A. W. Senior, and F. Beaufays. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In INTERSPEECH, 2014.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
-  C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.
-  O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
-  Q. Wu, P. Wang, C. Shen, A. R. Dick, and A. van den Hengel. Ask me anything: Free-form visual question answering based on knowledge from external sources. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
-  C. Xiong, S. Merity, and R. Socher. Dynamic memory networks for visual and textual question answering. In ICML, 2016.
-  K. Xu, J. Ba, J. R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.
-  Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. Stacked attention networks for image question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 21–29, 2016.