Deep learning has proven applicable to several problems and data modalities (e.g. object detection, speech recognition, machine translation). Furthermore, it has set new records, beating the state of the art in several artificial intelligence areas. New machine learning problems may now be tackled by exploiting the capacity of deep learning methods to combine multiple data modalities while remaining end-to-end trainable, thus opening new research and application areas. Some multimodal problems are image captioning, video captioning, or multimodal machine translation and crosslingual image captioning. In this work, we address the challenging Visual Question Answering (VQA) problem.
From the visual modality perspective, Convolutional Neural Networks (CNNs) are the natural choice for processing images. CNNs are a powerful tool, not only for image classification, but also for feature extraction. Nevertheless, they are not fully scale- and rotation-invariant unless they have been trained with sufficiently varied examples. This invariance problem becomes more acute in scene images, which are composed of multiple elements at possibly different rotations and scales. In order to tackle this problem, Liu proposed a kernelized approach for learning a rich representation of images composed of multiple objects at any possible rotation and scale.
From the textual modality perspective, Recurrent Neural Networks (RNNs) have proven to be effective sequence modelers. The use of gated units, such as the Long Short-Term Memory (LSTM) unit, allows them to properly process long sequences. In recent years, LSTM networks have been used in a wide variety of tasks, such as machine translation or image and video captioning [18, 12].
After the appearance of the VQA dataset and the organization of the VQA Challenge, several models addressing this problem appeared. Some notable examples are the one by Kim et al., where the image and the question were separately encoded by a CNN and an RNN, respectively, and a Multimodal Residual Network (MRN) was then used to combine both modalities. Fukui et al. used a CNN for describing the image and a two-layer LSTM for the question, followed by Multimodal Compact Bilinear pooling (MCB) for fusion. Nam et al., after encoding the input image and question, applied a powerful Dual Attention Network (DAN) for fusing both modalities.
In this work, we propose a model for open-ended VQA that builds on powerful state-of-the-art methods for image and text characterization. More precisely, we use a Kernelized CNN (KCNN) for image characterization, which detects and characterizes all objects in the image in order to generate a combined feature descriptor. For question modeling, we apply pre-trained word embeddings from GloVe, taking advantage of the transfer learning capabilities of neural networks, together with a Bidirectional LSTM (BLSTM), which learns rich question information by taking into account temporal relationships in both past-to-future and future-to-past directions. Next, we fuse both modalities and apply a classification model for obtaining the resulting answer.
The rest of this paper is structured as follows. In Section 2, we describe the proposed model. In Section 3, we describe the dataset and the evaluation metrics used, and we evaluate our model and compare it with the state of the art. Finally, in Section 4, we give some concluding remarks and directions for future work.
In this section, we describe our VQA system, named Visual Bidirectional Kernelized Network (VIBIKNet), whose general scheme can be seen in Fig. 1. We also release the complete source code for reproducing the obtained results: https://github.com/MarcBS/VIBIKNet.
The VQA problem consists in computing a function which, given an input image $x$ and a related question $q$, produces a textual answer $a$:

$$a = f(x, q), \qquad (1)$$

where $q$ and $a$ are two variable-sized sequences of words, which can be formalized as $q = q_1, \dots, q_N$ and $a = a_1, \dots, a_M$, respectively.
We formulate the problem under a probabilistic framework. Given its clear multimodality, we first extract independent representations for the image and the question. For obtaining a rich representation of the image, we apply a KCNN (Section 2.1). We process the question with a BLSTM network (Section 2.2), which considers the full question context. Next, we combine both modalities into a single representation; to this purpose we propose a simple, yet effective, element-wise summation (see Section 2.3).
2.1 KCNN for image representation
A key factor that makes humans able to understand what happens in a picture is the ability to distinguish each of the elements present in it, regardless of their scale or orientation, together with the relationships and actions taking place between them. By elements we refer to any object, person or animal appearing in the image.
Following this idea, the so-called Kernelized Deep Convolutional Neural Network method has the ability to capture all these aspects. Fig. 2 shows the general pipeline for extracting KCNN features from images.
More formally, given two images $x$ and $y$, and a set of variable-sized regions for each of them, $\{x_1, \dots, x_n\}$ and $\{y_1, \dots, y_m\}$, we can define their similarity, given by a kernel $K$, as:

$$K(x, y) = \sum_{i=1}^{n} \sum_{j=1}^{m} k(x_i, y_j) = \langle \Phi(x), \Phi(y) \rangle, \qquad (2)$$

where the similarity between two regions is computed by their inner product, $k(x_i, y_j) = \langle \phi(x_i), \phi(y_j) \rangle$; $\phi$ denotes a linear/non-linear transformation and $\Phi$ denotes the final vectorial image representation composed from the set of initial regions.
Going back to the general scheme, an object detector is first used for extracting candidate object bounding boxes from each image $x$. After that, in order to provide robustness to the point of view, a set of rotations is applied separately to each of the extracted image regions before their features are extracted through a CNN ($\phi$ in Eq. 2). Next, a PCA transformation is applied to the vectors from all image regions. In order to aggregate all vectors from a single image, we learn a Fisher kernel which, similarly to a Bag-of-Words approach, jointly models the feature distribution by learning a Gaussian Mixture Model (GMM), yielding the aggregation $\Phi$ in Eq. 2. In order to keep vector sizes manageable, an additional PCA is applied to the resulting aggregated vectors. This produces a fixed-size representation of the image, which is finally normalized to obtain the final image representation, $\Phi(x)$.
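The aggregation stage above can be sketched as follows. This is a simplified, illustrative numpy version: it uses a toy GMM with fixed parameters instead of one fitted to training data, and a mean-residual soft assignment rather than the full Fisher vector; all names and sizes are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def pca_project(X, k):
    """Project the rows of X onto their top-k principal components."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

def aggregate_regions(region_feats, means, covs, priors):
    """Simplified Fisher-style aggregation: soft-assign each region
    descriptor to GMM components and accumulate mean residuals."""
    K, d = means.shape
    # Per-component Gaussian log-densities (diagonal covariances)
    log_p = np.stack([
        -0.5 * np.sum((region_feats - means[k]) ** 2 / covs[k]
                      + np.log(2 * np.pi * covs[k]), axis=1)
        for k in range(K)
    ], axis=1) + np.log(priors)
    # Soft assignments (responsibilities) of regions to components
    gamma = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    gamma /= gamma.sum(axis=1, keepdims=True)
    # Accumulate normalized residuals per component and flatten
    fv = np.concatenate([
        (gamma[:, k:k+1] * (region_feats - means[k]) / np.sqrt(covs[k])).sum(axis=0)
        for k in range(K)
    ])
    return fv / (np.linalg.norm(fv) + 1e-12)   # final L2 normalization

# Toy run: 100 region descriptors of dim 32, PCA to 8 dims, 4 GMM components
regions = rng.normal(size=(100, 32))
reduced = pca_project(regions, 8)
means = rng.normal(size=(4, 8))
covs = np.ones((4, 8))
priors = np.full(4, 0.25)
image_vec = aggregate_regions(reduced, means, covs, priors)
print(image_vec.shape)  # (32,) = 4 components x 8 dims
```

In the actual pipeline the GMM is learned from the PCA-reduced CNN descriptors of the training images, and a second PCA compresses the aggregated vector.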
2.2 Bidirectional LSTM for question representation
As stated above, a question $q$ is a variable-sized sequence of words. We use a powerful sequence modeler, namely an RNN, for characterizing $q$: each word is input to the system following a one-hot codification. Next, we project each word to a continuous space by means of a learnable word embedding matrix. In order to effectively train our word embedding model, we start from the pre-trained word vectors provided by GloVe and fine-tune them on the questions corpus. Words not included in GloVe are randomly initialized.
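This initialization strategy can be sketched as follows; a minimal illustration in which `build_embedding_matrix`, the toy vocabulary, and the in-memory `pretrained` dictionary are all hypothetical (real GloVe vectors would be loaded from the distributed text files):

```python
import numpy as np

def build_embedding_matrix(vocab, pretrained, dim=300, seed=0):
    """Initialize an embedding matrix from pretrained vectors;
    words missing from them are drawn randomly and are fine-tuned
    together with the rest of the matrix during training."""
    rng = np.random.default_rng(seed)
    E = rng.normal(scale=0.1, size=(len(vocab), dim))
    for i, word in enumerate(vocab):
        if word in pretrained:
            E[i] = pretrained[word]   # copy the pretrained vector
    return E

# Toy vocabulary with one out-of-vocabulary token
pretrained = {"what": np.ones(300), "color": np.full(300, 2.0)}
vocab = ["what", "color", "vibiknet"]
E = build_embedding_matrix(vocab, pretrained)
print(E.shape)  # (3, 300)
```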
The sequence of word embeddings is then input to a bidirectional RNN. Bidirectional RNNs are made up of two independent recurrent layers, each analyzing the input sequence in one direction. Hence, the forward layer processes the sequence from left to right, while the backward layer processes it from right to left. In our case, each recurrent layer is an LSTM layer.
LSTM networks mitigate the vanishing gradient problem. These layers maintain two internal states, namely the hidden state ($h$) and the memory state ($c$). The amount of information that flows through the network is modulated by the input ($i$), output ($o$) and forget ($f$) gates. We refer the reader to the literature for a more in-depth review of LSTM networks.
For obtaining a representation of the complete question, we concatenate the last hidden states from the forward and backward layers:

$$\mathbf{q} = [\overrightarrow{h}_N ; \overleftarrow{h}_N], \qquad (3)$$

where $\overrightarrow{h}_N$ and $\overleftarrow{h}_N$ are the last forward and backward hidden states, of size $H$ each; $[\cdot\,;\cdot]$ denotes vectorial concatenation and $\mathbf{q}$ is the final representation of the question. Since each LSTM layer processes the complete input sequence in one direction, $\mathbf{q}$ contains both left-to-right and right-to-left dependencies.
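The forward/backward pass and final concatenation can be sketched with a minimal numpy LSTM. This is illustrative only: the weight shapes, the `[i, f, o, g]` gate packing, and the random toy inputs are assumptions, and a real implementation would use a deep learning framework.

```python
import numpy as np

def lstm_last_state(X, Wx, Wh, b):
    """Run a single LSTM layer over sequence X (T x d_in) and
    return the last hidden state. Gates are packed as [i, f, o, g]."""
    H = Wh.shape[0]
    h = np.zeros(H)
    c = np.zeros(H)
    sigm = lambda z: 1.0 / (1.0 + np.exp(-z))
    for x_t in X:
        z = x_t @ Wx + h @ Wh + b            # all gate pre-activations, (4H,)
        i, f, o = sigm(z[:H]), sigm(z[H:2*H]), sigm(z[2*H:3*H])
        g = np.tanh(z[3*H:])
        c = f * c + i * g                    # memory state update
        h = o * np.tanh(c)                   # hidden state update
    return h

def blstm_encode(X, fw, bw):
    """Concatenate the last forward and last backward hidden states."""
    return np.concatenate([lstm_last_state(X, *fw),
                           lstm_last_state(X[::-1], *bw)])

rng = np.random.default_rng(0)
d_in, H, T = 16, 8, 5
make = lambda: (rng.normal(scale=0.1, size=(d_in, 4 * H)),
                rng.normal(scale=0.1, size=(H, 4 * H)),
                np.zeros(4 * H))
X = rng.normal(size=(T, d_in))               # embedded question, T words
q_vec = blstm_encode(X, make(), make())
print(q_vec.shape)  # (16,) = 2 * H
```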
2.3 Multimodal fusion and prediction
Multimodal fusion. Our problem involves information from two different sources; hence, we must combine image and text, given that the image is represented by a KCNN feature vector $\Phi(x)$ of size $V$, and the question is represented as $\mathbf{q}$, of size $2H$.
In order to properly combine both modalities, we first linearly project the image representation to the same space as the question representation, by means of a visual embedding matrix:

$$\hat{x} = W \, \Phi(x), \qquad (4)$$

where $W$ is a $2H \times V$ matrix, jointly estimated with the rest of the model.
Then, a fusion operation is applied on both modalities, $\hat{x}$ and $\mathbf{q}$:

$$m = \hat{x} \oplus \mathbf{q}, \qquad (5)$$

where $\oplus$ is the fusion operator and $m$ is the joint, multimodal representation of the image and question.
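A minimal sketch of the projection and fusion steps, with element-wise summation and concatenation as the two operators compared later; the sizes and random vectors are illustrative, and in practice the projection matrix is learned jointly with the rest of the model:

```python
import numpy as np

rng = np.random.default_rng(0)
d_v, d_q = 1024, 500                          # illustrative sizes

phi_x = rng.normal(size=d_v)                  # KCNN image vector
q = rng.normal(size=d_q)                      # BLSTM question vector

W = rng.normal(scale=0.01, size=(d_q, d_v))   # visual embedding matrix
x_emb = W @ phi_x                             # project image into question space

m_sum = x_emb + q                             # element-wise summation fusion
m_cat = np.concatenate([x_emb, q])            # concatenation fusion
print(m_sum.shape, m_cat.shape)               # (500,) (1000,)
```

Note that summation keeps the fused vector at the question-representation size, while concatenation doubles it, which affects the size of the classification layer that follows.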
Prediction. Given the nature of the task at hand, a typical answer has few words. More precisely, in the VQA dataset (Section 3.1), 89.3% of the answers are single-worded, and 99.0% of the answers have three or fewer words.
Therefore, we treat our problem as a classification task over the most repeated answers. The obtained fusion of vision and text ($m$ in Eq. 5) is input to a fully-connected layer with the set of answers as output. Applying a softmax activation, we define a probability distribution over the possible answers. At test time, we choose the answer $\hat{a}$ with the highest probability:

$$\hat{a} = \arg\max_{a} \, p(a \mid x, q). \qquad (6)$$
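The classification step can be sketched as follows, with random weights standing in for the trained fully-connected layer and an arbitrary fused vector as input:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
n_answers, d_m = 2000, 500                    # candidate answers, fused size
W_c = rng.normal(scale=0.01, size=(n_answers, d_m))
b_c = np.zeros(n_answers)

m = rng.normal(size=d_m)                      # fused multimodal vector
p = softmax(W_c @ m + b_c)                    # distribution over answers
a_hat = int(np.argmax(p))                     # index of the predicted answer
print(p.shape)  # (2000,)
```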
3 Experiments and results
In this section we set up the experimentation and evaluation procedure. Moreover, we study and discuss the results obtained in the VQA Challenge (the VQA Challenge leaderboard is available at http://visualqa.org/roe.html).
3.1 Dataset and evaluation
We evaluate our model on the VQA dataset, on the real open-ended task. The dataset consists of approximately 200,000 images from the MSCOCO dataset. Each image has three associated questions and each question has ten answers, provided by human annotators. We used the default splits for the task: Train (80,000 images) for training, Test-Dev (40,000 images) for validating the model and Test-Standard (80,000 images) for testing it. An additional partition, Test-Challenge, was used for evaluating the model at the VQA Challenge.
We followed the VQA evaluation protocol, which computes an accuracy between the system output $\hat{a}$ and the answers provided by the humans:

$$\mathrm{Acc}(\hat{a}) = \min\left( \frac{\#\{\text{humans that answered } \hat{a}\}}{3},\ 1 \right). \qquad (7)$$
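This metric can be sketched as follows; following the official protocol, the min(·) score is averaged over the ten leave-one-out subsets of the human answers:

```python
def vqa_accuracy(pred, human_answers):
    """VQA accuracy: min(#matching humans / 3, 1), averaged over
    all subsets that leave one of the ten human answers out."""
    n = len(human_answers)
    accs = []
    for i in range(n):
        rest = human_answers[:i] + human_answers[i + 1:]
        accs.append(min(rest.count(pred) / 3.0, 1.0))
    return sum(accs) / n

# An answer matched by at least four annotators scores full accuracy
humans = ["red"] * 4 + ["blue"] * 6
print(round(vqa_accuracy("red", humans), 3))  # 1.0
```

Intuitively, an answer is considered fully correct when at least three humans gave it, and partially correct otherwise.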
3.2 Experimental setup
We set the model hyperparameters according to empirical results. For extracting the KCNN features, we used: EdgeBoxes for proposing 100 object regions; a set of 8 different object rotations; the last fully-connected layer of GoogLeNet (1024-dimensional) for extracting features from each object; a PCA of 128 dimensions before and after the GMM; and 128 Gaussians learned during GMM training.
Since we used GloVe vectors, the word embedding size was fixed to 300. The BLSTM network had 250 units in each layer, accounting for a total of 500 units. The visual embedding had the same size as the question representation (500). We applied a classification over the 2,000 most frequent answers, which covered 86.8% of the whole dataset. As fusion operator ($\oplus$ in Eq. 5) we tested element-wise summation and vector concatenation. Following Fukui et al., we also tested MCB pooling as $\oplus$.
We used the Adam optimizer. As regularization strategy, we only applied dropout before the classification layer.
3.3 Experimental results
Table 1 shows the accuracies of variations of our model (top) and of other works (bottom) for the Test-Dev and Test-Standard splits. Results are separated according to the type of answer, namely yes/no (Y/N), numerical (Num.) and other (Other) answers. We also report the overall accuracy of the task (All).
| Model | Test-Dev Y/N [%] | Test-Dev Num. [%] | Test-Dev Other [%] | Test-Dev All [%] | Test-Std. Y/N [%] | Test-Std. Num. [%] | Test-Std. Other [%] | Test-Std. All [%] |
|---|---|---|---|---|---|---|---|---|
| G-K L sum | 79.0 | 38.2 | 33.7 | 52.9 | – | – | – | – |
| G-K BL FC sum | 78.6 | 33.6 | 36.9 | 52.1 | – | – | – | – |
| G-K BL FC cat | 79.0 | 33.6 | 38.3 | 53.0 | – | – | – | – |
| R BL sum | 77.8 | 30.6 | 38.6 | 52.3 | – | – | – | – |
| G-K BL cat | 79.0 | 33.4 | 38.5 | 53.0 | – | – | – | – |
| G-K BL MCB | 79.2 | 33.2 | 37.5 | 52.5 | – | – | – | – |
It can be seen that both summation and concatenation fusion strategies performed similarly, and MCB was on par with them in terms of accuracy. Nevertheless, MCB was much more resource-demanding: while the average time per epoch with summation was 320s, MCB required approximately 5,800s. Moreover, adding a fully-connected layer after the text characterization and before the fusion did not help, meaning that the visual embedding mechanism suffices for providing a robust visual-text embedding. Regarding image characterization, comparing ResNet-152 against GoogLeNet-KCNN shows that, even with a less powerful base CNN architecture, adopting the KCNN representation provided better results than simply using the ResNet output. Finally, it is worth noting that we used a single model for prediction; the use of network ensembles typically offers a performance boost. Fig. 3 shows some qualitative examples of our methodology.
4 Conclusions and Future Work
We proposed a method for VQA which offers a trade-off between the accuracy and the computational cost of the model. We have shown that kernelized image representations based on CNNs are very powerful for the problem at hand. Additionally, we have shown that simple fusion methods such as summation or concatenation can produce results similar to those of more elaborate methods, while being far more efficient to compute. Nevertheless, we are aware that performing the multimodal fusion at deeper levels may be beneficial.
As future directions, we aim to delve into better fusion strategies while keeping a low computational cost. We extracted KCNN features based on local representations (object appearance), but combining them with end-to-end trainable attention mechanisms may lead to higher performance.
This work was partially funded by TIN2015-66951-C2-1-R, SGR 1219, CERCA Programme / Generalitat de Catalunya, CoMUN-HaT - TIN2015-70924-C2-1-R (MINECO/FEDER), PrometeoII/2014/030 and R-MIPRCV. P. Radeva is partially supported by ICREA Academia’2014. We acknowledge NVIDIA Corporation for the donation of a GPU used in this work.
-  S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh. VQA: Visual question answering. In ICCV, pages 2425–2433, 2015.
-  X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv:1504.00325, 2015.
-  G. Cheng, P. Zhou, and J. Han. RIFD-CNN: Rotation-invariant and Fisher discriminative convolutional neural networks for object detection. In CVPR, pages 2884–2893, 2016.
-  A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv:1606.01847, 2016.
-  F. A. Gers, J. Schmidhuber, and F. Cummins. Learning to forget: Continual prediction with LSTM. Neural computation, 12(10):2451–2471, 2000.
-  S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
-  J.-H. Kim, S.-W. Lee, D.-H. Kwak, M.-O. Heo, J. Kim, J.-W. Ha, and B.-T. Zhang. Multimodal residual learning for visual QA. arXiv:1606.01455, 2016.
-  D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.
-  Z. Liu. Kernelized deep convolutional neural network for describing complex images. arXiv:1509.04581, 2015.
-  H. Nam, J.-W. Ha, and J. Kim. Dual attention networks for multimodal reasoning and matching. arXiv:1611.00471, 2016.
-  J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In EMNLP, pages 1532–1543, 2014.
-  Á. Peris, M. Bolaños, P. Radeva, and F. Casacuberta. Video description using bidirectional recurrent neural networks. In ICANN, volume 2, pages 3–11, 2016.
-  F. Perronnin, J. Sánchez, and T. Mensink. Improving the fisher kernel for large-scale image classification. In ECCV, pages 143–156. Springer, 2010.
-  J. Sivic and A. Zisserman. Efficient visual search of videos cast as text retrieval. PAMI, 31(4):591–606, 2009.
-  L. Specia, S. Frank, K. Sima’an, and D. Elliott. A shared task on multimodal machine translation and crosslingual image description. In Proceedings of the First Conference on Machine Translation, pages 543–553. ACL, 2016.
-  I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In NIPS, volume 27, pages 3104–3112. 2014.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, pages 1–9, 2015.
-  K. Xu, J. Ba, R. Kiros, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. arXiv:1502.03044, 2015.
-  C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In ECCV, 2014.