Hierarchical Question-Image Co-Attention for Visual Question Answering

05/31/2016 ∙ by Jiasen Lu, et al. ∙ Virginia Polytechnic Institute and State University

A number of recent works have proposed attention models for Visual Question Answering (VQA) that generate spatial maps highlighting image regions relevant to answering the question. In this paper, we argue that in addition to modeling "where to look" or visual attention, it is equally important to model "what words to listen to" or question attention. We present a novel co-attention model for VQA that jointly reasons about image and question attention. In addition, our model reasons about the question (and consequently the image via the co-attention mechanism) in a hierarchical fashion via a novel 1-dimensional convolutional neural network (CNN). Our model improves the state of the art on the VQA dataset from 60.3% to 60.5%, and from 61.6% to 63.3% on the COCO-QA dataset. By using ResNet, the performance is further improved to 62.1% for VQA and 65.4% for COCO-QA.



1 Introduction

Visual Question Answering (VQA) antol2015vqa ; gao2015you ; malinowski2015ask ; ren2015exploring ; zitnick2016measuring has emerged as a prominent multi-discipline research problem in both academia and industry. To correctly answer visual questions about an image, the machine needs to understand both the image and question. Recently, visual attention based models shih2015look ; xiong2016dynamic ; xu2015ask ; yang2015stacked have been explored for VQA, where the attention mechanism typically produces a spatial map highlighting image regions relevant to answering the question.

So far, all attention models for VQA in the literature have focused on the problem of identifying “where to look” or visual attention. In this paper, we argue that the problem of identifying “which words to listen to” or question attention is equally important. Consider the questions “how many horses are in this image?” and “how many horses can you see in this image?”. They have the same meaning, essentially captured by the first three words. A machine that attends to the first three words would arguably be more robust to linguistic variations irrelevant to the meaning and answer of the question. Motivated by this observation, in addition to reasoning about visual attention, we also address the problem of question attention. Specifically, we present a novel multi-modal attention model for VQA with the following two unique features:

Co-Attention: We propose a novel mechanism that jointly reasons about visual attention and question attention, which we refer to as co-attention. Unlike previous works, which only focus on visual attention, our model has a natural symmetry between the image and question, in the sense that the image representation is used to guide the question attention and the question representation(s) are used to guide image attention.

Question Hierarchy: We build a hierarchical architecture that co-attends to the image and question at three levels: (a) word level, (b) phrase level and (c) question level. At the word level, we embed the words to a vector space through an embedding matrix. At the phrase level, 1-dimensional convolutional neural networks are used to capture the information contained in unigrams, bigrams and trigrams. Specifically, we convolve word representations with temporal filters of varying support, and then combine the various n-gram responses by pooling them into a single phrase level representation. At the question level, we use recurrent neural networks to encode the entire question. For each level of the question representation in this hierarchy, we construct joint question and image co-attention maps, which are then combined recursively to ultimately predict a distribution over the answers.

Overall, the main contributions of our work are:

  • We propose a novel co-attention mechanism for VQA that jointly performs question-guided visual attention and image-guided question attention. We explore this mechanism with two strategies, parallel and alternating co-attention, which are described in Sec. 3.3;

  • We propose a hierarchical architecture to represent the question, and consequently construct image-question co-attention maps at 3 different levels: word level, phrase level and question level. These co-attended features are then recursively combined from word level to question level for the final answer prediction;

  • At the phrase level, we propose a novel convolution-pooling strategy to adaptively select the phrase sizes whose representations are passed to the question level representation;

  • Finally, we evaluate our proposed model on two large datasets, VQA antol2015vqa and COCO-QA ren2015exploring . We also perform ablation studies to quantify the roles of different components in our model.

Figure 1: Flowchart of our proposed hierarchical co-attention model. Given a question, we extract its word level, phrase level and question level embeddings. At each level, we apply co-attention on both the image and question. The final answer prediction is based on all the co-attended image and question features.

2 Related Work

Many recent works antol2015vqa ; gao2015you ; krishna2016visual ; malinowski2015ask ; ren2015exploring ; zhang2015yin ; kim2016multimodal ; fukui2016multimodal have proposed models for VQA. We compare and relate our proposed co-attention mechanism to other vision and language attention mechanisms in literature.

Image attention. Instead of directly using the holistic entire-image embedding from the fully connected layer of a deep CNN (as in antol2015vqa ; ma2015learning ; malinowski2015ask ; ren2015exploring ), a number of recent works have explored image attention models for VQA. Zhu et al. zhu2015visual7w add spatial attention to the standard LSTM model for pointing and grounded QA. Andreas et al. andreas2015deep propose a compositional scheme that consists of a language parser and a number of neural module networks. The language parser predicts which neural module network should be instantiated to answer the question. Some other works perform image attention multiple times in a stacked manner. In yang2015stacked , the authors propose a stacked attention network, which runs multiple hops to infer the answer progressively. To capture fine-grained information from the question, Xu et al. xu2015ask propose a multi-hop image attention scheme. It aligns words to image patches in the first hop, and then refers to the entire question for obtaining image attention maps in the second hop. In shih2015look , the authors generate image regions with object proposals and then select the regions relevant to the question and answer choice. Xiong et al. xiong2016dynamic augment the dynamic memory network with a new input fusion module and retrieve an answer from an attention based GRU. In concurrent work, das2016human collected ‘human attention maps’ that are used to evaluate the attention maps generated by attention models for VQA. Note that all of these approaches model visual attention alone, and do not model question attention. Moreover, xu2015ask ; yang2015stacked model attention sequentially, i.e., later attention is based on earlier attention, which is prone to error propagation. In contrast, we conduct co-attention at three levels independently.

Language Attention. Though no prior work has explored question attention in VQA, there are some related works in natural language processing (NLP) in general that have modeled language attention. In order to overcome difficulty in translation of long sentences, Bahdanau et al. bahdanau2014neural propose RNNSearch to learn an alignment over the input sentences. In hermann2015teaching , the authors propose an attention model to circumvent the bottleneck caused by fixed width hidden vector in text reading and comprehension. A more fine-grained attention mechanism is proposed in rocktaschel2015reasoning . The authors employ a word-by-word neural attention mechanism to reason about the entailment in two sentences. Also focused on modeling sentence pairs, the authors in yin2015abcnn propose an attention-based bigram CNN for jointly performing attention between two CNN hierarchies. In their work, three attention schemes are proposed and evaluated. In santos2016attentive , the authors propose a two-way attention mechanism to project the paired inputs into a common representation space.

3 Method

We begin by introducing the notation used in this paper. To ease understanding, our full model is described in parts. First, our hierarchical question representation is described in Sec. 3.2 and the proposed co-attention mechanism is then described in Sec. 3.3. Finally, Sec. 3.4 shows how to recursively combine the attended question and image features to output answers.

3.1 Notation

Given a question with $T$ words, its representation is denoted by $Q = \{q_1, \ldots, q_T\}$, where $q_t$ is the feature vector for the $t$-th word. We denote $q^w_t$, $q^p_t$ and $q^s_t$ as word embedding, phrase embedding and question embedding at position $t$, respectively. The image feature is denoted by $V = \{v_1, \ldots, v_N\}$, where $v_n$ is the feature vector at the spatial location $n$. The co-attention features of image and question at each level in the hierarchy are denoted as $\hat{v}^r$ and $\hat{q}^r$ where $r \in \{w, p, s\}$. The weights in different modules/layers are denoted with $W$, with appropriate sub/super-scripts as necessary. In the exposition that follows, we omit the bias term to avoid notational clutter.

3.2 Question Hierarchy

Given the 1-hot encoding of the question words $Q = \{q_1, \ldots, q_T\}$, we first embed the words to a vector space (learnt end-to-end) to get $Q^w = \{q^w_1, \ldots, q^w_T\}$. To compute the phrase features, we apply 1-D convolution on the word embedding vectors. Concretely, at each word location, we compute the inner product of the word vectors with filters of three window sizes: unigram, bigram and trigram. For the $t$-th word, the convolution output with window size $s$ is given by

$$\hat{q}^p_{s,t} = \tanh\left(W^s_c \, q^w_{t:t+s-1}\right), \quad s \in \{1, 2, 3\},$$

where $W^s_c$ is the weight parameters. The word-level features $Q^w$ are appropriately 0-padded before feeding into bigram and trigram convolutions to maintain the length of the sequence after convolution. Given the convolution result, we then apply max-pooling across different n-grams at each word location to obtain phrase-level features

$$q^p_t = \max\left(\hat{q}^p_{1,t}, \hat{q}^p_{2,t}, \hat{q}^p_{3,t}\right), \quad t \in \{1, 2, \ldots, T\}.$$

Our pooling method differs from those used in previous works hu2014convolutional in that it adaptively selects different gram features at each time step, while preserving the original sequence length and order. We use an LSTM to encode the sequence $q^p_t$ after max-pooling. The corresponding question-level feature $q^s_t$ is the LSTM hidden vector at time $t$.

Our hierarchical representation of the question is depicted in Fig. 3(a).
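The phrase-level step (n-gram convolution followed by a max over window sizes at each word position) can be sketched in NumPy as below. This is a minimal illustration, not the authors' implementation: the function name `phrase_features`, the filter shapes, the `tanh` nonlinearity and the choice to zero-pad on the right are assumptions made for the example.

```python
import numpy as np

def phrase_features(Qw, filters):
    """Sketch of phrase-level features.

    Qw      : (d, T) array of word embeddings, one column per word.
    filters : dict mapping window size s in {1, 2, 3} to a (d, s*d) weight
              matrix (hypothetical shapes chosen for this example).
    Returns a (d, T) array: element-wise max over the three n-gram responses.
    """
    d, T = Qw.shape
    responses = []
    for s in (1, 2, 3):
        # Assumption: pad s-1 zero columns on the right so that every window
        # t..t+s-1 exists and the output keeps the sequence length T.
        padded = np.concatenate([Qw, np.zeros((d, s - 1))], axis=1)
        out = np.empty((d, T))
        for t in range(T):
            # Stack the s word vectors of the window into one long vector.
            window = padded[:, t:t + s].reshape(-1, order="F")
            out[:, t] = np.tanh(filters[s] @ window)
        responses.append(out)
    # Max-pool across window sizes at each position and feature dimension.
    return np.max(np.stack(responses), axis=0)

# Toy example with random weights (illustrative sizes only).
rng = np.random.default_rng(0)
d, T = 4, 6
Qw = rng.standard_normal((d, T))
filters = {s: rng.standard_normal((d, s * d)) * 0.1 for s in (1, 2, 3)}
Qp = phrase_features(Qw, filters)
print(Qp.shape)
```

Because the max is taken per position, the output has the same length and order as the input sequence, which is the property the text highlights; a sequence model such as an LSTM could then consume `Qp` column by column.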

3.3 Co-Attention

We propose two co-attention mechanisms that differ in the order in which image and question attention maps are generated. The first mechanism, which we call parallel co-attention, generates image and question attention simultaneously. The second mechanism, which we call alternating co-attention, sequentially alternates between generating image and question attentions. See Fig. 2. These co-attention mechanisms are executed at all three levels of the question hierarchy.

Figure 2: (a) Parallel co-attention mechanism; (b) Alternating co-attention mechanism.

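Parallel Co-Attention. Parallel co-attention attends to the image and question simultaneously. Similar to xu2015ask , we connect the image and question by calculating the similarity between image and question features at all pairs of image-locations and question-locations. Specifically, given an image feature map $V \in \mathbb{R}^{d \times N}$, and the question representation $Q \in \mathbb{R}^{d \times T}$, the affinity matrix $C \in \mathbb{R}^{T \times N}$ is calculated by

$$C = \tanh\left(Q^T W_b V\right),$$

where $W_b \in \mathbb{R}^{d \times d}$ contains the weights. After computing this affinity matrix, one possible way of computing the image (or question) attention is to simply maximize out the affinity over the locations of other modality, i.e., $a^v[n] = \max_i (C_{i,n})$ and $a^q[t] = \max_j (C_{t,j})$. Instead of choosing the max activation, we find that performance is improved if we consider this affinity matrix as a feature and learn to predict image and question attention maps via the following

$$H^v = \tanh\left(W_v V + (W_q Q) C\right), \quad H^q = \tanh\left(W_q Q + (W_v V) C^T\right),$$

$$a^v = \mathrm{softmax}\left(w_{hv}^T H^v\right), \quad a^q = \mathrm{softmax}\left(w_{hq}^T H^q\right),$$

where $W_v, W_q \in \mathbb{R}^{k \times d}$, $w_{hv}, w_{hq} \in \mathbb{R}^{k}$ are the weight parameters. $a^v \in \mathbb{R}^{N}$ and $a^q \in \mathbb{R}^{T}$ are the attention probabilities of each image region $v_n$ and word $q_t$ respectively. The affinity matrix $C$ transforms question attention space to image attention space (vice versa for $C^T$). Based on the above attention weights, the image and question attention vectors are calculated as the weighted sum of the image features and question features, i.e.,

$$\hat{v} = \sum_{n=1}^{N} a^v_n v_n, \quad \hat{q} = \sum_{t=1}^{T} a^q_t q_t.$$

The parallel co-attention is done at each level in the hierarchy, leading to $\hat{v}^r$ and $\hat{q}^r$ where $r \in \{w, p, s\}$.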
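As a rough illustration of parallel co-attention, the NumPy sketch below computes an affinity matrix between every word and every image location, uses it to mix the two modalities, and returns attention-weighted sums of the image and question features. The function and variable names, the toy sizes and the `tanh` activations are assumptions for this example; biases are omitted, as in the text.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def parallel_coattention(V, Q, W_b, W_v, W_q, w_hv, w_hq):
    """V: (d, N) image features, Q: (d, T) question features.

    W_b: (d, d), W_v and W_q: (k, d), w_hv and w_hq: (k,).
    Returns attended vectors and the two attention distributions.
    """
    C = np.tanh(Q.T @ W_b @ V)                  # (T, N) word-region affinity
    H_v = np.tanh(W_v @ V + (W_q @ Q) @ C)      # (k, N) image side
    H_q = np.tanh(W_q @ Q + (W_v @ V) @ C.T)    # (k, T) question side
    a_v = softmax(w_hv @ H_v)                   # (N,) attention over regions
    a_q = softmax(w_hq @ H_q)                   # (T,) attention over words
    v_hat = V @ a_v                             # (d,) weighted sum of regions
    q_hat = Q @ a_q                             # (d,) weighted sum of words
    return v_hat, q_hat, a_v, a_q

# Toy sizes (hypothetical): d=8 features, N=5 regions, T=4 words, k=6 hidden.
rng = np.random.default_rng(0)
d, N, T, k = 8, 5, 4, 6
V = rng.standard_normal((d, N))
Q = rng.standard_normal((d, T))
W_b = rng.standard_normal((d, d)) * 0.1
W_v = rng.standard_normal((k, d)) * 0.1
W_q = rng.standard_normal((k, d)) * 0.1
w_hv = rng.standard_normal(k)
w_hq = rng.standard_normal(k)
v_hat, q_hat, a_v, a_q = parallel_coattention(V, Q, W_b, W_v, W_q, w_hv, w_hq)
print(a_v.sum(), a_q.sum())
```

In the hierarchical model this routine would be called three times, once per question level, each time with that level's question features.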

Alternating Co-Attention. In this attention mechanism, we sequentially alternate between generating image and question attention. Briefly, this consists of three steps (marked in Fig. 2b): 1) summarize the question into a single vector $q$; 2) attend to the image based on the question summary $q$; 3) attend to the question based on the attended image feature.

Concretely, we define an attention operation $\hat{x} = \mathcal{A}(X; g)$, which takes the image (or question) features $X$ and attention guidance $g$ derived from question (or image) as inputs, and outputs the attended image (or question) vector. The operation can be expressed in the following steps

$$H = \tanh\left(W_x X + (W_g g)\mathbb{1}^T\right), \quad a^x = \mathrm{softmax}\left(w_{hx}^T H\right), \quad \hat{x} = \sum_i a^x_i x_i,$$

where $\mathbb{1}$ is a vector with all elements equal to 1. $W_x, W_g \in \mathbb{R}^{k \times d}$ and $w_{hx} \in \mathbb{R}^{k}$ are parameters. $a^x$ is the attention weight of feature $X$.

The alternating co-attention process is illustrated in Fig. 2 (b). At the first step of alternating co-attention, $X = Q$, and $g$ is $0$; at the second step, $X = V$ where $V$ is the image features, and the guidance $g$ is the intermediate attended question feature $\hat{s}$ from the first step; finally, we use the attended image feature $\hat{v}$ as the guidance to attend the question again, i.e., $X = Q$ and $g = \hat{v}$. Similar to the parallel co-attention, the alternating co-attention is also done at each level of the hierarchy.
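The three alternating steps can be sketched with one reusable attention routine that takes a feature matrix and a guidance vector. This is an illustrative sketch only: the names `attend` and `alternating_coattention` are made up for the example, and using a separate parameter set for each of the three steps is an assumption (the text does not say whether parameters are shared).

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(X, g, W_x, W_g, w_hx):
    """Attention over the columns of X (d, n) guided by vector g (d,)."""
    # The guidance term is broadcast across all n positions.
    H = np.tanh(W_x @ X + (W_g @ g)[:, None])   # (k, n)
    a = softmax(w_hx @ H)                       # (n,) attention weights
    return X @ a, a                             # attended vector, weights

def alternating_coattention(V, Q, params_s, params_v, params_q):
    """Each params_* is a (W_x, W_g, w_hx) tuple (assumed not shared)."""
    d = Q.shape[0]
    s_hat, _ = attend(Q, np.zeros(d), *params_s)   # 1) summarize question
    v_hat, a_v = attend(V, s_hat, *params_v)       # 2) attend to image
    q_hat, a_q = attend(Q, v_hat, *params_q)       # 3) re-attend to question
    return v_hat, q_hat, a_v, a_q

# Toy sizes (hypothetical): d=8 features, N=5 regions, T=4 words, k=6 hidden.
rng = np.random.default_rng(0)
d, N, T, k = 8, 5, 4, 6

def init_params():
    return (rng.standard_normal((k, d)) * 0.1,
            rng.standard_normal((k, d)) * 0.1,
            rng.standard_normal(k))

V = rng.standard_normal((d, N))
Q = rng.standard_normal((d, T))
v_hat, q_hat, a_v, a_q = alternating_coattention(
    V, Q, init_params(), init_params(), init_params())
print(a_v.shape, a_q.shape)
```

With a zero guidance vector the first call reduces to plain self-attention over the question words, which matches the description of the first step as a question summary.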

3.4 Encoding for Predicting Answers

Figure 3: (a) Hierarchical question encoding (Sec. 3.2); (b) Encoding for predicting answers (Sec. 3.4).

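Following antol2015vqa , we treat VQA as a classification task. We predict the answer based on the co-attended image and question features from all three levels. We use a multi-layer perceptron (MLP) to recursively encode the attention features as shown in Fig. 3(b):

$$h^w = \tanh\left(W_w (\hat{q}^w + \hat{v}^w)\right),$$

$$h^p = \tanh\left(W_p [(\hat{q}^p + \hat{v}^p), h^w]\right),$$

$$h^s = \tanh\left(W_s [(\hat{q}^s + \hat{v}^s), h^p]\right),$$

$$p = \mathrm{softmax}\left(W_h h^s\right),$$

where $W_w$, $W_p$, $W_s$ and $W_h$ are the weight parameters. $[\cdot]$ is the concatenation operation on two vectors. $p$ is the probability of the final answer.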
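The recursive combination of the three levels can be sketched as follows: the attended question and image vectors at each level are summed, concatenated with the hidden state from the level below, and passed through a `tanh` layer, with a softmax over answers at the top. The function name `predict_answer`, the hidden size and the answer-vocabulary size are assumptions for this toy example, and biases are omitted.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def predict_answer(feats, W_w, W_p, W_s, W_h):
    """feats: [(q_w, v_w), (q_p, v_p), (q_s, v_s)], each vector of size d.

    W_w: (h, d); W_p and W_s: (h, d + h); W_h: (n_answers, h).
    Returns a probability distribution over the answer vocabulary.
    """
    (q_w, v_w), (q_p, v_p), (q_s, v_s) = feats
    h_w = np.tanh(W_w @ (q_w + v_w))                           # word level
    h_p = np.tanh(W_p @ np.concatenate([q_p + v_p, h_w]))      # phrase level
    h_s = np.tanh(W_s @ np.concatenate([q_s + v_s, h_p]))      # question level
    return softmax(W_h @ h_s)

# Toy sizes (hypothetical): d=8 features, h=16 hidden units, 10 answers.
rng = np.random.default_rng(0)
d, h, n_answers = 8, 16, 10
feats = [(rng.standard_normal(d), rng.standard_normal(d)) for _ in range(3)]
W_w = rng.standard_normal((h, d)) * 0.1
W_p = rng.standard_normal((h, d + h)) * 0.1
W_s = rng.standard_normal((h, d + h)) * 0.1
W_h = rng.standard_normal((n_answers, h)) * 0.1
p = predict_answer(feats, W_w, W_p, W_s, W_h)
print(p.shape, p.sum())
```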

4 Experiment

4.1 Datasets

We evaluate the proposed model on two datasets, the VQA dataset antol2015vqa and the COCO-QA dataset ren2015exploring .

VQA dataset antol2015vqa is the largest dataset for this problem, containing human annotated questions and answers on the Microsoft COCO dataset lin2014microsoft . The dataset contains 248,349 training questions, 121,512 validation questions, 244,302 testing questions, and a total of 6,141,630 question-answer pairs. There are three sub-categories according to answer-types including yes/no, number, and other. Each question has 10 free-response answers. We use the top 1000 most frequent answers as the possible outputs similar to antol2015vqa . This set of answers covers 86.54% of the train+val answers. For testing, we train our model on VQA train+val and report the test-dev and test-standard results from the VQA evaluation server. We use the evaluation protocol of antol2015vqa in the experiment.
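Restricting the output space to the most frequent answers, and measuring how many answers that vocabulary covers, can be sketched in a few lines. The function name `top_k_answer_vocab` and the toy answer list are hypothetical; the real setting uses k = 1000 over the train+val answers.

```python
from collections import Counter

def top_k_answer_vocab(answers, k):
    """Return the k most frequent answers and the fraction of all
    answer occurrences that they cover."""
    counts = Counter(answers)
    vocab = [answer for answer, _ in counts.most_common(k)]
    covered = sum(counts[answer] for answer in vocab)
    return vocab, covered / len(answers)

# Toy answer list (illustrative only, not real VQA statistics).
answers = ["yes"] * 5 + ["no"] * 3 + ["2"] + ["red"]
vocab, coverage = top_k_answer_vocab(answers, k=2)
print(vocab, coverage)  # ['yes', 'no'] 0.8
```

Questions whose ground-truth answer falls outside the vocabulary cannot be answered correctly by the classifier, so the coverage figure acts as an upper bound on what a fixed-vocabulary model can reach.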

COCO-QA dataset ren2015exploring is automatically generated from captions in the Microsoft COCO dataset lin2014microsoft . There are 78,736 train questions and 38,948 test questions in the dataset. These questions are based on 8,000 and 4,000 images respectively. There are four types of questions including object, number, color, and location. Each type takes 70%, 7%, 17%, and 6% of the whole dataset, respectively. All answers in this data set are single word. As in ren2015exploring , we report classification accuracy as well as Wu-Palmer similarity (WUPS) in Table 2.

4.2 Setup

We use Torch to develop our model. We use the RMSprop optimizer with a base learning rate of 4e-4, momentum 0.99 and weight-decay 1e-8. We set batch size to be 300 and train for up to 256 epochs with early stopping if the validation accuracy has not improved in the last 5 epochs. The size of the hidden layer $W_s$ is set to 512 for COCO-QA and to 1024 for VQA, since VQA is a much larger dataset. All the other word embedding and hidden layers were vectors of size 512. We apply dropout with probability 0.5 on each layer. Following yang2015stacked , we rescale the image to $448 \times 448$, and then take the activation from the last pooling layer of VGGNet Simonyan14c or ResNet he2015deep as its feature.
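The stopping rule (a cap of 256 epochs, halting once validation accuracy has not improved for 5 epochs) can be sketched independently of the model. In this hypothetical helper, a plain sequence of per-epoch validation accuracies stands in for the actual training loop; the function name and the toy accuracy values are made up for the example.

```python
def early_stopping_epoch(val_accuracies, max_epochs=256, patience=5):
    """Return (best_epoch, best_accuracy) under patience-based stopping.

    val_accuracies: iterable of validation accuracies, one per epoch
    (a stand-in for running a real training epoch and evaluating).
    """
    best_acc, best_epoch = float("-inf"), -1
    for epoch, acc in enumerate(val_accuracies):
        if epoch >= max_epochs:
            break
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            # No improvement in the last `patience` epochs: stop training.
            break
    return best_epoch, best_acc

# Toy curve: accuracy peaks at epoch 2, then stalls for five epochs.
curve = [0.50, 0.52, 0.55, 0.54, 0.53, 0.55, 0.54, 0.53, 0.60]
result = early_stopping_epoch(curve)
print(result)  # (2, 0.55)
```

In the toy curve the final value 0.60 is never reached, which shows the usual trade-off of patience-based stopping: a short patience saves compute but can miss late improvements.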

4.3 Results and Analysis

There are two test scenarios on VQA: open-ended and multiple-choice. The best performing method deeper LSTM Q + norm I from antol2015vqa is used as our baseline. For the open-ended test scenario, we compare our method with the recently proposed SMem xu2015ask , SAN yang2015stacked , FDA Ilievski2016 and DMN+ xiong2016dynamic . For multiple choice, we compare with Region Sel. shih2015look and FDA Ilievski2016 . We compare with 2-VIS+BLSTM ren2015exploring , IMG-CNN ma2015learning and SAN yang2015stacked on COCO-QA. We use Ours^p to refer to our parallel co-attention, and Ours^a for alternating co-attention.

Table 1 shows results on the VQA test sets for both open-ended and multiple-choice settings. We can see that our approach improves the state of the art from 60.4% (DMN+ xiong2016dynamic ) to 62.1% (Ours^a+ResNet) on open-ended and from 64.2% (FDA Ilievski2016 ) to 66.1% (Ours^a+ResNet) on multiple-choice. Notably, for the question types Other and Num, we achieve 3.4% and 1.4% improvement on open-ended questions, and 4.0% and 1.1% on multiple-choice questions. As we can see, ResNet features outperform or match VGG features in all cases. Our improvements are not solely due to the use of a better CNN. Specifically, FDA Ilievski2016 also uses ResNet he2015deep , but Ours^a+ResNet outperforms it by 1.8% on test-dev. SMem xu2015ask uses GoogLeNet szegedy2015going and the rest all use VGGNet Simonyan14c , and Ours^a+VGG outperforms them by 0.2% on test-dev (DMN+ xiong2016dynamic ).

Table 2 shows results on the COCO-QA test set. Similar to the result on VQA, our model improves the state of the art from 61.6% (SAN(2,CNN) yang2015stacked ) to 65.4% (Ours^a+ResNet). We observe that parallel co-attention performs better than alternating co-attention in this setup. Both attention mechanisms have their advantages and disadvantages: parallel co-attention is harder to train because of the dot product between image and text which compresses two vectors into a single value. On the other hand, alternating co-attention may suffer from errors being accumulated at each round.

                          Open-Ended                                   Multiple-Choice
                          test-dev                            test-std test-dev                            test-std
Method                    Y/N      Num      Other    All      All      Y/N      Num      Other    All      All
LSTM Q+I antol2015vqa     80.5     36.8     43.0     57.8     58.2     80.5     38.2     53.0     62.7     63.1
Region Sel. shih2015look  -        -        -        -        -        77.6     34.3     55.8     62.4     -
SMem xu2015ask            80.9     37.3     43.1     58.0     58.2     -        -        -        -        -
SAN yang2015stacked       79.3     36.6     46.1     58.7     58.9     -        -        -        -        -
FDA Ilievski2016          81.1     36.2     45.8     59.2     59.5     81.5     39.0     54.7     64.0     64.2
DMN+ xiong2016dynamic     80.5     36.8     48.3     60.3     60.4     -        -        -        -        -
Ours^p+VGG                79.5     38.7     48.3     60.1     -        79.5     39.8     57.4     64.6     -
Ours^a+VGG                79.6     38.4     49.1     60.5     -        79.7     40.1     57.9     64.9     -
Ours^a+ResNet             79.7     38.7     51.7     61.8     62.1     79.7     40.0     59.8     65.8     66.1
Table 1: Results on the VQA dataset (Ours^p: parallel co-attention; Ours^a: alternating co-attention). “-” indicates the result is not available.
Method                        Object    Number    Color     Location  Accuracy  WUPS0.9   WUPS0.0
2-VIS+BLSTM ren2015exploring  58.2      44.8      49.5      47.3      55.1      65.3      88.6
IMG-CNN ma2015learning        -         -         -         -         58.4      68.5      89.7
SAN(2, CNN) yang2015stacked   64.5      48.6      57.9      54.0      61.6      71.6      90.9
Ours^p+VGG                    65.6      49.6      61.5      56.8      63.3      73.0      91.3
Ours^a+VGG                    65.6      48.9      59.8      56.7      62.9      72.8      91.3
Ours^a+ResNet                 68.0      51.0      62.9      58.8      65.4      75.1      92.0
Table 2: Results on the COCO-QA dataset (Ours^p: parallel co-attention; Ours^a: alternating co-attention). “-” indicates the result is not available.

4.4 Ablation Study

In this section, we perform ablation studies to quantify the role of each component in our model. Specifically, we re-train our approach by ablating certain components:

  • Image Attention alone, where in a manner similar to previous works yang2015stacked , we do not use any question attention. The goal of this comparison is to verify that our improvements are not the result of orthogonal contributions (say, better optimization or better CNN features).

  • Question Attention alone, where no image attention is performed.

  • W/O Conv, where no convolution and pooling is performed to represent phrases. Instead, we stack another word embedding layer on the top of word level outputs.

  • W/O W-Atten, where no word level co-attention is performed. We replace the word level attention with a uniform distribution. Phrase and question level co-attentions are still modeled.

  • W/O P-Atten, where no phrase level co-attention is performed, and the phrase level attention is set to be uniform. Word and question level co-attentions are still modeled.

  • W/O Q-Atten, where no question level co-attention is performed. We replace the question level attention with a uniform distribution. Word and phrase level co-attentions are still modeled.

Table 3 shows the comparison of our full approach w.r.t. these ablations on the VQA validation set (test sets are not recommended to be used for such experiments). The deeper LSTM Q + norm I baseline in antol2015vqa is also reported for comparison. We can see that image-attention-alone does improve performance over the holistic image feature (deeper LSTM Q + norm I), which is consistent with findings of previous attention models for VQA xiong2016dynamic ; yang2015stacked .

Method Y/N Num Other All
LSTM Q+I 79.8 32.9 40.7 54.3
Image Atten 79.8 33.9 43.6 55.9
Question Atten 79.4 33.3 41.7 54.8
W/O Q-Atten 79.6 32.1 42.9 55.3
W/O P-Atten 79.5 34.1 45.4 56.7
W/O W-Atten 79.6 34.4 45.6 56.8
Full Model 79.6 35.0 45.7 57.0
Table 3: Ablation study on the VQA dataset using +VGG.

Comparing the full model with the ablated versions without word, phrase, and question level attention reveals a clear and interesting trend: the attention mechanisms closest to the ‘top’ of the hierarchy (question) matter most, with a drop of 1.7% in accuracy if not modeled; followed by the intermediate level (phrase), with a drop of 0.3%; finally followed by the ‘bottom’ of the hierarchy (word), with a drop of 0.2% in accuracy. We hypothesize that this is because the question level is the ‘closest’ to the answer prediction layers in our model. Note that all levels are important, and our final model significantly outperforms not using any linguistic attention (1.1% difference between Full Model and Image Atten). The question attention alone model is better than LSTM Q+I, with an improvement of 0.5%, and worse than image attention alone, with a drop of 1.1%. Our alternating co-attention model further improves if we perform alternating co-attention for one more round, with an improvement of 0.3%.
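The differences quoted in this paragraph follow directly from the "All" column of Table 3; the short check below recomputes them. The dictionary simply transcribes that column, and the variable names are chosen for this example.

```python
# "All" accuracy column of Table 3 (VQA validation set).
table3_all = {
    "LSTM Q+I": 54.3,
    "Image Atten": 55.9,
    "Question Atten": 54.8,
    "W/O Q-Atten": 55.3,
    "W/O P-Atten": 56.7,
    "W/O W-Atten": 56.8,
    "Full Model": 57.0,
}

full = table3_all["Full Model"]
# Accuracy lost relative to the full model, rounded to one decimal place.
drops = {name: round(full - acc, 1)
         for name, acc in table3_all.items() if name != "Full Model"}
question_vs_baseline = round(
    table3_all["Question Atten"] - table3_all["LSTM Q+I"], 1)
question_vs_image = round(
    table3_all["Image Atten"] - table3_all["Question Atten"], 1)

print(drops["W/O Q-Atten"], drops["W/O P-Atten"], drops["W/O W-Atten"])
print(drops["Image Atten"], question_vs_baseline, question_vs_image)
```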

4.5 Qualitative Results

We now visualize some co-attention maps generated by our method in Fig. 4. At the word level, our model attends mostly to the object regions in an image, e.g., heads, bird. At the phrase level, the image attention has different patterns across images. For the first two images, the attention transfers from objects to background regions. For the third image, the attention becomes more focused on the objects. We suspect that this is caused by the different question types. On the question side, our model is capable of localizing the key phrases in the question, thus essentially discovering the question types in the dataset. For example, our model pays attention to the phrases “what color” and “how many snowboarders”. Our model successfully attends to the regions in images and phrases in the questions appropriate for answering the question, e.g., “color of the bird” and bird region. Because our model performs co-attention at three levels, it often captures complementary information from each level, and then combines them to predict the answer.

Q: what is the man holding a snowboard on top of a snow covered? A: mountain

Q: what is the color of the bird? A: white

Q: how many snowboarders in formation in the snow, four is sitting? A: 5
Figure 4: Visualization of image and question co-attention maps on the COCO-QA dataset. From left to right: original image and question pairs, word level co-attention maps, phrase level co-attention maps and question level co-attention maps. For visualization, both image and question attentions are scaled (from red:high to blue:low). Best viewed in color.

5 Conclusion

In this paper, we proposed a hierarchical co-attention model for visual question answering. Co-attention allows our model to attend to different regions of the image as well as different fragments of the question. We model the question hierarchically at three levels to capture information from different granularities. The ablation studies further demonstrate the roles of co-attention and question hierarchy in our final performance. Through visualizations, we can see that our model co-attends to interpretable regions of images and questions for predicting the answer. Though our model was evaluated on visual question answering, it can be potentially applied to other tasks involving vision and language.


This work was funded in part by NSF CAREER awards to DP and DB, an ONR YIP award to DP, ONR Grant N00014-14-1-0679 to DB, a Sloan Fellowship to DP, ARO YIP awards to DB and DP, an Allen Distinguished Investigator award to DP from the Paul G. Allen Family Foundation, ICTAS Junior Faculty awards to DB and DP, Google Faculty Research Awards to DP and DB, AWS in Education Research grant to DB, and NVIDIA GPU donations to DB. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the U.S. Government or any sponsor.

Q: what is the color of the kitten? A: black
Q: what are standing in tall dry grass look at the tourists? A: zebras
Q: where is the woman while her baby is sleeping? A: kitchen
Q: what seating area is on the right? A: park
Q: is the person dressed properly for this sport? A: yes
Figure 5: Visualization of co-attention maps on success cases in the COCO-QA (first three columns using +VGG) and VQA (last two columns +VGG) dataset. The layout is the same as Fig. 4.
Q: how many red motorcycles with riders in protective gear are on the street? A: two(three)
Q: what is the color of the bus? A: green(red)
Q: what is shown in different places? A: hydrant(toy)
Q: do the doors open in or out? A: open(in)
Q: is this an open market at night? A: yes(no)
Figure 6: Visualization of co-attention maps on failure cases in the COCO-QA (first three columns using +VGG) and VQA (last two columns +VGG) dataset. Predicted answer is in red and ground truth answer is in green. The layout is the same as Fig. 4.