Reciprocal Attention Fusion for Visual Question Answering

05/11/2018 ∙ by Moshiur R Farazi, et al.

Existing attention mechanisms for Visual Question Answering (VQA) attend either to local image-grid features or to object-level features. Motivated by the observation that questions can relate to both object instances and their parts, we propose a novel attention mechanism that jointly considers reciprocal relationships between the two levels of visual detail. The bottom-up attention thus generated is further coalesced with top-down information to focus only on the scene elements that are most relevant to a given question. Our design hierarchically fuses multi-modal information, i.e., language, object-level and grid-level features, through an efficient tensor decomposition scheme. The proposed model improves the state-of-the-art single-model performance from 67.9% to 68.2% on VQAv1 and from 65.7% to 67.4% on VQAv2, a significant boost.




1 Introduction

An AI agent equipped with visual question answering ability can respond to intelligent questions about a complex scene. This task bridges the gap between visual and language understanding to realize the longstanding goal of highly intelligent machine vision systems. Recent advances in automatic feature learning with deep neural networks allow joint processing of both visual and language modalities in a unified framework, leading to significant improvements on the challenging VQA problem [Antol et al. (2015), Krishna et al. (2016), Johnson et al. (2016), Zhu et al. (2016), Goyal et al. (2016)].

To deduce the correct answer, an AI agent needs to correlate image and question information. Existing efforts have predominantly focused on attending to local regions of the image-grid based on the language input [Xu et al. (2015), Lu et al. (2016), Yang et al. (2016), Jabri et al. (2016), Shih et al. (2016)]. Since these regions do not necessarily correspond to representative scene elements (objects, attributes and actions), there exists a "semantic gap" in such attention mechanisms. To address this issue, Anderson et al. [Anderson et al. (2018)] proposed to work at the object level, where model attention is spread over a set of possible object locations. However, the object proposal set considered in this way is non-exhaustive and can miss important aspects of a scene. Furthermore, language questions can pertain to local details about object parts and attributes, which are not encompassed by the object-level scene decomposition.

Figure 1: Applying attention to reciprocal visual features allows a VQA model to obtain the most relevant information required to answer a given visual question.

In this work, we propose to simultaneously attend to both low-level visual concepts and the high-level object-based scene representation. Our intuition is that questions can relate to objects, object parts and local attributes, so focusing on a single scene representation can degrade model capacity. To this end, we jointly attend to two reciprocal scene representations that encompass local information on the image-grid and the object-level features. The bottom-up attention thus generated is further combined with the top-down attention driven by the linguistic input. Our design draws inspiration from human cognitive psychology, where the attention mechanism is known to be a combination of both exogenous (bottom-up) and endogenous (top-down) factors [Desimone and Duncan (1995), Borji and Itti (2013)].

Given the multi-modal inputs, a critical requirement is to effectively model complex interactions between the multi-level bottom-up and top-down factors. For this purpose, we propose a multi-branch CNN architecture that hierarchically fuses visual and linguistic features by leveraging an efficient tensor decomposition mechanism [Tucker (1966), Ben-Younes et al. (2017)]. Our experiments and extensive ablation study show that language-driven attention on both image-grid and object-level representations allows a deep network to model the complex interaction between vision and language, as our model outperforms the state-of-the-art models on VQA tasks.

In summary, this paper makes the following key contributions:

  • A hierarchical architecture incorporating both the bottom-up and top-down factors pertaining to meaningful scene elements and their parts.

  • Co-attention mechanism enhancing scene understanding by combining local image-grid and object-level visual cues.

  • Extensive evaluation and ablation on both balanced and imbalanced versions of the large-scale VQA dataset achieving single model state-of-the-art performance in both.

2 Related Works

Deep Networks: Given the success of deep learning, one common approach to the VQA problem is to generate image features using pretrained Convolutional Neural Networks (CNNs), e.g., VGGNet [Simonyan and Zisserman (2014)] and ResNet [He et al. (2016)], and language features using word embeddings or Long Short-Term Memory (LSTM) networks [Antol et al. (2015)]. After generating image and language features, some approaches train RNNs to generate top-K candidate answers and use a multi-class classifier to choose the best answer [Antol et al. (2015), Zhu et al. (2016), Zhou et al. (2015)]. A number of attention mechanisms have been incorporated within deep networks to automatically focus on specific details in an image based on the given question [Lu et al. (2016), Yang et al. (2016), Jabri et al. (2016)]. Memory networks have also been incorporated in many top-performing models [Xiong et al. (2016), Sukhbaatar et al. (2015), (39)] for questions that require the system to compare attributes or use a long reasoning chain. While robust features and memory modules help capture some aspects of the semantics present in the scene, modeling the complex interplay between image-grid and object-level features can complement the understanding of the rich scene semantics.

Attention Models: The incorporation of spatial attention on the image and/or the text features has been investigated to capture the most important parts required to answer a given question [Xu et al. (2015), Yang et al. (2016), Jabri et al. (2016), Shih et al. (2016), Lu et al. (2016)]. Different pooling methods have been used previously to compute the attention maps, such as soft attention, bilinear pooling and Tucker fusion [Xu et al. (2015), Gao et al. (2016), Ben-Younes et al. (2017)]. All these techniques explore top-down attention and focus only on the image-grid. In contrast to these works, and based on the observation that questions pertain to objects, their parts and attributes, we propose to work jointly at the spatial grid of image regions and the object level. The closest to our work is [Anderson et al. (2018)], which attends to salient objects in an image for improved VQA. However, they ignore two key aspects of visual reasoning, i.e., the image-level visual features and an effective fusion mechanism to combine the bimodal interaction between visual and language features. Another recent effort [Lu et al. (2017)] co-attends to both image regions and objects, but uses a simplistic fusion mechanism based on element-wise multiplication that is outperformed by our bilinear feature encoding. Besides, our multi-level attention mechanism effectively uses object features and scene context based on the natural language queries.

3 Methods

The VQA task requires an AI agent to generate a natural language response, given a visual input (i.e. image, video) and a natural language input (i.e. questions, parse). We formulate the VQA task as a classification task, where the model predicts the correct answer ($\hat{a}$) from all possible answers for a given image ($v$) and question ($q$) pair:

$$\hat{a} = \arg\max_{a \in \mathcal{A}} \; p(a \mid v, q; \theta) \tag{1}$$

where $\theta$ denotes the set of parameters used to predict the best answer from the set of all possible answers $\mathcal{A}$.

Our proposed architecture for the VQA task is illustrated in Figure 2. Its key highlight is a hierarchical attention mechanism that focuses on complementary levels of scene detail, i.e., a grid of image regions and object proposals. The relevant co-attended features are then fused together to perform the final prediction. We name our model 'Reciprocal Attention Fusion' because it simultaneously attends to two complementary scene representations, i.e., image-grid and object proposals. Our experimental results demonstrate that both levels of scene detail are reciprocal and reinforce each other to achieve the best single-model performance on the challenging VQA task. Before elaborating on the hierarchical attention and feature fusion, we first discuss the joint feature embedding in Section 3.1.

Figure 2: Given an image-question pair, our model employs (1) Joint Feature Embedding (Sec. 3.1) to embed (a) the Language Feature $q$, (b) the Image-Level Feature $v_I$ and (c) the Object-Level Feature $v_O$. These embeddings then undergo (2) Hierarchical Attention Fusion (Sec. 3.2), which consists of (d) Image-Question and (e) Object-Question Fusion followed by top-down attention. These multi-modal representations are combined by (3) Co-attention Fusion (Sec. 3.3), which predicts an answer for the given image-question pair. Overall, the proposed model attends to complementary levels of scene detail and fuses multi-modal information to predict highly accurate answers.

3.1 Joint Feature Embedding


Let $\mathcal{V}$ be the collection of all visual features extracted from an image and $\mathcal{Q}$ be the language features extracted from the question. The objective of joint embedding is to learn the language feature representation $q$ from $\mathcal{Q}$ and the multilevel visual features $v$ from $\mathcal{V}$. These feature representations are used to encode the multilevel relationships between question and image, which in turn are used to train the classifier to select the correct answer.

Multilevel visual features: The multilevel visual embedding $v$ consists of image-level features $v_I$ and object-level features $v_O$. Our model employs ResNeXt [Xie et al. (2016)] to obtain the image-level features $v_I \in \mathbb{R}^{G \times d_I}$ by taking the output of the convolution layer before the final pooling layer, where $G$ denotes the number of spatial grid locations of the extracted visual feature with $d_I$ dimensions. This convolution layer retains the spatial information of the original image and enables the model to apply attention on the image-grid. On the other hand, our model employs object detectors to localize object instances and passes them through another deep CNN to generate object-level features $v_O \in \mathbb{R}^{N \times d_O}$ for $N$ object proposals. We use Faster R-CNN [Ren et al. (2015)] with a ResNet-101 [He et al. (2016)] backbone, pretrain the object detector on ImageNet [Deng et al. (2009)] and retrain it on the Visual Genome dataset [Krishna et al. (2016)] with class labels and attribute features, similar to [Anderson et al. (2018)].

Bottom-up (BU) Attention: In order to focus on the most relevant features, two bottom-up attention mechanisms are applied during multilevel feature extraction. The image-grid attention is generated using ResNeXt [Xie et al. (2016)] pretrained on ImageNet [Deng et al. (2009)] to obtain $v_I$, which represents $d_I$-dimensional feature vectors for an image-grid of $G$ locations over the visual input. The size and scale of the image-grid can be changed by using a different CNN architecture or by taking the output of a different convolutional layer to generate a differently sized BU attention. Meanwhile, object proposals are generated in a bottom-up fashion to encode the object-level visual features $v_O$. We select the top $N$ object proposals, whose $d_O$-dimensional feature vectors are obtained from the ROI pooling layer in the Region Proposal Network.

Language features: To represent the question embedding in an end-to-end framework, GRUs [Cho et al. (2014)] are used in a manner similar to [Fukui et al. (2016), Ben-Younes et al. (2017)]. The words in a question are encoded using a one-hot vector representation and embedded into a vector space by a word embedding matrix. The embedded word vectors are fed to a GRU initialized with the pretrained Skip-thought Vector model [Kiros et al. (2015)]. The output of the GRU is fine-tuned to get the language feature embedding $q \in \mathbb{R}^{d_q}$, where $d_q$ is the embedding dimension. The language feature embedding is used to further refine the spatial visual features (i.e. image-grid and object level) by incorporating the top-down attention discussed in Section 3.2.

3.2 Hierarchical Attention Fusion

The hierarchical attention mechanism takes the spatial visual features $v_s$, $s \in \{I, O\}$, and the language feature $q$ as input and learns a multi-modal representation $\mathcal{T}$ to predict the answer embedding $\rho$. This step can be formulated as an outer product of the multi-modal representation, visual and language embeddings as follows:

$$\rho = (\mathcal{T} \times_1 q) \times_2 v_s \tag{2}$$

where $\times_n$ denotes the n-mode tensor-matrix product. However, this approach has serious practical limitations in terms of learnable parameters for $\mathcal{T}$, as the visual and language features are very high dimensional, which results in huge computational and memory requirements. To counter this problem, our model employs a multi-modal fusion operation to encode the relationships between these two modalities, which is discussed next.

Multi-modal Fusion: Multi-modal fusion aims to reduce the number of free parameters in the tensor $\mathcal{T}$ for a fully parameterized VQA bilinear model. Our model achieves this by using Tucker decomposition [Tucker (1966)], which is a special case of higher-order principal component analysis, to express $\mathcal{T}$ as a core tensor multiplied by a matrix along each input mode. The decomposed tensors are fused in a manner similar to [Ben-Younes et al. (2017)] to capture the multi-modal relationship between the language and vision domains. The tensor $\mathcal{T}$ can be approximated as:

$$\mathcal{T} \approx \tau \times_1 W_q \times_2 W_v \times_3 W_\rho = [\![\tau;\, W_q, W_v, W_\rho]\!] \tag{3}$$

where $W_q$, $W_v$ and $W_\rho$ are factor matrices similar to principal components along each input and output embedding, and $\tau$ is the core tensor, which encapsulates interactions between the factor matrices. The notation $[\![\cdot]\!]$ is the shorthand for Tucker decomposition. In practice, the decomposed version of $\mathcal{T}$ has a significantly smaller number of parameters than the original tensor [Bader and Kolda (2007)].
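To make the parameter saving concrete, the short sketch below counts the free parameters of a full three-way tensor and of its Tucker-factorized form. All dimensions used here (2400, 2048, 2000, 310, 510) are illustrative assumptions chosen for the example, not values taken from this text, and the function names are hypothetical.

```python
def full_tensor_params(d_q, d_v, d_out):
    """Parameters of a full bilinear tensor of shape d_q x d_v x d_out."""
    return d_q * d_v * d_out


def tucker_params(d_q, d_v, d_out, t_q, t_v, t_out):
    """Parameters of a Tucker factorization: three factor matrices plus a core."""
    factors = d_q * t_q + d_v * t_v + d_out * t_out
    core = t_q * t_v * t_out
    return factors + core


# Illustrative (assumed) sizes: question dim, visual dim, number of answers,
# and the three projection dimensions of the factorization.
full = full_tensor_params(2400, 2048, 2000)
tucker = tucker_params(2400, 2048, 2000, 310, 310, 510)
print(f"full tensor: {full:,} parameters")
print(f"Tucker form: {tucker:,} parameters")
```

Under these assumed sizes the factorized form needs roughly two orders of magnitude fewer parameters, and most of what remains sits in the core tensor, which is why the projection dimensions control model complexity.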

After reducing the parameter complexity of $\mathcal{T}$ with Tucker decomposition, the fully parametrized outer product representation in Eq. 2 can be rewritten as:

$$\rho = \big((\tau \times_1 \tilde{q}) \times_2 \tilde{v}_s\big) \times_3 W_\rho \tag{4}$$

where $\tilde{q} = q^\top W_q$ and $\tilde{v}_s = v_s^\top W_v$. We define a prediction space $\tilde{\rho} = (\tau \times_1 \tilde{q}) \times_2 \tilde{v}_s \in \mathbb{R}^{t_\rho}$, where the multi-modal fusion is:

$$\rho = \tilde{\rho}^\top W_\rho \tag{5}$$

The Tucker decomposition allows our model to decompose $\mathcal{T}$ into a core tensor $\tau$ and three matrices. The first two matrices, $W_q$ and $W_v$, project the question and visual embeddings to lower $t_q$- and $t_v$-dimensional spaces; the core tensor learns to model the multi-modal interaction and projects the resulting output to a $t_\rho$-dimensional vector. The input projection dimensions ($t_q$, $t_v$) and the output projection dimension ($t_\rho$) determine the complexity of the model and the degree of multi-modal interaction, which in turn affects the performance of the model. These values are set empirically by testing them on the VQAv1 validation dataset. It has been reported in the literature [Fukui et al. (2016), Ben-Younes et al. (2017)] that applying a nonlinearity to the input feature embeddings improves the performance of multi-modal fusion. Therefore, we encode $\tilde{q}$ and $\tilde{v}_s$ with a nonlinearity during fusion. The output of the multi-modal fusion, $\rho$, passes through convolution and softmax layers to create $G$- and $N$-dimensional representations for the image-question and object-question embeddings, respectively, where $G$ is the number of grid locations and $N$ the number of object proposals. Thus, by employing hierarchical attention fusion, we embed the question with the spatial visual features to generate the image-question and object-question embeddings.
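The fusion step described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `project`, `tucker_fusion`, `W_q`, `W_v`, `W_rho` and `core` are hypothetical, tensors are stored as nested lists, and `tanh` is assumed as the nonlinearity because the text does not name one.

```python
import math


def project(x, W):
    """Return x^T W for a vector x (length d) and a matrix W (d rows, t columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def tucker_fusion(q, v, W_q, W_v, W_rho, core):
    """Fuse a question vector q and a visual vector v through a Tucker-factorized tensor.

    core has shape t_q x t_v x t_rho; W_rho has shape t_rho x d_out.
    """
    # Project both inputs to their low-dimensional spaces and apply the
    # (assumed) tanh nonlinearity.
    q_t = [math.tanh(a) for a in project(q, W_q)]
    v_t = [math.tanh(a) for a in project(v, W_v)]
    # Contract the core tensor with both projected inputs (modes 1 and 2).
    t_rho = len(core[0][0])
    rho_t = [
        sum(core[i][j][k] * q_t[i] * v_t[j]
            for i in range(len(q_t)) for j in range(len(v_t)))
        for k in range(t_rho)
    ]
    # Map the fused vector to the output space.
    return project(rho_t, W_rho)


# Tiny example: 2-d inputs, identity projections, a core with a single
# non-zero interaction, and a 1-d output scaled by 2.
identity = [[1.0, 0.0], [0.0, 1.0]]
core = [[[0.0], [1.0]], [[0.0], [0.0]]]
out = tucker_fusion([1.0, 0.0], [0.0, 1.0], identity, identity, [[2.0]], core)
print(out)
```

In a real model the explicit loops would be replaced by batched tensor contractions, but the order of operations (project, apply nonlinearity, contract with the core, project to the output) is the same.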

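Top-down (TD) Attention: The image-level and object-level features are used alongside the image-question and object-question embeddings to generate an attention distribution over the spatial grid and the object proposals, respectively. We take the weighted sum (WS) of the spatial visual feature vectors (i.e. $v_I$ and $v_O$) using the attention weights (i.e. $\alpha_I$ and $\alpha_O$) to generate $\hat{v}_I$ and $\hat{v}_O$, which are the top-down attended visual features:

$$\hat{v}_I = \mathrm{WS}(v_I, \alpha_I) = \sum_{i=1}^{G} \alpha_I^{(i)} v_I^{(i)}, \qquad \hat{v}_O = \mathrm{WS}(v_O, \alpha_O) = \sum_{j=1}^{N} \alpha_O^{(j)} v_O^{(j)} \tag{6}$$

where $G$ and $N$ are the numbers of grid locations and object proposals.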
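As a rough illustration of the weighted-sum step, the sketch below turns raw attention scores into a distribution with a softmax and uses it to pool a set of feature vectors. The helper names `softmax` and `weighted_sum` are hypothetical, and the two toy feature vectors stand in for grid cells or object proposals.

```python
import math


def softmax(scores):
    """Convert raw scores into attention weights that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(features, weights):
    """Pool a list of feature vectors into one vector using attention weights."""
    dim = len(features[0])
    return [sum(w * f[k] for w, f in zip(weights, features)) for k in range(dim)]


# Two locations with equal scores receive equal weight, so the attended
# feature is the plain average of the two vectors.
alpha = softmax([0.0, 0.0])
attended = weighted_sum([[1.0, 2.0], [3.0, 4.0]], alpha)
print(alpha, attended)
```

The same two functions would apply unchanged to the image-grid branch and the object-proposal branch; only the number of feature vectors differs.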


3.3 Co-attention Fusion

The attended image-question and object-question visual features represent a combination of the visual and language features that are most important for generating an answer to a given question. We concatenate these two bimodal representations to create the final visual-question embedding. The visual-question embedding and the original question embedding again undergo the same multi-modal fusion as in Eq. 5. The only difference is that our model now uses two-glimpse attention, which was found to yield better results [Fukui et al. (2016), Ben-Younes et al. (2017), Kim et al. (2016)]. The output of the final fusion is then passed to the classifier, which predicts the best answer from the answer dictionary given the question and visual input.

| Methods | Test-dev Y/N | Test-dev No. | Test-dev Other | Test-dev All | Test-standard Y/N | Test-standard No. | Test-standard Other | Test-standard All |
|---|---|---|---|---|---|---|---|---|
| RAF (Ours) | 85.9 | 41.3 | 58.7 | 68.0 | 85.8 | 41.4 | 58.9 | 68.2 |
| ReasonNet Ilievski and Feng (2017) | - | - | - | - | 84.0 | 38.7 | 60.4 | 67.9 |
| MFB+CoAtt+Glove Yu et al. (2018) | 85.0 | 39.7 | 57.4 | 66.8 | 85.0 | 39.5 | 57.4 | 66.9 |
| Dual-MFA Lu et al. (2017) | 83.6 | 40.2 | 56.8 | 66.0 | 83.4 | 40.4 | 56.9 | 66.1 |
| MLB+VG Kim et al. (2016) | 84.1 | 38.0 | 54.9 | 65.8 | - | - | - | - |
| MCB+Att+GloVe Fukui et al. (2016) | 82.3 | 37.2 | 57.4 | 65.4 | - | - | - | - |
| MLAN Yu et al. (2017) | 81.8 | 41.2 | 56.7 | 65.3 | 81.3 | 41.9 | 56.5 | 65.2 |
| MUTAN Ben-Younes et al. (2017) [1] | 84.8 | 37.7 | 54.9 | 65.2 | - | - | - | - |
| DAN (ResNet) Nam et al. (2016) | 83.0 | 39.1 | 53.9 | 64.3 | 82.8 | 38.1 | 54.0 | 64.2 |
| HieCoAtt Lu et al. (2016) | 79.7 | 38.7 | 51.7 | 61.8 | - | - | - | 62.1 |
| A+C+K+LSTM Wu et al. (2016) | 81.0 | 38.4 | 45.2 | 59.2 | 81.1 | 37.1 | 45.8 | 59.4 |
| VQA LSTM Q+I Antol et al. (2015) | 80.5 | 36.8 | 43.1 | 57.8 | 80.6 | 36.5 | 43.7 | 58.2 |
| SAN Yang et al. (2016) | 79.3 | 36.6 | 46.1 | 58.7 | - | - | - | 58.9 |
| AYN Malinowski et al. (2017) | 78.4 | 36.4 | 46.3 | 58.4 | 78.2 | 37.1 | 45.8 | 59.4 |
| NMN Andreas et al. (2016) | 81.2 | 38.0 | 44.0 | 58.6 | - | - | - | 58.7 |
| DMN+ Xiong et al. (2016) | 60.3 | 80.5 | 48.3 | 56.8 | - | - | - | 60.4 |
| iBOWIMG Zhou et al. (2015) | 76.5 | 35.0 | 42.6 | 55.7 | 76.8 | 35.0 | 42.6 | 55.9 |

[1] Single model performance is evaluated using their publicly available code.

Table 1: Comparison of the state-of-the-art methods with our single model performance on VQAv1.0 test-dev and test-standard server.

4 Experiments

4.1 Dataset

We perform experiments on VQAv1 [Antol et al. (2015)] and VQAv2 [Goyal et al. (2016)], both of which are large-scale VQA datasets. VQAv1 contains over 200K images from the COCO dataset with 610K natural language open-ended questions. VQAv2 [Goyal et al. (2016)] contains almost twice as many questions for the same number of images. VQAv2 has balanced image-question pairs to mitigate language bias, which allows a more realistic evaluation protocol. Visual Genome is another large-scale dataset that has image-question pairs with dense annotations of objects and attributes [Krishna et al. (2016)]. We take a Faster R-CNN model pretrained on ImageNet and train it again on the Visual Genome dataset with class and attribute labels to extract object-level features from the input image.

4.2 VQA Model Architecture

Question Feature Embedding: Our model embeds the question features by first generating the question and answer dictionaries from the training and validation sets of the VQA datasets. We convert the questions and answers to lower case, remove punctuation and perform other standard preprocessing steps before tokenizing the words and representing them as one-hot vectors. As mentioned in Section 3.1, these question embeddings are fed to GRUs pretrained with the Skip-thoughts model [Kiros et al. (2015)], which generates the language feature embedding for the given question. When experimenting with VQAv1 and VQAv2, we parse the questions from the respective training and validation sets to create the question vocabulary.
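The preprocessing just described can be sketched roughly as follows. This is a minimal illustration rather than the authors' pipeline: the function names `tokenize`, `build_vocab` and `one_hot` are hypothetical, and the "other standard preprocessing steps" mentioned in the text are not reproduced.

```python
import string


def tokenize(question):
    """Lower-case a question, strip ASCII punctuation and split on whitespace."""
    cleaned = question.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()


def build_vocab(questions):
    """Assign each distinct token an index in order of first appearance."""
    vocab = {}
    for question in questions:
        for token in tokenize(question):
            vocab.setdefault(token, len(vocab))
    return vocab


def one_hot(index, size):
    """Return a one-hot list of the given size with a 1 at position index."""
    vec = [0] * size
    vec[index] = 1
    return vec


vocab = build_vocab(["Is it red?", "Is it blue?"])
encoded = [one_hot(vocab[t], len(vocab)) for t in tokenize("Is it blue?")]
print(vocab)
print(encoded)
```

In practice the one-hot vectors are never materialized; the token indices are used directly to look up rows of the word embedding matrix.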

Answer Encoding: We formulate the VQA task as a classification task. We create an answer dictionary from the training data and select the most frequent answers as the different classes. We pass the output of the final fusion layer through a convolutional layer that outputs a fixed-dimensional vector. This vector is passed through the classifier to predict $\hat{a}$.
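Building the answer classes amounts to a frequency count over the training answers. The sketch below shows one way to do this; the function name `build_answer_classes` and the parameter `top_k` are hypothetical, and the number of classes to keep is left as an argument because it is not stated in this text.

```python
from collections import Counter


def build_answer_classes(answers, top_k):
    """Map the top_k most frequent (normalized) answers to class indices."""
    counts = Counter(a.lower().strip() for a in answers)
    return {ans: idx for idx, (ans, _) in enumerate(counts.most_common(top_k))}


# Toy example: keep the two most frequent answers as classes.
classes = build_answer_classes(["yes", "no", "Yes", "2", "yes", "no"], top_k=2)
print(classes)
```

Training questions whose ground-truth answer falls outside the retained classes have no valid label, so a pipeline built this way would typically drop them or map them to a catch-all class.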

We use the Adam solver [Kingma and Ba (2014)] and keep the learning rate, batch size and other training parameters the same for all our experiments. We use NVidia Tesla P100 (SXM2) GPUs to train our models and report our experimental results on the VQAv1 [Antol et al. (2015)] and VQAv2 [Goyal et al. (2016)] datasets.

| Methods | Test-dev Y/N | Test-dev No. | Test-dev Other | Test-dev All | Test-standard Y/N | Test-standard No. | Test-standard Other | Test-standard All |
|---|---|---|---|---|---|---|---|---|
| RAF (Ours) | 84.1 | 44.9 | 57.8 | 67.2 | 84.2 | 44.4 | 58.0 | 67.4 |
| BU, adaptive K Teney et al. (2017) | 81.8 | 44.2 | 56.1 | 65.3 | 82.2 | 43.9 | 56.3 | 65.7 |
| MFB Yu et al. (2018) | - | - | - | 64.9 | - | - | - | - |
| ReasonNet Ilievski and Feng (2017) | - | - | - | - | 78.9 | 42.0 | 57.4 | 64.6 |
| MUTAN Ben-Younes et al. (2017) [2] | 80.7 | 39.4 | 53.7 | 63.2 | 80.9 | 38.6 | 54.0 | 63.5 |
| MCB Fukui et al. (2016); Goyal et al. (2016) | - | - | - | - | 77.4 | 36.7 | 51.2 | 59.1 |
| HieCoAtt Lu et al. (2016); Goyal et al. (2016) | - | - | - | - | 71.8 | 36.5 | 46.3 | 54.6 |
| Language only Goyal et al. (2016) | - | - | - | - | 67.1 | 31.6 | 27.4 | 44.3 |
| Common answer Goyal et al. (2016) | - | - | - | - | 61.2 | 0.4 | 1.8 | 26.0 |

[2] Performance on VQAv2 is evaluated from their publicly available repository.

Table 2: Comparison of the state-of-the-art methods with our single model performance on VQAv2.0 test-dev and test-standard server.

5 Results

We evaluate the proposed model's performance on the VQA test servers, which ensure blind evaluation on the VQAv1 [Antol et al. (2015)] and VQAv2 [Goyal et al. (2016)] test sets (i.e. test-dev, test-standard), following the VQA benchmark evaluation approach. The accuracy of the predicted answer $\hat{a}$ is calculated with the following formulation:

$$\mathrm{Acc}(\hat{a}) = \min\left(\frac{\#\,\text{humans that provided } \hat{a}}{3},\; 1\right) \tag{7}$$

which means that the answer provided by the model is given 100% accuracy if at least 3 of the human annotators who helped create the VQA dataset gave that exact answer.
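The accuracy rule above is simple enough to state directly in code. The sketch below is an illustration with a hypothetical function name, `vqa_accuracy`, and it uses exact string matching; any answer normalization applied by the official evaluation server is not reproduced here.

```python
def vqa_accuracy(predicted, human_answers):
    """Score a prediction as min(#matching human answers / 3, 1)."""
    matches = sum(1 for answer in human_answers if answer == predicted)
    return min(matches / 3.0, 1.0)


# A question with ten human answers: seven say "yes", two "no", one "maybe".
humans = ["yes"] * 7 + ["no"] * 2 + ["maybe"]
for candidate in ("yes", "no", "maybe", "blue"):
    print(candidate, vqa_accuracy(candidate, humans))
```

A prediction therefore earns partial credit when only one or two annotators agree with it, and full credit once three or more do.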

In Table 1, we report VQAv1 test-dev and test-standard accuracies for our proposed RAF model and compare it with other single models found in the literature. Remarkably, our model outperforms all other models in overall accuracy. We report a performance boost of 1.2% on the test-dev set and 0.3% on the test-standard set over the best competing entries in Table 1. Note that using multiple ensembles and data augmentation with complementary training on Visual Genome QA pairs can increase the accuracy of VQA models. For instance, MCB [Fukui et al. (2016)], MLB [Kim et al. (2016)], MUTAN [Ben-Younes et al. (2017)] and MFB [Yu et al. (2018)] employ similar model ensembles consisting of 7, 7, 5 and 7 models respectively when reporting overall accuracy on the test-standard set. It is interesting to note that, except for MFB (7), all other ensemble models score below our reported single-model performance. We do not ensemble our model or use data augmentation with a complementary dataset, as it makes the best results irreproducible and most of the models in the literature do not adopt this strategy.

We also evaluate our model on the VQAv2 test-standard dataset and compare it with state-of-the-art single-model performance in Table 2, illustrating that our model surpasses the closest method [Teney et al. (2017)] in all question categories and overall by a significant margin of 1.7%. The bottom-up, adaptive-K model [Teney et al. (2017)] is the same model whose 30-ensemble version [Anderson et al. (2018)] currently reports the best performance on the VQAv2 test-standard dataset. This indicates our model's superior capability to interpret and incorporate multi-modal relationships for visual reasoning.

In summary, our model achieves state-of-the-art performance on both the VQAv1 and VQAv2 datasets, which affirms the robustness of our model against language bias without the need for data augmentation or model ensembles. We also show qualitative results in Fig. 4 to demonstrate the efficacy and complementary nature of attention focused on the image-grid and object proposals.

| Cat. | Methods | Val-set |
|---|---|---|
| I | RAF-I (ResNet) | 53.9 |
| I | HieCoAtt Lu et al. (2016); Goyal et al. (2016) | 54.6 |
| I | RAF-I (ResNeXt) | 58.0 |
| I | MCB Fukui et al. (2016); Goyal et al. (2016) | 59.1 |
| I | MUTAN Ben-Younes et al. (2017) | 60.1 |
| II | Up-Down Anderson et al. (2018) | 63.2 |
| II | RAF-O (ResNet) | 63.9 |
| III | RAF-IO (ResNet-ResNet) | 64.0 |
| III | RAF-IO (ResNeXt-ResNet) | 64.2 |

Table 3: Ablation Study on VQAv2 val-set.
Figure 3: Accuracy vs. Complexity (no. of parameters) comparison.

5.1 Ablation Study

We perform an extensive ablation study of the proposed model on the VQAv2 [Goyal et al. (2016)] validation dataset and compare it with the best-performing models in Table 3. This ablation study helps to better understand the contribution of different components of our model towards the overall performance on the VQA task. Its objective is to show that when the language features are combined with both image-grid and object-level visual features, the accuracy of the high-level visual reasoning task (i.e. VQA) increases compared to combining language with only image-level or only object-level features. The models reported in Category I of Table 3 use only image-level features extracted with deep CNNs, and we compare them with RAF-I, a variant of our proposed RAF architecture that uses only image-level features. We observe that RAF-I achieves comparable performance in this category. In Category II, the RAF-O model extracts only object-level features but outperforms the models in Category I. Anderson et al. (2018) also used only object-level features, and this variant of our model achieves comparable performance to that model. When we combine image-level and object-level features in Category III, we obtain the best results. This supports our hypothesis that questions relate to objects, object parts and local attributes, which should be attended to jointly for improved VQA performance.

The recent Dual-MFA [Lu et al. (2017)] model also uses complementary image-level and object-level features. In contrast, our model uses a more efficient bimodal attention fusion mechanism and exhibits robustness on the balanced VQAv2 [Goyal et al. (2016)] dataset. We also study the accuracy vs. complexity (no. of parameters) trade-off in Fig. 3 on the VQAv1 test-dev set, as most of the bilinear models do not report performance on VQAv2. Remarkably, our RAF model achieves a significant performance boost over Dual-MFA (66% to 68%) with around half the complexity.

Figure 4: Qualitative results of the proposed Reciprocal Attention Fusion mechanism for Visual Question Answering. Given a question and an image, attention based on the image-grid and on object proposals is shown. Correct and incorrect answers are shown in green and red, respectively. Remarkably, the two attention levels provide complementary information about localized regions and objects, which in turn helps in obtaining the correct answer. In some failure cases of our technique, ambiguous attention maps lead to incorrect predictions.

6 Conclusion

We build our proposed model on the hypothesis that multi-level visual features and the associated attention can provide an AI agent with additional information pertinent to deep visual understanding. As VQA is a standard measure of image understanding and visual reasoning, we propose a VQA model that learns to capture the bimodal feature representation from the visual and language domains. To this end, we employ state-of-the-art CNN architectures to obtain visual features for local regions on the image-grid and for object proposals. Based on these feature encodings, we develop a hierarchical co-attention scheme that learns the mutual relationships between objects, object parts and given questions to predict the best response. We validate our hypothesis by evaluating the proposed model on the test servers of two large-scale VQA datasets, followed by an extensive ablation study, and report state-of-the-art performance.


The authors would like to thank CSIRO Scientific Computing team, especially Peter Campbell and Ondrej Hlinka, for their assistance in optimizing the work-flow on GPU clusters.