Inverse Visual Question Answering with Multi-Level Attentions

09/17/2019 ∙ by Yaser Alwatter, et al. ∙ Carleton University

In this paper, we propose a novel deep multi-level attention model to address inverse visual question answering. The proposed model generates regional visual and semantic features at the object level and then enhances them with the answer cue using attention mechanisms. Two levels of attention are employed in the model: dual attention at the partial question encoding step and dynamic attention at the next question word generation step. We evaluate the proposed model on the VQA V1 dataset, where it demonstrates state-of-the-art performance in terms of multiple commonly used metrics.


1 Introduction

Inverse Visual Question Answering (iVQA) [18, 19] is an interesting new problem that has emerged in computer vision after the great success of deep learning on the VQA task [3, 1, 17, 12]. iVQA can be viewed as an extension of the standard Visual Question Generation (VQG) task [23, 34, 9]. However, different from VQG, which automatically generates general questions from given images, iVQA restricts the question generation to correspond to a given image-answer pair. It has been observed that a VQA system can predict correct answers merely from the textual questions without considering the corresponding images [7], which indicates the limitation of VQA for image understanding. By contrast, since an answer phrase is typically much shorter than a question, it is necessary to integrate both visual image understanding and language semantic understanding to yield an effective iVQA system. This makes iVQA more challenging than VQA, as both the image and the answer, as well as their complex interactions, must be captured. It provides a platform for enhancing machine visual and textual understanding. Meanwhile, iVQA systems can be used as inexpensive tools to generate questions and create richer datasets for VQA.

As a newly emerged task, iVQA has been studied in a few previous works [19, 18] since its introduction in [19]. The work in [19] tackled iVQA by enhancing the answer cue with high-level semantic information only at the initial step of question generation. The question generation model then dynamically attends to important parts of a given image based on the current generation state and the answer cue. The follow-up work in [18] further uses variational autoencoders to enhance the question diversity of iVQA. In these works, however, the complex interaction and consistent alignment of textual semantic features with useful object-level visual features, which are important for effective attention, remain unexplored. The potential selection of regions that are not informative or not aligned with objects may negatively affect question generation.

In this paper, we propose a novel multi-level attention model to address inverse visual question answering. The proposed model induces consistent regional textual and visual feature alignment with the answer cues via attention mechanisms. It extends a bottom-up attention model [1] to extract visual features at the level of objects and salient regions, as well as to generate object and attribute labels, i.e., semantic (or textual) features. Different from previous iVQA models [19, 18], which simply integrate the answer and high-level semantic concepts to provide an initial glimpse for the partial question encoder, we deploy a dual guiding attention module that attends to the relevant object-level parts of both visual and semantic features based on the answer cue for partial question encoding. At the question generation level, we first statically attend to the concatenation of visual and semantic features with the answer cue, and then deploy a dynamic attention module to attend to important regions for next question word generation. We expect such a multi-level attention based model to generate visually informative and answer-consistent questions by enforcing a correct alignment between the visual features and the textual answer cue, and to achieve better image and text understanding. To validate the effectiveness of the proposed model, we conduct experiments on the VQA V1 dataset [2]. The results show that our proposed model achieves state-of-the-art performance in terms of many benchmark metrics compared with other competitors [19, 18].

2 Related Work

iVQA is closely related to the tasks of image caption generation and visual question answering. In this section, we provide a brief review of related work and techniques for image caption generation, visual question answering, and visual question generation.

Image Caption Generation. The standard CNN-RNN model [27] is considered the base for most state-of-the-art image captioning models. It uses a convolutional neural network (CNN) for visual feature extraction and a recurrent neural network (LSTM) for caption generation. After the initial success of this simple model, Xu et al. [29] improved it by introducing an attention mechanism that helps the model selectively focus on important parts of the image rather than considering its whole content. This powerful technique has inspired researchers to study different attention variants. You et al. [31] presented a semantic attention strategy that helps the model attend to the important semantic concepts of the input image. Lu et al. [21] proposed a more intuitive attention mechanism, namely adaptive attention. Their model enables visual attention when there is a need to consider the visual cue and disables it when the visual cue does not correspond to the currently generated word. More recently, considerable improvements in image captioning have been achieved by combining both bottom-up and top-down attention [1]. Although the models developed for image captioning are not directly applicable to other tasks such as VQA, VQG, and iVQA, the principles of these techniques can be adapted and generalized.

Visual Question Answering. VQA is similar to iVQA in that both tasks must capture complex textual and visual interactions and translate them into written words. Unlike image captioning, the primary challenge of VQA is to extract multi-modal joint features that sufficiently express the relationships between the visual content and the textual content. To solve this problem, Zhou et al. [35] used the conventional concatenation method to combine textual and visual features. The moderate results of this straightforward method urged researchers to find more advanced techniques. One such technique is bilinear pooling, which allows the visual and textual features to fully interact with each other. However, the direct implementation of this method is computationally expensive and hence not practical for the complex VQA task. Many algorithms were then developed to reduce the number of parameters needed to perform the bilinear pooling operation. For example, Multi-modal Low-rank Bilinear (MLB) pooling [12] decomposes the high-dimensional weight tensor needed for the bilinear pooling operation into three lower-dimensional weight matrices that can easily fit into GPU memory. Another method, Multi-modal Factorized Bilinear pooling (MFB) [33], attempts to capture complex interactions by first expanding the textual and visual features into a high-dimensional feature vector and then squeezing it into a low-dimensional feature vector. MFB has demonstrated good performance in many state-of-the-art VQA models. In this work, we also exploit MFB to capture complex interactions between the answer and image cues.

Similar to image captioning, the attention mechanism also plays an important role in boosting VQA performance. Instead of considering the whole content of an image equally, the attention mechanism helps VQA models focus on the parts corresponding to the given question [4]. Many visual attention variants, e.g., stacked attention [30] and hierarchical co-attention [22], have been proposed for VQA. Besides visual attention, semantic attention has also been explored; Yu et al. [32] designed a multi-level visual-semantic attention model that focuses on both the important regions and the semantic concepts corresponding to a given question. Nevertheless, as previously discussed, the rich semantic information contained in the questions can allow a VQA system to produce the right answers without any visual understanding of the image [7], which motivates the new task of iVQA.

Visual Question Generation. VQG is a new deep learning task that aims to build an artificial intelligence system that can recognize the content of an image and generate relevant questions. One of the first research works focusing on this task was conducted by Mostafazadeh et al. [23], who repurposed the standard CNN-RNN model for VQG. Similar to the case of image captioning, this model became the base for many advanced state-of-the-art VQG models. To enhance question diversity, the model in [34] leveraged DenseCap [10] to generate questions corresponding to the main objects in a given image. Jain et al. [9] improved question diversity by integrating the vanilla variational autoencoder [14] into their model. Nevertheless, the aforementioned models generate questions merely based on images. Although this is very helpful for enhancing visual and linguistic understanding, the generated questions may be too generic and lack sufficient understanding of the given image. By contrast, the iVQA task proposed in [19] is more challenging, as an iVQA model requires a precise understanding of the given image and good reasoning to generate specific questions corresponding to an image-answer pair. Two main approaches have been proposed for iVQA [19, 18]. In [19], a powerful model is developed by incorporating two key components: the semantic concepts, which are used together with a given answer to provide an initial first glance for question generation, and the dynamic attention, which dynamically focuses on important parts of a given image while generating the question word by word. In [18], the authors leveraged a variational autoencoder to improve the capability of generating multiple corresponding questions. After developing both VQA and VQG models, Li et al. [16] exploited the complementary relationship between the two tasks and proposed the first model that can work with both VQA and iVQA. This study constituted an important step toward confirming the importance of iVQA in improving VQA.

Figure 1: The proposed multi-level attention iVQA model. The model consists of the following components: (1) a GRU-based answer encoder, which encodes the answer phrase into the semantic embedding space; (2) a Faster R-CNN based image encoder, which produces visual and semantic features for the objects and salient regions; (3) the dual guiding attention, which produces answer-attended visual and semantic features; (4) multi-modal feature fusion (MFB), which fuses the answer cue with the visual and semantic features; and (5) a two-layer GRU model, which recurrently generates the sequence of question words with a dynamic attention mechanism.

3 Approach

Given an image $I$ and an answer phrase $A$, inverse visual question answering (iVQA) generates a corresponding question $\hat{q}$ by maximizing the following conditional probability:

$\hat{q} = \arg\max_{q}\; p(q \mid I, A;\, \theta)$  (1)

where $\theta$ denotes the model parameters.

3.1 iVQA Model with Multi-Level Attentions

In this section, we present a novel multi-level attention model for inverse visual question answering. The architecture of the proposed model is illustrated in Figure 1.

Compared to previous iVQA methods, our proposed model has strengths in the following aspects: (1) We encode the input image with a Faster R-CNN module, which generates both object-level semantic labels (e.g., “brown horse”) and the associated regional visual features. For iVQA, this image encoder enables consistent alignment of the visual image features and the textual answer cue, while providing a suitable foundation for regional selective attention mechanisms. (2) We use a dual guiding attention mechanism to attend to the visual and semantic features based on the answer cue for partial question encoding. (3) We adopt the powerful Multi-modal Factorized Bilinear pooling (MFB) [33] method to capture complex interactions between the image and answer features and fuse them into answer-aware image representations for dynamic attention and question decoding (next word generation). Below we present the key component modules of the proposed iVQA model.

Figure 2: Illustration of the Faster R-CNN image encoder module. The detected objects are illustrated in the bottom picture. Each object is localized with a bounding box and annotated with attribute and object labels.

3.1.1 Semantic Answer Encoder

Compared to questions, answers are typically very short phrases of only a few words, and each word in the phrase carries semantic meaning. To produce a semantically informative encoding for an answer phrase, we exploit a natural language processing technique, GloVe [24], to encode each word into a 300-dimensional semantic embedding vector. Moreover, we use a recurrent neural network with Gated Recurrent Units (GRUs) [6] to capture the dependencies between the sequence of words in the answer phrase, and use the hidden state output of the last GRU cell as the answer encoding vector, denoted $a$.
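
To make the encoder concrete, the following PyTorch-style sketch shows one way such an answer encoder could be implemented; the hidden size, variable names, and the way the pretrained GloVe vectors are loaded are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class AnswerEncoder(nn.Module):
    """Encode a short answer phrase into a single vector with GloVe + GRU.

    A minimal sketch: the 300-d embedding matches GloVe, but the hidden
    size (512) and the pretrained-weight loading are assumptions.
    """
    def __init__(self, glove_weights, hidden_size=512):
        super().__init__()
        # glove_weights: (vocab_size, 300) tensor of pretrained GloVe vectors
        self.embed = nn.Embedding.from_pretrained(glove_weights, freeze=False)
        self.gru = nn.GRU(input_size=300, hidden_size=hidden_size,
                          batch_first=True)

    def forward(self, answer_tokens):
        # answer_tokens: (batch, max_answer_len) word indices, zero-padded
        x = self.embed(answer_tokens)            # (batch, len, 300)
        _, h_last = self.gru(x)                  # h_last: (1, batch, hidden)
        return h_last.squeeze(0)                 # answer encoding vector a
```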

3.1.2 Semantic and Visual Image Encoder

Different from previous iVQA approaches [19, 18], which use visual features extracted from a deep CNN model (ResNet [8]), we adopt an object-based regional feature extraction model, Faster R-CNN [25] (with ResNet101 [8] pre-trained on ImageNet [26]), to perform image encoding. Faster R-CNN is a state-of-the-art tool for object detection in computer vision. We use it to extract object-level visual features and the associated object and attribute labels from input images. The process is illustrated in Figure 2. This encoding module has two main steps. In the first step, Faster R-CNN generates region proposals with their corresponding bounding boxes. Specifically, we first pass a given image into a residual network (ResNet101) to extract a spatial feature map, which is then passed into a Region Proposal Network (RPN) to generate object proposals with a sliding window. For each window, there are multiple anchors that consider different possibilities for a potential object location. At each anchor, the RPN uses binary classification to predict how likely the current anchor is to contain an object of interest. Whenever the RPN finds an object, it predicts a bounding box to localize it. In the second step, the shapes of the region proposals are unified by applying a Region-of-Interest (ROI) pooling layer. The generated visual features are then passed to the final classification module, which predicts object and attribute classes (e.g., “brown horse”) while refining the corresponding boxes with regression.

The model is trained on the Visual Genome dataset [15]. For each image, we apply the encoder model to extract visual features from the top-$k$ most probable predicted object regions, obtaining $k$ final visual features, $V = \{v_1, \dots, v_k\}$. To enhance these visual features with semantic information, we use the predicted attribute label (e.g., “brown”) and object label (e.g., “horse”) for each of these regions. Similar to the answer encoder, we encode each label into a 300-dimensional embedding vector with GloVe [24]. This leads to $k$ regional attribute embeddings and $k$ object label embeddings, which together form the textual semantic features, $S = \{s_1, \dots, s_k\}$, in the same semantic space as the answer cue. The visual and semantic features can also be concatenated to form an enhanced image representation, $M = \{m_1, \dots, m_k\}$, such that $m_i = [v_i;\, s_i]$.
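
As a minimal sketch of how the enhanced representation $M$ could be assembled from the Faster R-CNN outputs, assuming for illustration that the attribute and object label embeddings are merged by concatenation:

```python
import torch

def build_enhanced_features(region_feats, attr_embs, obj_embs):
    """Concatenate regional visual features with their label embeddings.

    region_feats: (k, d_v) visual features from Faster R-CNN regions.
    attr_embs, obj_embs: (k, 300) GloVe embeddings of the predicted
    attribute and object labels. Merging the two label embeddings by
    concatenation is an assumption of this sketch.
    """
    S = torch.cat([attr_embs, obj_embs], dim=1)   # (k, 600) semantic features
    M = torch.cat([region_feats, S], dim=1)       # (k, d_v + 600), m_i = [v_i; s_i]
    return S, M
```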

3.1.3 Dual Guiding Attention

Given the outputs of the answer encoder and the image encoder, i.e., $a$, $V$, and $S$, question generation is conducted using a two-layer GRU model. In order to provide the GRU model with an initial idea of the importance of the semantic and visual concepts based on the answer cue, and to guide the question generation, we use a dual guiding attention mechanism to attend to the visual and semantic representations in parallel.

Visual Attention. The visual attention aims to attend to the most important objects in the image based on the answer cue. The attention weight for each regional visual vector, $v_i$, is computed as follows:

$c^v_i = P_v^\top \tanh\big( (W_v v_i) \odot (W_{av}\, a) + b_v \big)$  (2)

where $W_v$, $W_{av}$, $b_v$, and $P_v$ are the model parameters, and $\odot$ denotes the Hadamard product operator. We then pass the visual attention weights through a softmax layer to compute the attended visual representation of the image, $\hat{v}$:

$\hat{v} = \sum_{i=1}^{k} \mathrm{softmax}(c^v)_i\, v_i$  (3)

Semantic Attention. We compute the attended semantic representation of the image in a similar way. The attention weight for each semantic feature vector, $s_i$, is computed as follows:

$c^s_i = P_s^\top \tanh\big( (W_s s_i) \odot (W_{as}\, a) + b_s \big)$  (4)

where $W_s$, $W_{as}$, $b_s$, and $P_s$ are the model parameters. The attended semantic features, $\hat{s}$, are then computed as:

$\hat{s} = \sum_{i=1}^{k} \mathrm{softmax}(c^s)_i\, s_i$  (5)

The attended visual and semantic features are then concatenated to form the input features, $g = [\hat{v};\, \hat{s}]$, for the partial question encoding GRU.
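
A PyTorch-style sketch of one branch of the dual guiding attention, following the form of Eqs. (2)-(3) as reconstructed above; the hidden dimension and layer names are assumptions.

```python
import torch
import torch.nn as nn

class GuidedAttention(nn.Module):
    """Answer-guided attention over k regional feature vectors.

    One branch of the dual guiding attention; the same module can be
    instantiated separately for the visual features V and the semantic
    features S. Dimensions are assumptions for illustration.
    """
    def __init__(self, feat_dim, ans_dim, hidden_dim=512):
        super().__init__()
        self.W_f = nn.Linear(feat_dim, hidden_dim, bias=True)   # W, b
        self.W_a = nn.Linear(ans_dim, hidden_dim, bias=False)   # answer projection
        self.P = nn.Linear(hidden_dim, 1, bias=False)           # scoring vector

    def forward(self, feats, ans):
        # feats: (batch, k, feat_dim); ans: (batch, ans_dim)
        joint = torch.tanh(self.W_f(feats) * self.W_a(ans).unsqueeze(1))
        scores = self.P(joint).squeeze(-1)                 # (batch, k), Eq. (2)
        weights = torch.softmax(scores, dim=-1)            # attention weights
        attended = (weights.unsqueeze(-1) * feats).sum(1)  # (batch, feat_dim), Eq. (3)
        return attended
```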

3.1.4 Multi-Modal Feature Fusion

To capture complex interactions between the image and the answer cue, we use a multi-modal factorized bilinear pooling (MFB) operation [33] to fuse the image and answer representation vectors. MFB was developed to reduce the computational cost of the standard bilinear pooling operation between question and visual features in visual question answering. We adopt MFB to fuse the answer encoding vector $a$ and the enhanced image representation $M$.

MFB is a two-step algorithm with an expansion step and a squeeze step. For each enhanced image feature vector $m_i$ and the answer encoding vector $a$, we first project both into a high-dimensional space and fuse them, and then squeeze the fused vector using a sum pooling operation with window size $\phi$:

$\tilde{f}_i = (U^\top m_i + b_u) \odot (P^\top a + b_p)$  (6)

$f_i = \mathrm{SumPool}(\tilde{f}_i,\ \phi)$  (7)

where $U$ and $P$ are the high-dimensional projection weight matrices, $b_u$ and $b_p$ are the bias parameter vectors, and $\phi$ is the MFB pooling parameter ($\phi = 5$). The MFB feature vector $f_i$ is then processed by a signed square-root normalization and an $\ell_2$ normalization.
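
A sketch of the MFB expansion-and-squeeze operation of Eqs. (6)-(7) in PyTorch style; the output dimensionality and layer names are assumptions, while the factor $\phi = 5$ and the power/$\ell_2$ normalization follow the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFBFusion(nn.Module):
    """Multi-modal factorized bilinear pooling (expand -> squeeze).

    A sketch following Eqs. (6)-(7): phi=5 follows the text; the exact
    layer sizes are assumptions.
    """
    def __init__(self, img_dim, ans_dim, out_dim=1000, phi=5):
        super().__init__()
        self.phi = phi
        self.proj_img = nn.Linear(img_dim, out_dim * phi)   # U, b_u
        self.proj_ans = nn.Linear(ans_dim, out_dim * phi)   # P, b_p

    def forward(self, m, a):
        # m: (batch, k, img_dim) enhanced regional features
        # a: (batch, ans_dim) answer encoding
        expanded = self.proj_img(m) * self.proj_ans(a).unsqueeze(1)   # Eq. (6)
        b, k, _ = expanded.shape
        # sum-pool every phi consecutive units (the "squeeze" step), Eq. (7)
        f = expanded.view(b, k, -1, self.phi).sum(dim=-1)
        # signed square-root (power) and L2 normalization
        f = torch.sign(f) * torch.sqrt(torch.abs(f) + 1e-12)
        f = F.normalize(f, p=2, dim=-1)
        return f                                             # (batch, k, out_dim)
```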

3.1.5 Recurrent Question Generation

Given the concatenation of the attended visual and semantic vectors, $g$, and the fused image and answer representations, $\{f_i\}_{i=1}^{k}$, we deploy a two-layer gated recurrent unit (GRU) model for question generation, which generates the sequence of question words, $q = (w_1, \dots, w_T)$, recurrently. This module has three main components for each recurrent generation operation: the partial question encoder, the dynamic attention, and the question decoder.

Partial Question Encoder. At each recurrent step, a GRU encoder is deployed to encode the currently generated partial question, aiming to capture the long-term dependencies in the question word sequence for next word generation. Given the question word $w_{t-1}$ generated at the immediately previous recurrent step, we first embed $w_{t-1}$ into the same semantic embedding space as the answer cue with a pre-trained GloVe model [24], which produces an embedding vector $x_t$. Specifically, we perform the word embedding as follows:

$x_t = W_e\, e_{w_{t-1}} + b_e$  (8)

where $W_e$ is initialized from the word embedding matrix produced by GloVe, $b_e$ is the bias vector, and $e_{w_{t-1}}$ denotes a one-hot vector whose length is the vocabulary size and which has a single 1 at the entry corresponding to word $w_{t-1}$.

Then we update the recurrent question encoding state by using the concatenation of the word embedding $x_t$, the answer-attended visual and semantic features $g$, and the previous question decoding state $h^2_{t-1}$ as input:

$z_t = [\,x_t;\ g;\ h^2_{t-1}\,]$  (9)

$h^1_t = \mathrm{GRU}_{enc}\big(z_t,\; h^1_{t-1} + a\big)$  (10)

Note that we further emphasize the answer cue by adding it to the previous hidden state vector of the encoder GRU.
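
The partial question encoding step of Eqs. (8)-(10) could be sketched as follows; the hidden size and the assumption that the answer vector $a$ has the same dimensionality as the encoder state are illustrative.

```python
import torch
import torch.nn as nn

class PartialQuestionEncoder(nn.Module):
    """Encoder GRU step for the partial question (Eqs. (8)-(10)).

    A sketch under the reconstruction above: the previous decoder state
    h2_prev and the attended features g are concatenated with the word
    embedding, and the answer vector a is added to the previous encoder
    state. Sizes are illustrative assumptions.
    """
    def __init__(self, glove_weights, g_dim, h_dim=512):
        super().__init__()
        emb_dim = glove_weights.size(1)                        # 300 for GloVe
        self.embed = nn.Embedding.from_pretrained(glove_weights, freeze=False)
        self.gru_cell = nn.GRUCell(emb_dim + g_dim + h_dim, h_dim)

    def forward(self, prev_word, g, h2_prev, h1_prev, a):
        x_t = self.embed(prev_word)                            # Eq. (8)
        z_t = torch.cat([x_t, g, h2_prev], dim=-1)             # Eq. (9)
        h1_t = self.gru_cell(z_t, h1_prev + a)                 # Eq. (10)
        return h1_t
```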

Dynamic Attention. Given the current partial question encoding state, we use an attention module to dynamically assign higher weights to the important regional features that correspond to the current partial question context. Specifically, the attention weights are determined by the triplet of image, answer, and partial question encoder state. We compute them as follows:

$\beta_{t,i} = P_d^\top \tanh\big( W_f f_i + W_h h^1_t \big)$  (11)

where $W_f$, $W_h$, and $P_d$ are the model parameters. The attention weights are further normalized by a softmax function to compute the dynamically attended image features:

$\hat{f}_t = \sum_{i=1}^{k} \mathrm{softmax}(\beta_t)_i\, f_i$  (12)

Note that the attention weights are determined by two factors, $f_i$ and $h^1_t$. The former fuses the answer cue into each regional image representation vector via the MFB operation, while the latter dynamically reflects the current partial question context.

Question Decoder. In the top GRU decoding layer, which is referred to as the language GRU, we concatenate the dynamically attended image feature vector $\hat{f}_t$ with the output state of the GRU encoder $h^1_t$ to update the state of the recurrent GRU decoder:

$h^2_t = \mathrm{GRU}_{dec}\big([\,\hat{f}_t;\ h^1_t\,],\; h^2_{t-1}\big)$  (13)

where $h^2_{t-1}$ is the previous hidden state at step $t-1$, and it is set to zero when $t = 1$.

To predict the next-word probability distribution over the set of vocabulary words, we further pass the state vector of the GRU decoder through a fully connected layer and a softmax layer:

$p(w_t \mid w_{1:t-1}, I, A) = \mathrm{softmax}\big(W_o h^2_t + b_o\big)$  (14)

where $W_o$ and $b_o$ are the model parameters. With beam size 1, the next ($t$-th) word can then be generated as:

$w_t = \arg\max_{w \in \mathrm{VOC}}\; p(w \mid w_{1:t-1}, I, A)$  (15)

where VOC denotes the vocabulary table.
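
A sketch of one full generation step covering the dynamic attention and the language GRU of Eqs. (11)-(15), with greedy (beam size 1) word selection; dimensions and layer names are assumptions.

```python
import torch
import torch.nn as nn

class QuestionDecoderStep(nn.Module):
    """One recurrent generation step: dynamic attention + language GRU.

    A sketch of Eqs. (11)-(15) under the reconstruction above; the hidden
    size and the greedy word selection are assumptions/simplifications.
    """
    def __init__(self, f_dim, h_dim, vocab_size):
        super().__init__()
        self.W_f = nn.Linear(f_dim, h_dim, bias=False)
        self.W_h = nn.Linear(h_dim, h_dim, bias=False)
        self.P_d = nn.Linear(h_dim, 1, bias=False)
        self.gru_cell = nn.GRUCell(f_dim + h_dim, h_dim)
        self.out = nn.Linear(h_dim, vocab_size)

    def forward(self, f, h1_t, h2_prev):
        # f: (batch, k, f_dim) answer-fused regional features from MFB
        # h1_t: (batch, h_dim) encoder state; h2_prev: previous decoder state
        scores = self.P_d(torch.tanh(self.W_f(f) + self.W_h(h1_t).unsqueeze(1)))
        weights = torch.softmax(scores.squeeze(-1), dim=-1)              # Eq. (11)
        f_hat = (weights.unsqueeze(-1) * f).sum(dim=1)                   # Eq. (12)
        h2_t = self.gru_cell(torch.cat([f_hat, h1_t], dim=-1), h2_prev)  # Eq. (13)
        logits = self.out(h2_t)                                          # Eq. (14), pre-softmax
        next_word = logits.argmax(dim=-1)                                # Eq. (15), greedy
        return next_word, h2_t, logits
```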

3.2 Training and Inference

We perform end-to-end deep training over a set of $N$ training instances. The $i$-th instance is a triplet $(I^{(i)}, A^{(i)}, q^{(i)})$ that includes an image, an answer phrase, and a ground-truth question. We train the proposed model by minimizing the cross-entropy loss:

$\mathcal{L}(\theta) = -\sum_{i=1}^{N} \sum_{t} \log p\big(w^{(i)}_t \mid w^{(i)}_{1:t-1},\, I^{(i)},\, A^{(i)};\, \theta\big)$  (16)

where $\theta$ represents all the model parameters and each conditional probability is determined from Eq. (14).
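
A minimal sketch of how the loss of Eq. (16) could be computed for a batch, assuming the decoder logits are collected over all time steps and padded positions are masked out:

```python
import torch
import torch.nn.functional as F

def question_loss(logits, target_words, pad_idx=0):
    """Cross-entropy loss of Eq. (16) for one batch of questions.

    logits: (batch, T, vocab) decoder outputs collected over T steps.
    target_words: (batch, T) ground-truth word indices, zero-padded.
    Treating index 0 as padding is an assumption from the pre-processing.
    """
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),   # (batch*T, vocab)
        target_words.reshape(-1),              # (batch*T,)
        ignore_index=pad_idx,                  # skip padded positions
        reduction="sum",
    )
    return loss
```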

During the test phase, given the trained model and an image-answer pair, we recurrently predict the probability distribution of the next word in the question sequence over our vocabulary using Eq. (14). We use a beam size of 1 (greedy search) to select the most probable word as the output, as in Eq. (15). The question generation terminates after generating the question mark.

Method BLEU-1 BLEU-2 BLEU-3 BLEU-4 METEOR ROUGE-L CIDEr
SAT [19] 0.417 0.311 0.241 0.192 0.195 0.456 1.533
iVQA+VAE [18] 0.421 0.320 0.253 0.205 0.201 0.466 1.682
iVQA [19] 0.430 0.326 0.256 0.208 0.205 0.468 1.714
Ours 0.452 0.339 0.262 0.207 0.209 0.472 1.673
Table 1: Comparison results between our proposed iVQA model and the other three comparison models on the popular Karpathy test split. The numbers in bold denote the highest scores.

4 Experiments

In this section we present our empirical study and report both quantitative and qualitative results.

4.1 Experimental Setting

Dataset. We used the VQA V1 dataset [2] to conduct experiments and evaluate the proposed approach. This dataset was initially created for visual question answering and was later repurposed for iVQA [19]. The images are collected from the MS COCO dataset [5]. The dataset is divided into three parts: a training set, a validation set, and a testing set. The training set consists of 82,783 images, while the validation and testing sets include 40,504 and 81,434 images respectively. The answers, however, are not available for the testing set. Therefore, we followed [19, 1, 11] in using the Karpathy data split, obtaining 5,000 images for testing and 5,000 images for validation. Following [19], we use three question-answer pairs for each image. This results in about 250k instances ((image, answer, question) triplets) for training and 15k instances for testing.

Implementation Details. Before conducting experiments, we first performed pre-processing on the data. We follow [20] to extract the 3,000 most frequent answers from the training set and then build the vocabulary from the questions corresponding to these answers. This leads to a vocabulary of about 12,900 words. The answers and questions are tokenized into lists of words. We standardized the lengths of questions and answers to 19 and 3 words respectively, padding shorter sequences with zeros and trimming longer ones. To extract the visual and semantic information, we use the publicly available bottom-up attention model, Faster R-CNN, from [1]. We use this model to extract the top-$k$ salient regions with their visual features $V$ and the corresponding semantic information $S$.
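
A small sketch of the padding/trimming step described above; the pad and unknown-word index conventions are assumptions.

```python
def pad_or_trim(tokens, max_len, word_to_idx, pad_idx=0, unk_idx=1):
    """Convert a tokenized question/answer into a fixed-length index list.

    A sketch of the pre-processing described above (19 tokens for questions,
    3 for answers); the pad/unk index conventions are assumptions.
    """
    idxs = [word_to_idx.get(w, unk_idx) for w in tokens[:max_len]]  # trim
    idxs += [pad_idx] * (max_len - len(idxs))                       # zero-pad
    return idxs

# Hypothetical usage:
# q_ids = pad_or_trim("what color is the sign ?".split(), 19, vocab)
# a_ids = pad_or_trim("green".split(), 3, vocab)
```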

We performed training using one GPU with 16 GB of memory, feeding the data in batches of 1000 instances. We use a fixed number of hidden units for each GRU cell and for each attention module. We trained our model with the back-propagation algorithm using the Adam [13] optimizer. During training, we set an initial learning rate, decreased it after the fifth epoch, and stopped training at the 14th epoch.

Model Evaluation. We evaluate our model using the standard evaluation metrics BLEU, METEOR, ROUGE-L, and CIDEr (computed with the coco-caption toolkit: https://github.com/tylin/coco-caption). These metrics are used extensively for image captioning evaluation [5] and have also been used for iVQA evaluation [19].
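
A sketch of how these metrics could be computed with the coco-caption toolkit linked above, assuming its pycocoevalcap package is installed; the dictionary format (instance id mapped to a list of question strings) follows that toolkit's convention.

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

def evaluate(gts, res):
    """gts/res: dicts mapping an instance id to a list of reference /
    generated question strings (already tokenized and lower-cased)."""
    results = {}
    for name, scorer in [("BLEU", Bleu(4)), ("METEOR", Meteor()),
                         ("ROUGE-L", Rouge()), ("CIDEr", Cider())]:
        score, _ = scorer.compute_score(gts, res)
        results[name] = score   # BLEU returns a list of BLEU-1..4 scores
    return results
```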

4.2 Quantitative Results

To validate the effectiveness of the proposed model, we compare its performance with one baseline and two iVQA competitors [19, 18]. The baseline is the powerful Show, Attend and Tell (SAT) model [28], which was mainly developed for the task of image captioning and was modified by [19] to take a given answer cue as input. This model is comparable to ours in its use of dynamic attention to focus on different salient regions at each time step. The competitors are the original iVQA method [19] and its modified version (iVQA+VAE) [18], which incorporates a variational autoencoder (VAE) to increase question diversity. We report the comparison results in Table 1.

From the comparison scores, we can see that the proposed model, with its multi-level attentions and semantically enhanced visual features, outperforms the baseline SAT by remarkable margins. This suggests that the iVQA task presents different challenges from the caption generation task, and that a state-of-the-art image captioning model can fail to produce effective results on iVQA even after adaptation. Moreover, the proposed model also outperforms the state-of-the-art iVQA models, iVQA and iVQA+VAE, in terms of most of the linguistic evaluation metrics, including BLEU-1, BLEU-2, BLEU-3, METEOR, and ROUGE-L. The results are also achieved under an inference setting that favors the competitors: the two competitors used a beam search of size 3, while we only used a beam search of size 1. These impressive results demonstrate the efficacy of the proposed multi-level attention model.

Method BLEU-1 BLEU-2 BLEU-3 BLEU-4 METEOR ROUGE-L CIDEr
Variant w/o {S, att_0} 0.450 0.336 0.259 0.204 0.208 0.467 1.649
Full Model 0.452 0.339 0.262 0.207 0.209 0.472 1.673
Table 2: Performance comparison of our main model variants.

4.3 Ablation Study

The proposed model incorporates additional semantic information based on the object and attribute labels: it uses the semantic features $S$ to produce the enhanced visual features $M$ and to obtain the attended semantic features $\hat{s}$ through the dual guiding attention. To investigate whether and how much this semantic information contributes to the performance gain of the proposed model, we create a variant of the proposed model by dropping the semantic information and the dual guiding attention features (the variant denoted w/o {S, att_0} in Table 2). We compare this variant with the full model and report the results in Table 2. We can see that the performance of the variant model is comparable to that of the state-of-the-art iVQA model [19]; it actually outperforms the iVQA model in terms of BLEU-1, BLEU-2, BLEU-3 and METEOR. This validates that the visual features extracted from object-level salient regions are more suitable for the iVQA task than the visual features extracted with conventional deep networks. Nevertheless, the full model still beats the variant in terms of all evaluation metrics. This suggests that the use of semantic information and dual guiding attention is beneficial for iVQA. Enhancing the visual features with semantic information that shares the textual modality of the answer and the generated partial question helps our model precisely attend to the corresponding objects, and consequently improves the overall performance.

(a) Answer: green. Generated question: “What color is the sign”.
(b) Answer: red. Generated question: “What color is the hydrant”.
(c) Answer: turban. Generated question: “What is the man wearing on his head”.
(d) Answer: hay. Generated question: “what is the horse eating”.
Figure 3: Examples of question generation process by our model with corresponding dynamic attention visualization. The red and white bounding boxes localize regions with the largest and second largest attention weights respectively. White regions have very low attention weights.

4.4 Qualitative Analysis

To illustrate the function of our model’s attention mechanism in generating corresponding questions for given image-answer pairs, we demonstrate a few examples of the question generation process of our proposed approach with dynamic attention visualization in Figure 3. The red box marks the salient region with the largest attention weight, while the white box marks the salient region with the second largest attention weight, according to Eq.(11).

For the first two examples, we used the same image but with two different answers belonging to the same category (color) to check the effectiveness of our proposed multi-level attention mechanism in capturing fine-grained visual information based on a given answer. We can see that the proposed model generates very suitable corresponding questions for the given answers. Importantly, through our attention mechanism, the answer cue guides the model to attend to the exact corresponding objects (the green sign in the first example and the red hydrant in the second).

In the third example, we used a more complex answer, specifically “turban”. We notice that our model can semantically understand that a turban is an object that people wear on their heads. Consequently, we can see that our model tends to ignore most parts of the image and only focuses on the objects corresponding to the currently predicted word (e.g., “man”, “wearing”, “head”). Similarly, in the last example, the model focuses on the horse while generating the word “horse” and on the horse's head and mouth while generating the word “eating”. These examples demonstrate that our approach has the capacity to correlate a given answer with specific corresponding regions in a given image. Meanwhile, it also has the capacity to dynamically change its attention according to the previously generated word. These results again validate the effectiveness of the proposed model.

Another issue worth mentioning is that, for a given image-answer pair, there could intuitively be multiple reasonable questions. For example, in the first example in Figure 3, given the answer “green”, one could ask “What color are the tree leaves”. Although this question makes sense, it misses the focus of the image, while our model generates the question focusing on the more salient region.

Nevertheless, given an image and a very short answer phrase, a human can ask multiple questions of similar quality. To examine the capacity of the proposed model to generate diverse questions for the same image-answer pair, we used a beam size of 4 to perform inference and present the top 3 questions with the highest probability scores. Figure 4 presents two examples. We can see that a short answer phrase admits many possibilities for generating corresponding questions. The top questions generated by our model all appear very reasonable while focusing on the salient objects.
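
A generic beam-search sketch illustrating how the top-scoring questions could be selected; the step function `step_fn` is a hypothetical interface to the decoder, not part of the paper.

```python
import heapq

def beam_search(step_fn, start_state, beam_size=4, max_len=19, top_n=3):
    """Generic beam search sketch for selecting the top-n questions.

    step_fn(prefix, state) is a hypothetical callable returning a list of
    (log_prob, word, new_state) continuations for the current prefix;
    generation stops when the question mark token "?" is produced.
    """
    beams = [(0.0, [], start_state)]        # (cumulative log-prob, words, state)
    finished = []
    for _ in range(max_len):
        candidates = []
        for logp, words, state in beams:
            for w_logp, word, new_state in step_fn(words, state):
                cand = (logp + w_logp, words + [word], new_state)
                (finished if word == "?" else candidates).append(cand)
        if not candidates:
            break
        beams = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    finished.extend(beams)
    best = heapq.nlargest(top_n, finished, key=lambda c: c[0])
    return [" ".join(words) for _, words, _ in best]
```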

(a) Answer: yes

Is this a dessert?
Is this a cake?
Is there a cake?
(b) Answer: chinese

What type of cuisine is this?
What kind of cuisine is this?
What type of food is this?
Figure 4: Examples of the diverse questions generated by our proposed model. Questions in bold refer to the top questions selected by our model.

5 Conclusion

In this paper, we proposed a novel multi-level attention model to tackle the new task of iVQA, which offers a broader capacity for interactive visual and textual understanding than VQA. Different from existing iVQA approaches, we enhance the visual representations extracted at the object level with their corresponding semantic information to reduce the semantic gap between the visual and textual modalities. In addition, we deploy a dual guiding attention to help the model focus on the parts corresponding to a given answer, and use a dynamic attention to attend to the most relevant image visual features in each recurrent question generation context. We conducted experiments on the VQA V1 dataset to compare the proposed model with existing iVQA methods. The results show the effectiveness of the proposed approach by demonstrating state-of-the-art performance.

References

  • [1] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang (2018) Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, Cited by: §1, §1, §2, §4.1, §4.1.
  • [2] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh (2015) VQA: Visual Question Answering. In ICCV, Cited by: §1, §4.1.
  • [3] H. Ben-younes, R. Cadène, M. Cord, and N. Thome (2017) MUTAN: multimodal tucker fusion for visual question answering. In ICCV, Cited by: §1.
  • [4] K. Chen, J. Wang, L. Chen, H. Gao, W. Xu, and R. Nevatia (2015) ABC-CNN: an attention based convolutional neural network for visual question answering. CoRR abs/1511.05960. Cited by: §2.
  • [5] X. Chen, H. Fang, T. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick (2015) Microsoft COCO captions: data collection and evaluation server. CoRR abs/1504.00325. Cited by: §4.1, §4.1.
  • [6] K. Cho, B. van Merrienboer, Ç. Gülçehre, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, Cited by: §3.1.1.
  • [7] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh (2017) Making the V in VQA matter: elevating the role of image understanding in visual question answering. In CVPR, Cited by: §1, §2.
  • [8] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §3.1.2.
  • [9] U. Jain, Z. Zhang, and A. G. Schwing (2017) Creativity: generating diverse questions using variational autoencoders. In CVPR, Cited by: §1, §2.
  • [10] J. Johnson, A. Karpathy, and F. Li (2016) DenseCap: fully convolutional localization networks for dense captioning. In CVPR, Cited by: §2.
  • [11] A. Karpathy and L. Fei-Fei (2015) Deep visual-semantic alignments for generating image descriptions. In CVPR, Cited by: §4.1.
  • [12] J. Kim, K. W. On, W. Lim, J. Kim, J. Ha, and B. Zhang (2017) Hadamard product for low-rank bilinear pooling. In ICLR, Cited by: §1, §2.
  • [13] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In ICLR, Cited by: §4.1.
  • [14] D. P. Kingma and M. Welling (2014) Auto-encoding variational Bayes. In ICLR, Cited by: §2.
  • [15] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. Li, D. A. Shamma, M. S. Bernstein, and L. Fei-Fei (2016) Visual genome: connecting language and vision using crowdsourced dense image annotations. Inter. Journal of Computer Vision 123, pp. 32–73. Cited by: §3.1.2.
  • [16] Y. Li, N. Duan, B. Zhou, X. Chu, W. Ouyang, and X. Wang (2018) Visual question generation as dual task of visual question answering. In CVPR, Cited by: §2.
  • [17] Y. Lin, Z. Pang, D. Wang, and Y. Zhuang (2018) Feature enhancement in attention for visual question answering. In IJCAI, Cited by: §1.
  • [18] F. Liu, T. Xiang, T. M. Hospedales, W. Yang, and C. Sun (2018) Inverse visual question answering: a new benchmark and VQA diagnosis tool. IEEE transactions on pattern analysis and machine intelligence. Cited by: §1, §1, §1, §2, §3.1.2, Table 1, §4.2.
  • [19] F. Liu, T. Xiang, T. M. Hospedales, W. Yang, and C. Sun (2018) iVQA: inverse visual question answering. In CVPR, Cited by: §1, §1, §1, §2, §3.1.2, Table 1, §4.1, §4.1, §4.2, §4.3.
  • [20] J. Lu, X. Lin, D. Batra, and D. Parikh (2015) Deeper LSTM and normalized CNN visual question answering model. GitHub repository. Cited by: §4.1.
  • [21] J. Lu, C. Xiong, D. Parikh, and R. Socher (2017) Knowing when to look: adaptive attention via a visual sentinel for image captioning. In CVPR, Cited by: §2.
  • [22] J. Lu, J. Yang, D. Batra, and D. Parikh (2016) Hierarchical question-image co-attention for visual question answering. In NIPS, Cited by: §2.
  • [23] N. Mostafazadeh, I. Misra, J. Devlin, L. Zitnick, M. Mitchell, X. He, and L. Vanderwende (2016) Generating natural questions about an image. In ACL, Cited by: §1, §2.
  • [24] J. Pennington, R. Socher, and C. D. Manning (2014) Glove: global vectors for word representation. In EMNLP, Cited by: §3.1.1, §3.1.2, §3.1.5.
  • [25] S. Ren, K. He, R. B. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 39, pp. 1137–1149. Cited by: §3.1.2.
  • [26] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet large scale visual recognition challenge. Inter. Journal of Computer Vision 115, pp. 211–252. Cited by: §3.1.2.
  • [27] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan (2015) Show and tell: a neural image caption generator. In CVPR, Cited by: §2.
  • [28] K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. In ICML, Cited by: §4.2.
  • [29] K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. In ICML, Cited by: §2.
  • [30] Z. Yang, X. He, J. Gao, L. Deng, and A. J. Smola (2016) Stacked attention networks for image question answering. In CVPR, Cited by: §2.
  • [31] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo (2016) Image captioning with semantic attention. In CVPR, Cited by: §2.
  • [32] D. Yu, J. Fu, T. Mei, and Y. Rui (2017) Multi-level attention networks for visual question answering. In CVPR, Cited by: §2.
  • [33] Z. Yu, J. Yu, J. Fan, and D. Tao (2017) Multi-modal factorized bilinear pooling with co-attention learning for visual question answering. In ICCV, Cited by: §2, §3.1.4, §3.1.
  • [34] S. Zhang, L. Qu, S. You, Z. Yang, and J. Zhang (2017) Automatic generation of grounded visual questions. In IJCAI, Cited by: §1, §2.
  • [35] B. Zhou, Y. Tian, S. Sukhbaatar, A. Szlam, and R. Fergus (2015) Simple baseline for visual question answering. CoRR abs/1512.02167. Cited by: §2.