Faithful Multimodal Explanation for Visual Question Answering

09/08/2018 ∙ by Jialin Wu, et al. ∙ The University of Texas at Austin

AI systems' ability to explain their reasoning is critical to their utility and trustworthiness. Deep neural networks have enabled significant progress on many challenging problems such as visual question answering (VQA). However, most of them are opaque black boxes with limited explanatory capability. This paper presents a novel approach to developing a high-performing VQA system that can elucidate its answers with integrated textual and visual explanations that faithfully reflect important aspects of its underlying reasoning while capturing the style of comprehensible human explanations. Extensive experimental evaluation demonstrates the advantages of this approach compared to competing methods with both automatic evaluation metrics and human evaluation metrics.





Deep neural networks have made significant progress on visual question answering (VQA), the challenging AI problem of answering natural-language questions about an image [Antol et al.2015]. However, successful systems based on deep neural networks are difficult to comprehend because of their many layers of abstraction and large number of interconnected parameters. This makes it hard to develop user trust. Partly due to the opacity of current deep models, there has been a recent resurgence of interest in explainable AI: systems that can effectively explain their reasoning to human users. In particular, there has been some recent development of explainable VQA systems [Selvaraju et al.2017, Park et al.2018, Hendricks et al.2016].

Figure 1: Example of our multimodal explanation. It highlights relevant image regions together with a textual explanation with corresponding words in the same color.

One approach to explainable VQA is to generate visual explanations, which highlight image regions that most contributed to the system’s answer, as determined by attention mechanisms [Lu et al.2016] and gradient analysis [Selvaraju et al.2017]. However, such simple visualizations do not explain how these regions support the answer. An alternate approach is to generate a textual explanation, a natural-language sentence that provides reasons for the answer. Some recent work has generated textual explanations for VQA by training a recurrent neural network (RNN) on examples of human explanations [Hendricks et al.2016]. A multimodal approach that integrates both a visual and textual explanation provides the advantages of both. Words and phrases in the text can point to relevant regions in the image. An illustrative multimodal explanation generated by our system is shown in Figure 1.

Some initial work on such multimodal VQA explanation is presented in [Park et al.2018] that employs a form of “post hoc rationalization” that does not truly reflect the system’s actual processing. First, a textual explanation is generated using an independent RNN that is trained on human explanations. Next, a subsequent “clean-up” step tries to connect phrases in this post-hoc explanation to actual detected objects in the image, removing phrases that cannot be properly grounded in the image. We believe that explanations should more faithfully reflect the actual processing of the underlying system in order to provide users with a deeper understanding of the system that increases trust for the right reasons, rather than simply trying to convince them of the system’s reliability [Bilgic and Mooney2005].

Several recent VQA models incorporate a Bottom-Up-Top-Down (BUTD) attention mechanism [Anderson et al.2018] that uses attention over high-level detected objects instead of low-level image features. Such detections provide high-level concepts that can form part of more faithful explanations. We have developed a novel multimodal explanation system where the visual part incorporates semantic image segmentation features. Similar to [Park et al.2018], we train an RNN to produce human-like textual explanations. However, our approach is more faithful in that the generated explanations are strongly biased to include terms (words and phrases) from semantic image segments that are highly attended to by the VQA system when computing the answer. This also provides direct links between terms in the textual explanation and segmented items in the image, as shown in Figure 1. The result is a synthesis of a faithful explanation that highlights concepts actually used to compute the answer and a comprehensible, human-like, linguistic explanation. Below we describe the details of our approach and present extensive experimental results on the VQA-X [Park et al.2018] dataset that demonstrate the advantages of our approach compared to prior work on this data [Park et al.2018] in terms of both automatic metrics and human evaluation.

Figure 2: Overview of our explanation model: (a) shows the input image; (b) illustrates the segmented instances in the image; (c) shows the highly attended segments in the VQA process; (d) shows sample detected phrases from phrase detection module; and (e) depicts the visual-textual correspondence generated in the explanation module.

Related Work

In this section, we review related work including visual and textual explanation generation and VQA.


Answering visual questions [Antol et al.2015] has been widely investigated in both the NLP and computer vision communities. Most VQA models [Fukui et al.2016, Lu et al.2016] jointly embed images’ CNN features and questions’ RNN features and then train an answer classifier to predict answers from a pre-extracted answer set. Attention mechanisms are frequently applied to recognize important visual features and filter out irrelevant parts. A recent advance is the use of the Bottom-Up-Top-Down (BUTD) attention mechanism [Anderson et al.2018] that attends over high-level objects instead of convolutional features to avoid emphasis on irrelevant portions of the image. We adopt this mechanism, but replace object detection [Ren et al.2015] with segmentation [Hu et al.2018] to obtain more precise object boundaries.

Visual Explanation

A number of approaches have been proposed to visually explain decisions made by vision systems by highlighting relevant image regions. GradCam [Selvaraju et al.2017] analyzes the gradient space to find visual regions that most affect the decision. Attention mechanisms in VQA models can also be directly used to determine highly-attended regions and generate visual explanations. Unlike conventional visual explanations, ours highlight segmented objects that are linked to words in an accompanying textual explanation, thereby focusing on more precise regions and filtering out noisy attention weights.

Textual and Multimodal Explanation

Visual explanations highlight key image regions behind the decision; however, they do not explain the reasoning process and crucial relationships between the highlighted regions. Therefore, there has been some work on generating textual explanations for decisions made by visual classifiers [Hendricks et al.2016]. As mentioned in the introduction, there has also been some work on multimodal explanations that link textual and visual explanations [Park et al.2018]. A recent extension of this work [Hendricks et al.2018] first generates multiple textual explanations and then filters out those that could not be grounded in the image. We argue that a good explanation should directly focus its explanation on covering the visual detections that influenced the system’s decision, therefore generating more faithful explanations.


Our goal is to generate more faithful multimodal VQA explanations that specifically include the segmented objects in the image that are the focus of the VQA system. Figure 2 illustrates our model’s pipeline consisting of instance segmentation, phrase detection, answer prediction, and textual explanation generation. We first segment the objects in the image and then train an answer prediction module equipped with an attention mechanism over the semantically segmented objects. We also detect possible relevant phrases for the explanation with a phrase detection module. Finally, the explanation module generates textual explanations based on the question, answer, attended image segments in the VQA process, and detected phrases. At each step, an RNN in the explanation module determines whether the next word or phrase should be based on attended image content or linguistic content learned from human explanations.


We use $f_n$ to denote the output of the $n$-th layer of the fully-connected layers of the neural network, and omit $n$ if $n = 1$. These fc layers do not share parameters. We notate the sigmoid functions as $\sigma$. The subscript $i$ indexes either elements of the segmented object sets from images or phrases in the detected phrase sets. Bold letters denote vectors, overlining $\overline{\,\cdot\,}$ denotes averaging and $[\cdot, \cdot]$ denotes concatenation.

VQA Module

We base the VQA module on BUTD [Anderson et al.2018] with some simplifications. First, we replace the two-branch gated tanh answer classifier with single-branch fc layers with Leaky ReLU activation [Maas, Hannun, and Ng2013]. In order to ground the explanations in more precise visual regions, we adopt instance segmentation [Hu et al.2018] to segment out objects in over 3,000 common categories instead of the bounding-box detection used in the original BUTD model. Specifically, we extract a bounded number of the top objects in terms of detection scores and concatenate each object’s fc6 features in the bounding box classification branch and mask_fcn[1-4] features in the mask generation branch to form a 2048-d vector. This results in an image feature set V containing one 2048-d vector per extracted object for each image.

Like [Wu, Hu, and Mooney2018], we embed the questions using a standard single-layer GRU [Cho et al.2014] into a vector q. An attention mechanism over all the image segments, taking q and V as input, is introduced in Eq. 1 to weight each instance in the image feature set V in order to compute the question-attended image feature set in Eq. 2:


Note that we adopt a sigmoid instead of the softmax in previous works [Anderson et al.2018, Wu, Hu, and Mooney2018] since there could be multiple objects that contribute to the answers.
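The effect of this choice can be illustrated with a small numeric sketch. It is not the paper's implementation: the logits below are made-up relevance scores for four hypothetical segments, and the helper names are ours. It only shows that softmax forces the weights to compete, while sigmoid lets several segments receive a high weight at once.

```python
import math

def softmax(scores):
    """Normalize scores so they sum to 1 (segments compete for attention)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(scores):
    """Squash each score independently into (0, 1)."""
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]

# Hypothetical attention logits: two segments are equally relevant.
logits = [4.0, 4.0, -3.0, -3.0]
soft = softmax(logits)   # each relevant segment gets just under 0.5
sig = sigmoid(logits)    # each relevant segment gets a weight close to 1
```

With softmax, adding more relevant segments dilutes each weight further; with sigmoid, each weight depends only on that segment's own score.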

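For answer prediction, we feed the joint representation h of the question and image to the answer classifier. As shown in Eq. 3, h is computed as the element-wise multiplication of the embedding of the average of the question-attended image features $\mathbf{v}^{q}$ and the embedding of the question features q:

$\mathbf{h} = f(\overline{\mathbf{v}^{q}}) \odot f(\mathbf{q})$ (3)
$\mathbf{y} = \sigma(f(\mathbf{h}))$ (4)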

We frame answer prediction as multi-label classification [Anderson et al.2018, Wu, Hu, and Mooney2018], where we use soft scores [Antol et al.2015] as labels to supervise the sigmoid-normalized predictions (Eq. 4) via cross-entropy loss as shown in Eq. 5:


where the index runs over the candidate answers, and the targets are the aforementioned soft answer scores. The soft scores allow modeling the confidences of each of the feasible answers annotated by humans, in line with the VQA evaluation metric. We only extract answer candidates that appear more than a minimum number of times as possible answers in the VQAv2 [Antol et al.2015] training set.
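As a rough sketch of this kind of soft-label supervision (not the authors' code), the snippet below builds a soft score with the standard VQA accuracy rule, min(#agreeing annotators / 3, 1), and evaluates a binary cross-entropy against it. That the paper's soft scores follow exactly this rule is an assumption, and the function names are hypothetical.

```python
import math

def soft_score(num_annotators_agreeing):
    """VQA-style soft score: full credit once at least 3 annotators agree."""
    return min(num_annotators_agreeing / 3.0, 1.0)

def soft_bce(predictions, targets, eps=1e-12):
    """Binary cross-entropy summed over candidate answers, with soft targets."""
    loss = 0.0
    for p, s in zip(predictions, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        loss -= s * math.log(p) + (1.0 - s) * math.log(1.0 - p)
    return loss

# Two candidate answers: one given by 2 annotators, one by none.
targets = [soft_score(2), soft_score(0)]
loss = soft_bce([0.6, 0.1], targets)
```

For a single answer, the loss is minimized when the predicted probability equals the soft score, which is why soft targets let the model express partial confidence.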

VQS pretraining. In order to provide supervision for where the VQA module should attend, we pre-train the VQA model on the training set of the VQS [Gan et al.2017] data, a subset of the VQAv2 dataset in which the key segmentations that are relevant to the answers are annotated. In particular, we pretrain the model on the entire VQAv2 training set and provide segmentation supervision when it exists (i.e., when the training example is in the VQS dataset). Because the annotation in the VQS dataset only covers the 80 categories from COCO [Lin et al.2014], we match the segmentation supervision to the pre-extracted segmentations in our model by assigning label 1 when the IoU between a pre-extracted segmentation and a segmentation in the VQS dataset is over 0.5. Next, in addition to the general VQA loss, we also add an attention loss that minimizes the cross entropy between the attention weights and these labels.
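The IoU-based label assignment can be sketched as follows. This is an illustration under simplifying assumptions: masks are represented as sets of (row, column) pixel coordinates, and `rect`, `mask_iou` and `attention_labels` are hypothetical helper names, not part of any released code.

```python
def rect(top, left, bottom, right):
    """Build a toy rectangular mask as a set of (row, col) pixels."""
    return {(r, c) for r in range(top, bottom) for c in range(left, right)}

def mask_iou(mask_a, mask_b):
    """Intersection over union of two pixel sets."""
    union = mask_a | mask_b
    if not union:
        return 0.0
    return len(mask_a & mask_b) / len(union)

def attention_labels(extracted_masks, annotated_masks, threshold=0.5):
    """Label 1 for each extracted mask that overlaps an annotated mask with IoU > threshold."""
    return [
        1 if any(mask_iou(m, g) > threshold for g in annotated_masks) else 0
        for m in extracted_masks
    ]

extracted = [rect(0, 0, 4, 3), rect(10, 10, 12, 12)]
annotated = [rect(0, 0, 4, 4)]
labels = attention_labels(extracted, annotated)
```

Here the first extracted mask covers 12 of the 16 annotated pixels (IoU 0.75) and is labeled 1, while the second does not overlap and is labeled 0.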

Question and Answer Embedding

As suggested in [Park et al.2018], we also encode questions and answers as input features to the explanation module. However, directly encoding the answer vectors can introduce extra noise to the explanation module since the predicted scores of correct answers in testing are usually much lower than during training. Therefore, to adjust for the difference in answer probabilities between training and testing, we instead regard the normalized answer prediction output as a multinomial distribution, and sample one answer from this distribution at each time step. We further re-embed it as a one-hot vector $\mathbf{a}_s$, as shown in Eq. 6:

$\mathbf{a}_s = \text{one-hot}(\text{multinomial}(\mathbf{y}))$ (6)

where $\text{one-hot}(\cdot)$ denotes one-hot embedding.
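A minimal sketch of this sampling step, using only the Python standard library, is given below. The function name and the explicit normalization by the sum of scores are assumptions made for illustration; the paper does not specify how the prediction output is normalized.

```python
import random

def sample_one_hot(scores, rng):
    """Treat normalized scores as a multinomial distribution, sample one
    answer index, and return it as a one-hot vector."""
    total = sum(scores)
    probs = [s / total for s in scores]
    index = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    one_hot = [0.0] * len(probs)
    one_hot[index] = 1.0
    return one_hot

rng = random.Random(0)  # seeded only to make the sketch repeatable
# Hypothetical sigmoid outputs for three candidate answers.
answer = sample_one_hot([0.1, 0.6, 0.05], rng)
```

Because the one-hot vector has the same scale whether the sampled answer scored 0.9 or 0.3, the explanation module sees comparable inputs during training and testing.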

Next, we element-wise multiply the embedding of the question features q and sampled answers’ features with the question-attended image features to compute the joint representation u. Note that u faithfully represents the focus of the VQA process, in that it is derived from the VQA-attended image features.

Phrase Detection

Since our segmentation categories are mainly common nouns representing objects, the explanation generation process, if only based on these segmentations, will have to determine the activities (throwing, standing, etc.) and attributes (red car, tall building, etc.), which are significantly harder to learn. Therefore, to reduce these difficulties, we detect a set of significant phrases and determine if what they describe actually appears in the image. We extract frequently appearing n-grams from the human explanations in our data, with the constraint that the last word must be a common noun, to ensure it can describe an object. Specifically, we first part-of-speech tag [Kitaev and Klein2018] the ground truth explanations in the training set and extract all 2–5 grams ending in an NN or NNS. We remove phrases appearing fewer than 10 times, resulting in 1,735 candidate phrases. We then train a multilabel classifier to predict the phrases appearing in an explanation from the segmentation features of the image. The classifier consists of gated units with sigmoid attention over the visual features V, followed by two fully-connected layers, as shown in Eq. 8.
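The candidate-phrase extraction described above can be sketched as follows. The sketch assumes the explanations are already POS-tagged as lists of (word, tag) pairs and counts every occurrence of an n-gram; the function name and the toy sentence are ours, and the tagger itself is outside the sketch.

```python
from collections import Counter

def candidate_phrases(tagged_sentences, min_n=2, max_n=5, min_count=10):
    """Collect n-grams (min_n..max_n words) that end in a common noun
    (NN or NNS) and occur at least min_count times."""
    counts = Counter()
    for sentence in tagged_sentences:
        for end, (_, tag) in enumerate(sentence):
            if tag not in ("NN", "NNS"):
                continue
            for n in range(min_n, max_n + 1):
                start = end - n + 1
                if start < 0:
                    break  # longer n-grams would also run off the sentence
                words = [word for word, _ in sentence[start:end + 1]]
                counts[" ".join(words)] += 1
    return {phrase for phrase, count in counts.items() if count >= min_count}

sentence = [("he", "PRP"), ("is", "VBZ"), ("holding", "VBG"),
            ("a", "DT"), ("bat", "NN")]
phrases = candidate_phrases([sentence] * 10)
```

On this toy input, the four n-grams ending in "bat" each appear 10 times and are kept; with only 9 copies of the sentence, none would pass the frequency cutoff.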

To ensure the accuracy of the detected phrases, we apply a post-selection process that keeps only the high-performance phrase detectors. Specifically, we first train the phrase detector on the training set using binary cross-entropy loss for 12 epochs with the Adam optimizer [Kingma and Ba2014]. Next, we run the detector on the validation set and regard phrases with an output score over 0.5 as detected phrases. Finally, we evaluate each phrase on the validation set and filter out those with an F1 score less than 0.7 to form our final phrase set.
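The post-selection step can be sketched as a per-phrase F1 filter. The data layout (one list of validation scores and one list of 0/1 labels per phrase) and the function names are assumptions for illustration only.

```python
def f1_score(predicted, actual):
    """F1 of binary predictions against binary labels (0.0 if no true positives)."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def select_phrases(scores_by_phrase, labels_by_phrase,
                   score_threshold=0.5, f1_threshold=0.7):
    """Keep phrases whose detector reaches at least f1_threshold on validation data."""
    kept = []
    for phrase, scores in scores_by_phrase.items():
        predicted = [s > score_threshold for s in scores]
        if f1_score(predicted, labels_by_phrase[phrase]) >= f1_threshold:
            kept.append(phrase)
    return kept

# Toy validation outputs for two hypothetical phrases over four images.
scores = {"a bat": [0.9, 0.8, 0.1, 0.2], "red car": [0.9, 0.1, 0.1, 0.9]}
labels = {"a bat": [1, 1, 0, 0], "red car": [0, 1, 1, 0]}
final_phrases = select_phrases(scores, labels)
```

In the toy data, the "a bat" detector is perfect and is kept, while the "red car" detector never fires on a positive image and is dropped.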

In the explanation module, we re-embed at most 10 phrases with detection scores over 0.5 as additional features for textual explanation generation. The re-embedded features are constructed by summing each word’s pretrained GloVe vectors [Pennington, Socher, and Manning2014] as shown in Eq. 9. This produces a phrase set P containing fixed-dimensional vectors in the same form as the visual features V. Therefore, we apply the same question and answer embedding procedure shown in Eq. 7 to produce a joint representation for the phrases.
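A minimal sketch of the phrase re-embedding follows. The 3-d vectors stand in for pretrained GloVe vectors and are made up; choosing the top phrases by detection score when more than 10 pass the threshold is also an assumption, since the text only states the limit.

```python
def select_detected_phrases(scores, threshold=0.5, max_phrases=10):
    """Keep at most max_phrases phrases whose detection score exceeds threshold."""
    kept = [(phrase, s) for phrase, s in scores.items() if s > threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return [phrase for phrase, _ in kept[:max_phrases]]

def embed_phrase(phrase, word_vectors):
    """Embed a phrase as the sum of its words' vectors."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for word in phrase.split():
        for k, x in enumerate(word_vectors[word]):
            total[k] += x
    return total

# Toy stand-ins for pretrained word vectors.
toy_vectors = {"baseball": [1.0, 0.0, 2.0], "player": [0.5, 1.0, -1.0]}
detected = select_detected_phrases({"baseball player": 0.9, "red car": 0.3})
phrase_features = [embed_phrase(p, toy_vectors) for p in detected]
```

Summing keeps the phrase embedding in the same vector space as single words, so phrases of different lengths yield features of the same size.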


This novel phrase detection technique provides several advantages. First, additional phrase features can provide more information to guide explanation generation by providing it direct access to embeddings of common phrases found in human explanations. Second, it can detect common, higher-level concepts useful in explanations by modeling complex compositional phrases (e.g. baseball player, holding a bat, etc.), thereby reducing the burden of discovering these complex concepts in the explanation module.

Explanation Generation

The explanation module consists of both a visual and a linguistic component, each containing a two-layer LSTM similar to BUTD [Anderson et al.2018]. The first layer produces an attention vector over the elements in either the image segmentation feature set V or the previously re-embedded phrase features P. The second layer learns a hidden representation for predicting the next word in the textual explanation from the features generated by the first layer.

Figure 3: Overview of the explanation module.

The first visual attention LSTM takes the concatenation of the language LSTM’s previous output, the average-pooled image features, and the previous word’s embedding as input and generates a hidden representation. Then, an attention mechanism re-weights the image features using the generated hidden representation as input, as shown in Eq. 10. For the detailed module structure, please refer to [Anderson et al.2018].


For the purpose of faithfully grounding the generated explanation in the image, we argue that the generator should be able to determine if the next word should be based on image content attended to by the VQA system or on learned linguistic content.

To achieve this, we introduce a “source identifier” to balance the total amount of attention paid to the visual features versus the recurrent hidden representation at each time step. In particular, given the output from the attention LSTM and the average-pooled features, we train a two-layer softmax network as demonstrated in Eq. 12 to produce a 2-d output that identifies which source the current generated word should be based on (one component for the output of the attention LSTM and the other for the attended image features; see footnote 1).

Footnote 1: We tried to directly use the source weights in the language LSTM’s hidden representation and found that applying them to the attention LSTM’s output works better. The reason is that directly constraining the language LSTM’s hidden representation makes the language LSTM forget the previously encoded content and prevents it from learning long term dependencies.


We use the following approach to obtain training labels for the source identifier. For visual features, we assign label 1 (indicating the use of attended visual information) when there exists a segmentation for which the cosine similarity between its category name’s GloVe representation and the current generated word’s GloVe representation is over 0.6. For phrase features, we assign label 1 when the language LSTM is generating the phrase. Given the labelled data, we train the source identifier using cross entropy loss as shown in Eq. 13.
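The labeling rule for visual features can be sketched as a cosine-similarity test. The 2-d vectors below are toy stand-ins for GloVe representations, and `visual_source_label` is a hypothetical name.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def visual_source_label(word_vector, category_vectors, threshold=0.6):
    """Label 1 if the generated word is close to any segment category name."""
    close = any(cosine(word_vector, c) > threshold for c in category_vectors)
    return 1 if close else 0

# Toy vectors: one segment category, one similar word, one unrelated word.
categories = [[1.0, 0.0]]
label_similar = visual_source_label([1.0, 0.1], categories)
label_unrelated = visual_source_label([0.0, 1.0], categories)
```

The threshold trades precision for coverage: a lower value labels more words as visually grounded, including some that only loosely match a category name.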


where the targets are the aforementioned labels.

Next, we concatenate the re-weighted attention LSTM output and attended image features with the output of the source identifier as the input for the language LSTM. For the detailed module structure of the language LSTM, please refer to [Anderson et al.2018].


The phrase module takes the phrase features as input and follows the same architecture, producing a hidden representation analogous to that of the visual module.

With hidden layers from both visual features and phrases, we model the next word’s conditional probability using Eq. 15, minimizing the cross entropy loss as shown in Eq. 16.


In addition to the standard cross entropy loss, we also introduce a cover loss to encourage the explanation module to cover all of the image segments attended to by the VQA process. Specifically, this loss function minimizes the KL divergence between the VQA attention and the normalized summation of attention of the explanation module over all time steps, as demonstrated in Eq. 17.



The loss involves a normalization term and the output of the source identifier at each time step. This encourages the generated explanation to faithfully include as many of the visual detections used by the VQA system as possible.
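One way such a cover loss could be computed is sketched below. Several details are assumptions rather than statements from the paper: each step's explanation attention is weighted by the source identifier output before summing, both attention vectors are normalized to sum to 1, and the divergence is taken as KL(VQA attention || explanation attention).

```python
import math

def normalize(weights):
    """Scale non-negative weights so they sum to 1 (assumes a positive total)."""
    total = sum(weights)
    return [w / total for w in weights]

def cover_loss(vqa_attention, step_attentions, source_weights, eps=1e-12):
    """KL divergence between VQA attention and the source-weighted,
    time-summed attention of the explanation module."""
    summed = [0.0] * len(vqa_attention)
    for attention, s in zip(step_attentions, source_weights):
        for i, a in enumerate(attention):
            summed[i] += s * a
    p = normalize(vqa_attention)
    q = normalize(summed)
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)

vqa = [0.8, 0.8, 0.0]  # VQA attends to the first two segments
covered = cover_loss(vqa, [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], [1.0, 1.0])
missed = cover_loss(vqa, [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]], [1.0, 1.0])
```

In this toy case the loss is zero when the explanation attends to both VQA-attended segments and large when it ignores one of them.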

Finally, when using the language LSTM to generate the final textual explanation, we use beam search with a beam size of 3.
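For readers unfamiliar with beam search, a generic sketch is given below. It is not the paper's decoder: `toy_model` is a hand-written next-token table standing in for the language LSTM, and scores are plain summed log-probabilities without length normalization.

```python
import math

def beam_search(next_log_probs, start, end, beam_size=3, max_len=10):
    """Keep the beam_size best partial sequences at each step and return
    the highest-scoring sequence found."""
    beams = [([start], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for token, log_p in next_log_probs(tokens).items():
                candidates.append((tokens + [token], score + log_p))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == end:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # sequences cut off at max_len
    return max(finished, key=lambda c: c[1])[0]

# Toy next-token probabilities keyed by the last token.
TOY = {"<s>": {"a": 0.6, "the": 0.4},
       "a": {"x": 0.5, "y": 0.5},
       "the": {"cat": 0.9, "dog": 0.1}}

def toy_model(tokens):
    dist = TOY.get(tokens[-1], {"</s>": 1.0})
    return {token: math.log(p) for token, p in dist.items()}

best = beam_search(toy_model, "<s>", "</s>", beam_size=3)
greedy = beam_search(toy_model, "<s>", "</s>", beam_size=1)
```

In the toy table, greedy decoding (beam size 1) commits to "a" and ends with probability 0.30, while a beam of 3 keeps "the" alive and finds "the cat" with probability 0.36.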


We pre-train the phrase detector for 12 epochs with human explanations on the VQA-X dataset [Park et al.2018], which contains 29,459 question-answer pairs, each associated with a human explanation. Meanwhile, the VQA module is pretrained on the entire VQAv2 training set combined with VQS annotation using the Adam optimizer [Kingma and Ba2014]. We then fine-tune the parameters in the VQA model and train the explanation module using the human explanations in the VQA-X dataset by minimizing the combined loss in Eq. 18, with the three loss weights set respectively to 1.0, 0.5 and 0.2. We ran the Adam optimizer for 25 epochs with a batch size of 128 explanations. The learning rate was initialized to 5e-4 and decayed by a factor of 0.8 every three epochs.
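The stated learning-rate schedule can be written as a one-line step decay. The sketch assumes zero-indexed epochs and that the decay is applied in discrete steps at epochs 3, 6, 9 and so on; the paper does not spell out either detail.

```python
def learning_rate(epoch, initial=5e-4, decay=0.8, step=3):
    """Step decay: multiply the rate by `decay` once every `step` epochs."""
    return initial * decay ** (epoch // step)

# Learning rate for each of the 25 training epochs.
schedule = [learning_rate(epoch) for epoch in range(25)]
```

Under these assumptions the rate stays at 5e-4 for epochs 0 to 2, drops to 4e-4 for epochs 3 to 5, and has been decayed eight times by the final epoch.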


Multimodal Explanation Generation

As a last step, we link words in the generated textual explanation to image segments in order to generate the final multimodal explanation. To determine which words to link, we extract all common nouns whose source identifier weight in Eq. 12 exceeds 0.5. We then link them to the segmented object with the highest attention weight (Eq. 10) when that corresponding output word was generated, but only if this weight is greater than 0.2 (see footnote 2).

Footnote 2: Since there are lots of duplicated segments, we assign a lower threshold.
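The linking rule can be sketched as follows. The inputs (per-word noun flags, source identifier weights and attention vectors) are assumed to be already available from the explanation module, and all names and values here are illustrative.

```python
def link_words(is_common_noun, source_weights, attentions,
               source_threshold=0.5, attention_threshold=0.2):
    """Map word positions to segment indices for common nouns that are
    visually sourced and strongly attend to a single segment."""
    links = {}
    for t, attention in enumerate(attentions):
        if not is_common_noun[t] or source_weights[t] <= source_threshold:
            continue
        best = max(range(len(attention)), key=lambda i: attention[i])
        if attention[best] > attention_threshold:
            links[t] = best
    return links

# Toy explanation: "the man holds a bat" with three candidate segments.
nouns = [False, True, False, False, True]
source = [0.1, 0.9, 0.2, 0.1, 0.7]
attentions = [[0.3, 0.3, 0.3], [0.7, 0.1, 0.2], [0.2, 0.2, 0.2],
              [0.1, 0.1, 0.1], [0.15, 0.1, 0.18]]
links = link_words(nouns, source, attentions)
```

In the toy example "man" (position 1) is linked to segment 0, while "bat" passes the source threshold but is left unlinked because no segment receives attention above 0.2.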

                                    Textual                                        Visual
Model                   GT answer   BLEU-4  METEOR  ROUGE-L  CIDEr  SPICE  Human   EMD
PJ-X [Park et al.2018]  No          19.5    18.2    43.7     71.3   15.1   -       -
ours                    No          24.9    19.8    47.4     88.8   17.9   -       2.41
PJ-X [Park et al.2018]  Yes         19.8    18.6    44.0     73.4   15.4   45.1    2.64
ours                    Yes         26.3    20.3    48.9     94.6   18.7   55.4    2.34
Table 1: Explanation evaluation results. The top half shows results when the system’s output explanations use the answers predicted in the VQA process, and the bottom half shows results when the explanations are conditioned on the ground truth answers.
Model                    BLEU-4  METEOR  ROUGE-L  CIDEr  SPICE
PJ-X [Park et al.2018]   19.8    18.6    44.0     73.4   15.4
BUTD-S                   22.5    19.0    45.2     78.2   16.8
… + phrase detection     24.7    19.0    47.3     86.4   17.3
… + source identifier    25.1    19.8    47.2     90.1   17.8
… + cover loss           26.3    20.3    48.9     94.6   18.7
Table 2: Ablation results that successively add in four key features of our model.

Experimental Evaluation

This section presents experimental results that evaluate both the textual and visual aspects of our multimodal explanations, including comparisons to competing methods and ablations that study the impact of the various components of our overall system.

Textual Explanation Evaluation

Automated Evaluation: Following [Park et al.2018], we first evaluate our textual explanations using automated metrics by comparing them to the gold-standard human explanations in VQA-X using standard sentence-comparison metrics: BLEU-4 [Papineni et al.2002], METEOR [Banerjee and Lavie2005], ROUGE-L [Lin2004], CIDEr [Vedantam, Lawrence Zitnick, and Parikh2015] and SPICE [Anderson et al.2016]. As shown in Table 1, we outperform the current state-of-the-art PJ-X model [Park et al.2018] on all automated metrics by a clear margin. This indicates that constructing explanations that more faithfully reflect the VQA process can actually generate explanations that match human explanations better than just training to directly match human explanations, possibly by avoiding over-fitting and focusing more on important aspects of the test image.

Human Evaluation: The first Amazon Mechanical Turk (AMT) evaluation asks human judges to directly compare our textual explanations with PJ-X [Park et al.2018] and human explanations. Specifically, following [Park et al.2018] for fair comparison, we choose 1,000 correctly answered questions from the 1,968 questions in the VQA-X test data and ask workers to rank the explanations produced by our model, the PJ-X model, and one of the three ground truth explanations (randomly ordered), allowing for ties. Each AMT HIT (Human Intelligence Task) contains 5 ranking tasks, where one is a “validation” item with an obviously correct answer. For quality control, we discard data from HITs where the validation item is answered incorrectly.

We report the percentage of tasks where an automatically generated explanation ranks higher than or ties the ground truth explanation. The AMT results in Table 1 show that our explanations match or exceed the judged quality of human explanations 55% of the time compared to 45% for the PJ-X explanations. We also found that 76.8% of our explanations are ranked as high or higher than PJ-X’s.

Our second AMT experiment asked workers to compare our explanation with all three of the ground truth human explanations. We randomly choose 1,000 correctly answered questions, showed judges our explanation and all 3 human ones (randomly ordered), and asked them to rank all 4 explanations, allowing for ties. Table 3 shows the number of our explanations assigned to each rank (from best to worst).

Rank1 Rank2 Rank3 Rank4
ours 294 228 179 299
Table 3: Rank comparing to 3 human explanations.

Our explanations are clearly competitive with human explanations: 29.4% of them rank first and only 29.9% rank last.

Ablation Study: In this section, we present ablation results evaluating three different aspects of our overall model: phrase detection, source identifier, and cover loss. We successively added these aspects to our model, in this order, and evaluated the overall quality of the resulting textual explanations using the automated metrics. As shown in Table 2, we first report the benefit of building our model on image segmentation rather than bounding-box detection by comparing our initial base model (labeled BUTD-S) to the PJ-X approach. Segmentations provide our system with higher-level concepts and more semantic information about the objects which are the focus of the VQA module, making our model both more faithful and easier to explain.

The effectiveness of phrase detection is also shown in Table 2: it improves all of the metrics except METEOR. It particularly improves BLEU-4 and CIDEr, where phrases directly take part in the evaluation metric. The source identifier helps the model by allowing it to choose which content to take from the attended image segments, and the cover loss encourages the system to include as many of these attended segments as possible. Both improve overall performance.

Visual and Multimodal Explanation Evaluation

Automated Evaluation: As in previous work [Selvaraju et al.2017, Park et al.2018], we first used Earth Mover Distance (EMD) [Pele and Werman2008] to compare the image regions highlighted in our explanation to image regions highlighted by human judges. In order to compare fairly to prior results, we resize all the images in the entire test split to 14×14 and adjust the segmentations in the images accordingly. Next, we sum the product of the attention values in Eq. 10 and the source identifier’s values over all time steps and assign the accumulated attention weight to each corresponding segmentation region. We then normalize the attention weights over the 14×14 resized images to sum to 1, and finally compute the EMD between the normalized attentions and the ground truth.
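The accumulation and normalization steps (everything before the EMD itself) can be sketched as below. Segments are represented as sets of (row, column) cells on the 14×14 grid, and weights of overlapping segments are simply added; both choices are assumptions, as is the function name.

```python
def attention_map(segment_cells, step_attentions, source_weights, size=14):
    """Accumulate source-weighted attention per segment over time, paint it
    onto a size x size grid, and normalize the grid to sum to 1."""
    weights = [0.0] * len(segment_cells)
    for attention, s in zip(step_attentions, source_weights):
        for i, a in enumerate(attention):
            weights[i] += s * a
    grid = [[0.0] * size for _ in range(size)]
    for cells, w in zip(segment_cells, weights):
        for row, col in cells:
            grid[row][col] += w
    total = sum(sum(row) for row in grid)
    if total > 0.0:
        grid = [[x / total for x in row] for row in grid]
    return grid

# Two toy segments and two generation steps.
segments = [{(0, 0), (0, 1)}, {(5, 5)}]
grid = attention_map(segments, [[0.5, 0.5], [1.0, 0.0]], [1.0, 0.5])
```

The resulting grid is a probability map over the 14×14 cells, which is the form needed to compare against a normalized human attention map.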

As shown in the Visual results in Table 1, our approach matches human attention maps more closely than the PJ-X approach. We attribute this improvement to the following reasons. First, we base our approach on detailed image segmentation which avoids the risk of focusing on background and is much more precise than bounding-box detection. Second, our visual explanation is focused by textual explanation where the segmented visual objects are linked to specific words in the textual explanation. As a result, we filter out many of the noisy attentions in a purely visual explanation method.

Human Evaluation: We also asked AMT workers to evaluate our final multimodal explanations that link words in the textual explanation directly to segments in the image. Specifically, we randomly selected 1,000 correctly answered questions and asked turkers “How well do the highlighted image regions support the answer to the question?” and provided them a Likert-scale set of possible answers: “Very supportive”, “Supportive”, “Neutral”, “Unsupportive” and “Completely unsupportive”. The second task was to evaluate the quality of the links between words and image regions in the explanations. We asked turkers “How well do the colored image segments highlight the appropriate regions for the corresponding colored words in the explanation?” with the Likert-scale choices: “Very Well”, “Well”, “Neutral”, “Not Well”, “Poorly”. We assign five questions to each AMT HIT, including one “validation” item to control the HIT’s quality.

Figure 4: Multimodal explanation evaluation results for relevance of highlighted regions (left) and quality of the links between explanation words and image regions (right).

Fig. 4 shows the results for both questions. For both tasks, over 75% of the evaluations are positive and over 45% of them are strongly positive. This indicates that our multimodal explanations provide good connections between visual explanations, textual explanations, and the VQA process. Fig. 5 presents some sample positively-rated multimodal explanations (see footnote 3).

Footnote 3: Please find more examples in the supplementary material.

Figure 5: Sample positively-rated explanations.

Faithfulness Evaluation

In this section, we try to quantitatively measure the faithfulness of our explanations, i.e. how well they actually reflect the underlying VQA system’s reasoning. First, we measured how many words in a generated explanation are actually linked to a visual segmentation in the image. Next, we measured the fraction of common nouns that are linked to a segmentation, counting only those nouns whose GloVe representation has a cosine similarity above 0.6 with that of at least one of the 3,000 segmentation categories.

We analyzed the explanations from 1,000 correctly answered questions from the VQA-X test data. On average, our model is able to link 1.6 words in an explanation to an image segment, and 81.7% of the common nouns are linked to visual segments. This indicates that, at least with respect to nouns that might conceivably reference observable objects, the explanations are faithfully making reference to visual detections that are actually utilized by the underlying VQA system.
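The two statistics can be computed as in the sketch below. Each explanation is represented as a list of per-word pairs (is the word a groundable common noun, is the word linked to a segment); this representation and the toy data are assumptions for illustration, not the paper's evaluation code.

```python
def faithfulness_stats(explanations):
    """Return (average linked words per explanation,
    fraction of groundable common nouns that are linked)."""
    total_links = sum(1 for e in explanations for _, linked in e if linked)
    groundable = sum(1 for e in explanations for noun, _ in e if noun)
    linked_nouns = sum(1 for e in explanations
                       for noun, linked in e if noun and linked)
    average_links = total_links / len(explanations)
    fraction = linked_nouns / groundable if groundable else 0.0
    return average_links, fraction

# Two toy explanations: (is_groundable_noun, is_linked) per word.
toy = [[(False, False), (True, True), (True, False)],
       [(True, True), (True, True)]]
average_links, linked_fraction = faithfulness_stats(toy)
```

On the toy data, 3 words are linked across 2 explanations (1.5 on average) and 3 of the 4 groundable nouns are linked (0.75).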

Conclusion and Future Work

This paper has presented a new approach to generating multimodal explanations for visual question answering systems that aims to more faithfully represent the reasoning of the underlying VQA system while maintaining the style of human explanations. The approach generates textual explanations with words linked to relevant image regions actually attended to by the underlying VQA system. Experimental evaluations of both the textual and visual aspects of the explanations using both automated metrics and crowdsourced human judgments were presented that demonstrate the advantages of this approach compared to a previously-published competing method. In the future, we would like to incorporate more information from the VQA network into the explanations. In particular, we would like to integrate the output of network dissection [Bau et al.2017] to allow recognizable concepts in the learned hidden-layer representations to be included in explanations. We would also like to move beyond objects and include attributes, relations, and activities actually detected and used by a VQA system in textual explanations.


This research was supported by the DARPA XAI program under a grant from AFRL. We would also like to thank Dong Huk Park and Lisa Hendricks for the VQA-X dataset.


  • [Anderson et al.2016] Anderson, P.; Fernando, B.; Johnson, M.; and Gould, S. 2016. Spice: Semantic Propositional image caption evaluation. In European Conference on Computer Vision, 382–398.
  • [Anderson et al.2018] Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; and Zhang, L. 2018. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 3, 6.
  • [Antol et al.2015] Antol, S.; Agrawal, A.; Lu, J.; Mitchell, M.; Batra, D.; Lawrence Zitnick, C.; and Parikh, D. 2015. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, 2425–2433.
  • [Banerjee and Lavie2005] Banerjee, S., and Lavie, A. 2005. Meteor: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, 65–72.
  • [Bau et al.2017] Bau, D.; Zhou, B.; Khosla, A.; Oliva, A.; and Torralba, A. 2017. Network Dissection: Quantifying Interpretability of Deep Visual Representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • [Bilgic and Mooney2005] Bilgic, M., and Mooney, R. 2005. Explaining Recommendations: Satisfaction vs. Promotion. In Proceedings of Beyond Personalization 2005: A Workshop on the Next Stage of Recommender Systems Research at the 2005 International Conference on Intelligent User Interfaces.
  • [Cho et al.2014] Cho, K.; van Merriënboer, B.; Gülçehre, Ç.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning Phrase Representations Using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 1724–1734. Doha, Qatar: Association for Computational Linguistics.
  • [Fukui et al.2016] Fukui, A.; Park, D. H.; Yang, D.; Rohrbach, A.; Darrell, T.; and Rohrbach, M. 2016. Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding. In Proceedings of the 2016 Empirical Methods on Natural Language Processing.
  • [Gan et al.2017] Gan, C.; Li, Y.; Li, H.; Sun, C.; and Gong, B. 2017. VQS: Linking Segmentations to Questions and Answers for Supervised Attention in VQA and Question-Focused Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 3.
  • [Hendricks et al.2016] Hendricks, L. A.; Akata, Z.; Rohrbach, M.; Donahue, J.; Schiele, B.; and Darrell, T. 2016. Generating Visual Explanations. In European Conference on Computer Vision, 3–19. Springer.
  • [Hendricks et al.2018] Hendricks, L. A.; Hu, R.; Darrell, T.; and Akata, Z. 2018. Grounding Visual Explanations. arXiv preprint arXiv:1807.09685.
  • [Hu et al.2018] Hu, R.; Dollár, P.; He, K.; Darrell, T.; and Girshick, R. 2018. Learning to Segment Every Thing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • [Kingma and Ba2014] Kingma, D. P., and Ba, J. 2014. Adam: A Method for Stochastic Optimization. Proceedings of the 3rd International Conference on Learning Representations.
  • [Kitaev and Klein2018] Kitaev, N., and Klein, D. 2018. Constituency Parsing with a Self-Attentive Encoder. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Melbourne, Australia: Association for Computational Linguistics.
  • [Lin et al.2014] Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft COCO: Common Objects in COntext. In European conference on computer vision, 740–755. Springer.
  • [Lin2004] Lin, C.-Y. 2004. Rouge: A Package for Automatic Evaluation of Summaries. Text Summarization Branches Out.
  • [Lu et al.2016] Lu, J.; Yang, J.; Batra, D.; and Parikh, D. 2016. Hierarchical Question-Image Co-attention for Visual Question Answering. In Advances In Neural Information Processing Systems, 289–297.
  • [Maas, Hannun, and Ng2013] Maas, A. L.; Hannun, A. Y.; and Ng, A. Y. 2013. Rectifier Nonlinearities Improve Neural Network Acoustic Models. In International Conference on Machine Learning, volume 30, 3.
  • [Papineni et al.2002] Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. Bleu: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, 311–318. Stroudsburg, PA, USA: Association for Computational Linguistics.
  • [Park et al.2018] Park, D. H.; Hendricks, L. A.; Akata, Z.; Rohrbach, A.; Schiele, B.; Darrell, T.; and Rohrbach, M. 2018. Multimodal Explanations: Justifying Decisions and Pointing to the Evidence. In Proceedings of the IEEE conference on computer vision and pattern recognition.
  • [Pele and Werman2008] Pele, O., and Werman, M. 2008. A Linear Time Histogram Metric for Improved Sift Matching. In Computer Vision–ECCV 2008, 495–508. Springer.
  • [Pennington, Socher, and Manning2014] Pennington, J.; Socher, R.; and Manning, C. D. 2014. Glove: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing, 1532–1543.
  • [Ren et al.2015] Ren, S.; He, K.; Girshick, R.; and Sun, J. 2015. Faster R-CNN: Towards Real-time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems, 91–99.
  • [Selvaraju et al.2017] Selvaraju, R. R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; and Batra, D. 2017. Grad-cam: Visual Explanations from Deep Networks via Gradient-Based Localization. In Proceedings of the IEEE International Conference on Computer Vision.
  • [Vedantam, Lawrence Zitnick, and Parikh2015] Vedantam, R.; Lawrence Zitnick, C.; and Parikh, D. 2015. Cider: Consensus-Based Image Description Evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4566–4575.
  • [Wu, Hu, and Mooney2018] Wu, J.; Hu, Z.; and Mooney, R. J. 2018. Joint image captioning and question answering. In VQA Challenge and Visual Dialog Workshop at the 31st IEEE Conference on Computer Vision and Pattern Recognition (CVPR-18).