Multimodal Explanations: Justifying Decisions and Pointing to the Evidence

02/15/2018 ∙ by Dong Huk Park, et al.

Deep models that are both effective and explainable are desirable in many settings; prior explainable models have been unimodal, offering either image-based visualization of attention weights or text-based generation of post-hoc justifications. We propose a multimodal approach to explanation, and argue that the two modalities provide complementary explanatory strengths. We collect two new datasets to define and evaluate this task, and propose a novel model which can provide joint textual rationale generation and attention visualization. Our datasets define visual and textual justifications of a classification decision for activity recognition tasks (ACT-X) and for visual question answering tasks (VQA-X). We quantitatively show that training with the textual explanations not only yields better textual justification models, but also better localizes the evidence that supports the decision. We also qualitatively show cases where visual explanation is more insightful than textual explanation, and vice versa, supporting our thesis that multimodal explanation models offer significant benefits over unimodal approaches.




1 Introduction

Explaining decisions is an integral part of human communication, understanding, and learning, and humans naturally provide both deictic (pointing) and textual modalities in a typical explanation. We aim to build deep learning models that also are able to explain their decisions with similar fluency in both visual and textual modalities. Previous machine learning methods for explanation were able to provide a text-only explanation conditioned on an image in context of a task, or were able to visualize active intermediate units in a deep network performing a task, but were unable to provide explanatory text grounded in an image.

We propose a new model which can jointly generate visual and textual explanations, using an attention mask to localize salient regions when generating textual rationales. We argue that to train effective models, measure the quality of the generated explanations, compare with other methods, and understand when methods will generalize, it is important to have access to ground truth human explanations. Unfortunately, there is a dearth of datasets which include examples of how humans justify specific decisions. Thus, we collect two new datasets, ACT-X and VQA-X, which allow us to train and evaluate our novel model, which we call the Pointing and Justification Explanation (PJ-X) model. PJ-X is explicitly multimodal: it incorporates an explanatory attention step, which allows our model to both visually point to the evidence and justify a model decision with text.

Figure 1: For a given question and an image, our Pointing and Justification Explanation (PJ-X) model predicts the answer and multimodal explanations which both point to the visual evidence for a decision and provide textual justifications. We show that considering multimodal explanations results in better visual and textual explanations.

To illustrate the utility of multimodal explanations, consider Figure 1. In both examples, the question “Is this a healthy meal?” is asked, and the PJ-X model correctly answers either “no” or “yes” depending on the visual input. To justify why the meal in the left image is not healthy, the generated textual justification mentions the kinds of unhealthy food in the image (“hot dog” and “toppings”). In addition to mentioning the unhealthy food, our model is able to point to the hot dog in the image. Likewise, to justify why the meal in the image on the right is healthy, the textual explanation model mentions “vegetables”. Note that the PJ-X model then points to the vegetables, which are mentioned in the textual explanation, but not to other items in the image, such as the bread.

We propose VQA and activity recognition as testbeds for studying explanations because they are challenging and important visual tasks which have interesting properties for explanation. VQA is a widely studied multimodal task that requires visual and textual understanding as well as commonsense knowledge. The newly collected VQA v2 dataset [16] includes complementary pairs of questions and answers. Complementary VQA pairs ask the same question of two semantically similar images which have different answers. As the two images are semantically similar, VQA models must employ fine-grained reasoning to answer the question correctly. Not only is this an interesting and useful setting for measuring overall VQA performance, but it is also interesting when studying explanations. By comparing explanations from complementary pairs, we can more easily determine whether our explanations focus on the important factors for making a decision.

Additionally, we collect annotations for activity recognition using the MPII Human Pose (MHP) dataset [2]. Activity recognition in still images relies on a variety of cues, such as pose, global context, and the interaction between humans and objects. Though a recognition model can potentially classify an activity correctly, it is not capable of indicating which factors influence the decision process. Furthermore, classifying specific activities requires understanding fine-grained differences (e.g., “road biking” and “mountain biking” include similar objects like “bike” and “helmet,” but road biking occurs on a road whereas mountain biking occurs on a mountain path). Such fine-grained differences are interesting yet difficult to capture when explaining neural network decisions.

In sum, we present ACT-X and VQA-X, two novel datasets of human-annotated multimodal explanations (visual and textual) for activity recognition and visual question answering. These datasets allow us to train the Pointing and Justification (PJ-X) model, which goes beyond current visual explanation systems by producing multimodal explanations, justifying the predicted answer post-hoc by visual pointing and textual justification. Our datasets also allow us to effectively evaluate explanation models, and we show that the PJ-X model outperforms strong baselines and, importantly, that by generating multimodal explanations, we outperform models which only produce visual or textual explanations. We will release our model architecture, learned weights, and datasets upon acceptance of this paper.

2 Related Work

Explanations. Early textual explanation models span a variety of applications (e.g., medical [30] and feedback for teaching programs [19, 31, 10]) and are generally template based. More recently, [17] developed a deep network to generate natural language justifications of a fine-grained object classifier. However, unlike our model, it does not provide multimodal explanations. Furthermore, [17] could not train on reference human explanations as no such datasets existed. We provide two datasets with reference textual explanations to enable more research in the direction of textual explanation generation.

A variety of work has proposed methods to visually explain decisions. Some methods find discriminative visual patches [12, 7] whereas others aim to understand intermediate features which are important for end decisions [37, 13, 38], e.g. what a certain neuron represents. Our model PJ-X points to visual evidence via an attention mechanism, which is an intuitive way to convey knowledge about what is important to the network without requiring domain knowledge. Unlike prior work, PJ-X generates multimodal explanations in the form of explanatory sentences and attention maps pointing to the visual evidence.

Prior work has investigated how well generated visual explanations align with human gaze [11]. However, when answering a question, humans do not always look at image regions which are necessary to explain a decision. For example, given the question “What is the name of the restaurant?”, human gaze might capture other buildings before settling on the restaurant. In contrast, when we collect our annotations, we allow annotators to view the entire image and ask them to point to the most relevant visual evidence for making a decision. Furthermore, our visual explanations are collected in conjunction with textual explanations to build and evaluate multimodal explanation models.

Figure 2: In comparison to descriptions, our VQA-X explanations focus on the evidence that pertains to the question and answer instead of generally describing the scene. For ACT-X, our explanations are task specific whereas descriptions are more generic.
Dataset Split #Imgs #Q/A Pairs #Unique Q. #Unique A. #Expl. (Avg. #w) Expl.Vocab Size #Comple. Pairs #Visual Ann.
VQA-X Train ()
Val ()
Test ()
Total ()
ACT-X Train ()
Val ()
Test ()
Total ()
Table 1: Dataset statistics for VQA-X (top) and ACT-X (bottom). Unique Q. = Unique questions, Unique A. = Unique answers, Expl. = Explanations, Avg. #w = Average number of words, Comple. Pairs = Complementary pairs, Visual Ann. = Visual annotations.

Visual Question Answering and Attention. Initial approaches to VQA used full-frame representations [22], but most recent approaches use some form of spatial attention [36, 35, 39, 9, 34, 29, 14, 18]. We base our method on [14], the winner of the VQA 2016 challenge; however, we use an element-wise product as opposed to compact bilinear pooling. [18] has explored the element-wise product for VQA just as we do in our method; however, [18] improves performance by applying hyperbolic tangent (TanH) after the multimodal pooling, whereas we improve by applying signed square-root and L2 normalization.

Activity Recognition. Recent work on activity recognition in still images relies on a variety of cues, such as pose and global context [15, 23, 26]. Specifically, [15] considers additional image regions and [23] considers a global image feature in addition to the region where an activity occurs. Generally, works on the MPII Human Activities dataset provide the ground truth location of a human at test time [15]. In contrast, we consider a more realistic scenario and do not make any assumptions about where the activity occurs at test time. Our model relies on attention to focus on important parts of an image for classification and explanation.

3 Multimodal Explanations

We propose multimodal explanation tasks with visual and textual components, defined on both visual question answering and activity recognition testbeds. To train and evaluate models for this task we collect two multimodal explanation datasets: Visual Question Answering Explanation (VQA-X) and Activity Explanation (ACT-X) (see Table 1 for a summary). For each dataset we collect textual and visual explanations from human annotators.

VQA Explanation Dataset (VQA-X). The Visual Question Answering (VQA) dataset [3] contains open-ended questions about images which require understanding vision, language, and commonsense knowledge to answer. VQA consists of approximately K MSCOCO images [21], with questions per image and answers per question.

(a) VQA-X
(b) ACT-X
Figure 3: Human annotated visual explanations. On the left: example annotations collected on VQA-X dataset. On the right: Example annotations collected on ACT-X dataset. The visual evidence that justifies the answer is segmented in yellow.
Figure 4: Human visual annotations from VQA-HAT and VQA-X. We aggregate all the annotations in each image and normalize them to create a probability distribution. The distribution is then visualized over the image as a heatmap.

Many questions in VQA are of the sort: “What is the color of the banana?”. It is difficult for humans to explain answers to such questions because it requires explaining a fundamental visual property: color. Thus, we aim to provide textual explanations for questions that go beyond such trivial cases. To do this, we consider the annotations collected in [40] which say how old a human must be to answer a question. We find that questions which require humans to be of age 9 or higher are generally interesting to explain.

Additionally, we consider complementary pairs from the VQA v2 dataset [16]. Complementary pairs consist of a question and two similar images which give two different answers. Complementary pairs are particularly interesting for the explanation task because they allow us to understand whether explanations name the correct evidence based on image content, or whether they just memorize which content to consider based off specific question types.

We collect a single textual explanation for QA pairs in the training set and three textual explanations for test/val QA pairs. Some examples can be seen in Figure 2.

Action Explanation Dataset (ACT-X). The MPII Human Pose (MHP) dataset [2] contains K images extracted from YouTube videos. From the MHP dataset, we select all images that pertain to activities, resulting in images total. For each image we collect three explanations. During data annotation, we ask the annotators to complete the sentence “I can tell the person is doing (action) because..” where the action is the ground truth activity label. We also ask them to use at least 10 words and to avoid mentioning the activity class in the sentence. The MHP dataset also comes with sentence descriptions provided by [27]. See Figure 2 for examples of descriptions and explanations.

Ground truth for pointing. In addition to textual justifications, we collect visual explanations from humans for both VQA-X and ACT-X datasets in order to evaluate how well the attention of our model corresponds to where humans think the evidence for the answer is. Human-annotated visual explanations are collected via Amazon Mechanical Turk, where we use the segmentation interface from the OpenSurfaces Project [6]. Annotators are provided with an image and an answer (question and answer pair for VQA-X, class label for ACT-X). They are asked to segment objects and/or regions that most prominently justify the answer. For each dataset we randomly sample images from the test split, and for each image we collect 3 annotations. Some examples can be seen in Figure 3.
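The aggregation described in the caption of Figure 4 (sum the annotations for an image, then normalize them into a probability distribution) can be sketched as below. This is a minimal illustration, assuming each annotation is available as a binary mask of the same shape; the function name `aggregate_annotations` and the uniform fallback for empty masks are assumptions made here, not part of the datasets' tooling.

```python
import numpy as np

def aggregate_annotations(masks):
    """Sum per-annotator binary masks and normalize to a probability map.

    `masks` is a sequence of equally shaped 2D arrays (assumed format).
    """
    stacked = np.stack([np.asarray(m, dtype=np.float64) for m in masks])
    total = stacked.sum(axis=0)
    mass = total.sum()
    if mass == 0:
        # Assumption: with no annotated pixels, fall back to a uniform map.
        return np.full(total.shape, 1.0 / total.size)
    return total / mass

# Three toy annotations of a 2x2 image.
masks = [
    np.array([[0, 1], [0, 1]]),
    np.array([[0, 1], [0, 0]]),
    np.array([[0, 1], [1, 1]]),
]
heatmap = aggregate_annotations(masks)
```

Pixels marked by more annotators receive proportionally more probability mass, which is what a heatmap rendering of the distribution would show.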

Figure 5: Our Pointing and Justification (PJ-X) architecture generates a multimodal explanation which includes textual justification (“it contains a variety of vegetables on the table”) and pointing to the visual evidence.

Comparing with VQA-HAT. A thorough comparison between our dataset and the VQA-HAT dataset from [11] is currently not viable because the two datasets have different splits and the overlap is small. However, we present a qualitative comparison in Figure 4. In the first row, our VQA-X annotation has a finer granularity since it segments out the objects of interest more accurately than the VQA-HAT annotation. In the second row, our annotation contains less extraneous information than the VQA-HAT annotation. Since the VQA-HAT annotations are collected by having humans “unblur” the images, they are more likely to introduce noise when irrelevant regions are uncovered.

4 Pointing and Justification Model (PJ-X)

The goal of our work is to implement a multimodal explanation system that justifies a decision with natural language and points to the evidence. We deliberately design our Pointing and Justification Model (PJ-X) to allow training these two tasks. Specifically we want to rely on natural language justifications and the classification labels as the only supervision. We design the model to learn how to point in a latent way. For the pointing we rely on an attention mechanism [4] which allows the model to focus on a spatial subset of the visual representation.

We first predict the answer given an image and question using the answering model. Then given the answer, question, and image, we generate visual and textual explanations with the multimodal explanation model. An overview of our model is presented in Figure 5.

Answering model. In visual question answering the goal is to predict an answer given a question and an image. For activity recognition we do not have an explicit question. Thus, we ignore the question, which is equivalent to setting the question representation to a vector of ones.

We base our answering model on the overall architecture from the MCB model [14], but replace the MCB unit with a simpler element-wise multiplication to pool multimodal features. We found that this leads to similar performance, but much faster training (see supplemental material).

In detail, we extract spatial image features from the last convolutional layer of ResNet-152, followed by convolutions, giving a spatial image feature. We encode the question with a 2-layer LSTM. We combine the question encoding and the spatial image feature using element-wise multiplication followed by signed square-root, L2 normalization, and Dropout, and two more layers of convolutions with ReLU in between. This process gives us a spatial attention map. We apply softmax to produce a normalized soft attention map.

The attention map is then used to take the weighted sum over the image features, and this representation is once again combined with the LSTM feature to predict the answer as a classification problem over all answers. We provide an extended formalized version in the supplemental.
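The attention step described above can be sketched in NumPy as follows. This is a rough illustration under several assumptions that are not confirmed by the text: the convolutions are treated as 1x1 (per-location linear maps), L2 normalization is taken over the channel axis, Dropout is omitted (as at inference time), and all names, shapes, and weights (`attend`, `w1`, `w2`) are hypothetical.

```python
import numpy as np

def signed_sqrt(x):
    # Signed square-root: keeps the sign, compresses the magnitude.
    return np.sign(x) * np.sqrt(np.abs(x))

def l2_normalize(x, axis, eps=1e-12):
    norm = np.sqrt((x ** 2).sum(axis=axis, keepdims=True))
    return x / (norm + eps)

def softmax(x):
    # Softmax over all entries of x (here: all spatial locations).
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(image_feat, question_feat, w1, w2):
    """image_feat: (C, N, M); question_feat: (C,); w1: (H, C); w2: (H,)."""
    # Element-wise product, broadcasting the question over the grid.
    pooled = image_feat * question_feat[:, None, None]
    pooled = l2_normalize(signed_sqrt(pooled), axis=0)
    # Two assumed 1x1 convolutions with ReLU in between.
    hidden = np.maximum(np.einsum("hc,cnm->hnm", w1, pooled), 0.0)
    scores = np.einsum("h,hnm->nm", w2, hidden)
    alpha = softmax(scores)  # normalized soft attention map, shape (N, M)
    # Attention-weighted sum over the image features.
    attended = (image_feat * alpha[None]).sum(axis=(1, 2))
    return alpha, attended

rng = np.random.default_rng(0)
C, N, M, H = 8, 3, 3, 4  # toy sizes, chosen arbitrarily
alpha, attended = attend(
    rng.normal(size=(C, N, M)),
    np.ones(C),  # a vector of ones, as in the activity recognition case
    rng.normal(size=(H, C)),
    rng.normal(size=H),
)
```

Passing a vector of ones as the question feature mirrors the activity recognition setting, where the element-wise product leaves the image features unchanged.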

Multimodal explanation model.

We argue that to generate multimodal explanation, we should condition it on question, answer, and image. For instance, to be able to explain “Because they are Vancouver police” in Figure 2, the model needs to see the question, i.e. “Can these people arrest someone?”, the answer, i.e. “Yes” and the image, i.e. the “Vancouver police” banner on the motorcycles.

We model this by pooling image, question, and answer representations to generate attention map, our Visual Pointing. The Visual Pointing is further used to create attention features that guide the generation of our Textual Justification.

More specifically, the answer predictions are embedded in a vector space, followed by a non-linearity and a fully connected layer. To allow the model to learn how to attend to relevant spatial locations based on the answer, image, and question, we combine this answer feature with the Question-Image embedding from the answering model. Applying convolutions, element-wise multiplication followed by signed square-root, L2 normalization, and Dropout results in a multimodal feature. ReLU is used as the non-linearity.

Next we predict an attention map and apply softmax to produce a normalized soft attention map, our Visual Pointing, which aims to point at the evidence of the generated explanation.


Using the Visual Pointing attention map, we compute the attended visual representation, and merge it with the LSTM feature that encodes the question and the embedding feature that encodes the answer.
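For concreteness, the softmax normalization and the attention-weighted pooling described here can be written as below. The notation is assumed for illustration only: $\bar{\alpha}_{n,m}$ is the unnormalized attention score at grid location $(n, m)$, $\alpha_{n,m}$ the normalized Visual Pointing weight, $v_{n,m}$ the image feature at that location, $\hat{v}$ the attended visual representation, and $N \times M$ the size of the spatial grid. How $\hat{v}$ is merged with the question and answer features is not specified in this sketch.

```latex
\alpha_{n,m} = \frac{\exp(\bar{\alpha}_{n,m})}{\sum_{i=1}^{N}\sum_{j=1}^{M}\exp(\bar{\alpha}_{i,j})},
\qquad
\hat{v} = \sum_{n=1}^{N}\sum_{m=1}^{M} \alpha_{n,m}\, v_{n,m}
```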


This combined feature is then fed into an LSTM decoder to generate our Textual Justifications that are conditioned on image, question, and answer.

Textual Justifications are a sequence of words, and our model predicts one word at each time step conditioned on the previous word and the hidden state of the LSTM.
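A generic way to write such a word-by-word decoder is shown below, as a sketch rather than the model's exact formulation. The symbols are assumptions: $w_t$ is the word at step $t$, $h_t$ the LSTM hidden state, and $W_o$, $b_o$ a hypothetical output projection onto the vocabulary. The way the combined image, question, and answer feature enters the LSTM is left out.

```latex
h_t = \mathrm{LSTM}(w_{t-1}, h_{t-1}),
\qquad
p(w_t \mid w_{t-1}, h_{t-1}) = \mathrm{softmax}(W_o h_t + b_o)
```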


5 Experiments

In this section, after detailing the experimental setup, we present quantitative results on ablations done for textual justification and visual pointing tasks, and discuss their implications. Additionally, we provide and analyze qualitative results for both tasks.

5.1 Experimental Setup

Here, we detail our experimental setup in terms of model training, hyperparameter settings, and evaluation metrics.

Model training and hyperparameters. For VQA, the answering model of PJ-X is pre-trained on the VQA v2 training set [16]. We then freeze or finetune the weights of the answering model when training the multimodal explanation model on textual annotations as the VQA-X dataset is significantly smaller than the original VQA training set. For activity recognition, answering and explanation components of PJ-X are trained jointly. The spatial feature size of PJ-X is . For VQA, we limit the answer space to the most frequently occurring answers on the training set (i.e. ) whereas for activity recognition, . We set the answer embedding size as for both tasks.

Evaluation metrics. We evaluate our textual justifications with respect to the BLEU-4 [24], METEOR [5], ROUGE [20], CIDEr [32] and SPICE [1] metrics, which measure the degree of similarity between generated and ground truth sentences. We also include human evaluation since automatic metrics do not always reflect human preference. We randomly choose 1000 data points each from the test splits of the VQA-X and ACT-X datasets, where the model predicts the correct answer, and then for each data point ask 3 human subjects to judge whether a generated explanation is better than, worse than, or equivalent to the ground truth explanation (we note that human judges do not know which explanation is ground truth and the order of sentences is randomized). We report the percentage of generated explanations which are equivalent to or better than ground truth human explanations, when at least 2 out of 3 human judges agree.
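The human evaluation score can be computed along the following lines. This sketch adopts one reading of the protocol, namely that an explanation counts as "equivalent to or better than" the ground truth when at least 2 of its 3 judges rate it "better" or "equivalent". The label strings and the function name are hypothetical.

```python
def percent_equal_or_better(judgments):
    """judgments: list of 3-tuples with labels 'better', 'equivalent', 'worse'."""
    hits = 0
    for votes in judgments:
        # Count judges who rated the generated explanation favorably.
        favorable = sum(v in ("better", "equivalent") for v in votes)
        if favorable >= 2:  # at least 2 out of 3 judges agree
            hits += 1
    return 100.0 * hits / len(judgments)

# Toy judgments for four generated explanations.
judgments = [
    ("better", "worse", "equivalent"),
    ("worse", "worse", "better"),
    ("equivalent", "equivalent", "equivalent"),
    ("worse", "better", "worse"),
]
score = percent_equal_or_better(judgments)
```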

For the visual pointing task, we use Earth Mover’s Distance (EMD) [28], which measures the distance between two probability distributions over a region. We use the code from [25] to compute EMD. We also report Rank Correlation, which was used in [11]. For computing Rank Correlation, we follow [11]: we scale the generated attention map and the human ground-truth annotations from the VQA-X/ACT-X/VQA-HAT datasets to a common resolution, rank the pixel values, and then compute the correlation between these two ranked lists.
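The rank correlation procedure (rescale both maps, rank the pixel values, correlate the ranks) can be sketched as below. This is an illustrative implementation, not the evaluation code of [11]: the nearest-neighbor resizing, the average-rank treatment of ties, the default `size=14`, and the choice to return 0 for constant maps are all assumptions.

```python
import numpy as np

def resize_nearest(arr, size):
    """Nearest-neighbor resize of a 2D array to (size, size)."""
    arr = np.asarray(arr, dtype=np.float64)
    rows = np.arange(size) * arr.shape[0] // size
    cols = np.arange(size) * arr.shape[1] // size
    return arr[np.ix_(rows, cols)]

def average_ranks(values):
    """Rank values 1..n, giving tied values their average rank."""
    values = np.asarray(values, dtype=np.float64)
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values), dtype=np.float64)
    ranks[order] = np.arange(1, len(values) + 1)
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    sums = np.bincount(inverse, weights=ranks)
    return (sums / counts)[inverse]

def rank_correlation(attention, human, size=14):
    """Correlation between the ranked pixel values of two maps."""
    a = average_ranks(resize_nearest(attention, size).ravel())
    h = average_ranks(resize_nearest(human, size).ravel())
    if a.std() == 0 or h.std() == 0:
        # Assumption: a constant map carries no ranking information.
        return 0.0
    return float(np.corrcoef(a, h)[0, 1])

toy = np.arange(16, dtype=np.float64).reshape(4, 4)
same = rank_correlation(toy, toy, size=4)
opposite = rank_correlation(toy, -toy, size=4)
```

Because only ranks are compared, the score is unchanged by any monotonic rescaling of the attention values, which makes it suitable for comparing maps produced by differently normalized models.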

| Approach | GT-ans Conditioning | Training Data | Att. for Expl. | VQA-X B | VQA-X M | VQA-X R | VQA-X C | VQA-X S | VQA-X Human eval | ACT-X B | ACT-X M | ACT-X R | ACT-X C | ACT-X S | ACT-X Human eval |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| [17] | Yes | Desc. | No | – | – | – | – | – | – | 12.9 | 15.9 | 39.0 | 12.4 | 12.0 | 17.4 |
| Ours on Descriptions | Yes | Desc. | Yes | 6.1 | 12.8 | 26.4 | 36.2 | 12.1 | 34.5 | 6.9 | 12.9 | 28.3 | 20.3 | 7.3 | 22.9 |
| Ours w/o Attention | Yes | Expl. | No | 18.0 | 17.6 | 42.4 | 66.3 | 14.3 | 40.1 | 16.9 | 17.0 | 42.0 | 33.3 | 10.6 | 21.4 |
| Ours | Yes | Expl. | Yes | 19.8 | 18.6 | 44.0 | 73.4 | 15.4 | 45.1 | 24.5 | 21.5 | 46.9 | 58.7 | 16.0 | 38.2 |
| Ours on Descriptions | No | Desc. | Yes | 5.9 | 12.6 | 26.3 | 35.2 | 11.9 | – | 5.2 | 11.0 | 26.5 | 10.4 | 4.6 | – |
| Ours w/o Attention | No | Expl. | No | 18.0 | 17.3 | 42.1 | 63.6 | 13.8 | – | 11.9 | 13.6 | 37.9 | 16.9 | 5.7 | – |
| Ours | No | Expl. | Yes | 19.5 | 18.2 | 43.4 | 71.3 | 15.1 | – | 15.3 | 15.6 | 40.0 | 22.0 | 7.2 | – |

Table 2: Evaluation of Textual Justifications. Evaluated automatic metrics: BLEU-4 (B), METEOR (M), ROUGE (R), CIDEr (C) and SPICE (S). Reference sentence for human and automatic evaluation is always an explanation. All in %. Our proposed model compares favorably to baselines.

5.2 Textual Justification

We ablate PJ-X and compare with related approaches on our VQA-X and ACT-X datasets through automatic and human evaluations for the generated explanations.

Details on compared models. We compare with the state-of-the-art [17] using publicly available code. For fair comparison, we use ResNet features extracted from the entire image when training [17]. The generated sentences from [17] are conditioned on both the image and the class label. [17] uses discriminative loss, which enforces the generated sentence to contain class-specific information, to back-propagate policy gradients when training the language generator, and thus involves training a separate sentence classifier to generate rewards. Our model does not use discriminative loss/policy gradients and does not require defining a reward. Note that [17] is trained with descriptions. Similarly, “Ours on Descriptions” is an ablation in which we train PJ-X on descriptions instead of explanations. “Ours w/o Attention” is similar to [17] in the sense that there is no attention mechanism involved when generating explanations; however, it does not use the discriminative loss and is trained on explanations instead of descriptions.

Descriptions vs. Explanations. “Ours” outperforms “Ours on Descriptions” by a large margin on both datasets, which is expected as descriptions are insufficient for the task of generating explanations. Additionally, “Ours” compares favorably to [17] even in the case when “Ours” generates textual justifications conditioned on the prediction, not the ground-truth answer. These results demonstrate the limitation of training explanation systems with descriptions, and thus support the necessity of having datasets specifically curated for explanations. “Ours on Descriptions” performs worse on certain metrics compared to [17], which may be attributed to additional training signals generated from discriminative loss and policy gradients, but further investigation is left for future work.

Unimodal explanations vs. Multimodal explanations. Including attention when generating textual justifications allows us to build a multimodal explanation model. Aside from the immediate benefit of providing a visual rationale for a model’s decision, learning to point at visual evidence helps generate better textual justifications. As can be seen from Table 2, “Ours” greatly improves textual justifications compared to “Ours w/o Attention” on both datasets, demonstrating the value of designing multimodal explanation systems.

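| Approach | Earth Mover’s Distance (lower is better): VQA-X | Earth Mover’s Distance: ACT-X | Rank Correlation (higher is better): VQA-X | Rank Correlation: ACT-X | Rank Correlation: VQA-HAT |
|---|---|---|---|---|---|
| Random Point | 6.71 | 6.59 | +0.0017 | +0.0003 | -0.0001 |
| Uniform | 3.60 | 3.25 | +0.0003 | -0.0001 | -0.0007 |
| HieCoAtt-Q [11] | – | – | – | – | +0.2640 |
| Answering Model | 2.77 | 4.78 | +0.2211 | +0.0104 | +0.2234 |
| Ours | 2.64 | 2.54 | +0.3423 | +0.3933 | +0.3964 |

Table 3: Evaluation of Visual Pointing Justifications. For rank correlation, all results have standard error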

5.3 Visual Pointing

We compare the visual pointing performance of PJ-X to several baselines and report quantitative results with corresponding analysis.

Details on compared models. We compare our model against the following baselines. Random Point randomly attends to a single point in a grid. Uniform Map generates an attention map that is uniformly distributed over the grid. In addition to these baselines, we also compare PJ-X attention maps with those generated from state-of-the-art VQA systems such as [11].
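The two simple baselines can be generated as in the sketch below. The function names and the grid size used in the example are illustrative choices, not values taken from the experiments.

```python
import numpy as np

def random_point_map(height, width, rng):
    """Attention map with all mass on one randomly chosen grid cell."""
    att = np.zeros((height, width))
    idx = rng.integers(height * width)
    att[np.unravel_index(idx, (height, width))] = 1.0
    return att

def uniform_map(height, width):
    """Attention map spread evenly over the whole grid."""
    return np.full((height, width), 1.0 / (height * width))

rng = np.random.default_rng(0)
random_att = random_point_map(14, 14, rng)  # grid size chosen for illustration
uniform_att = uniform_map(14, 14)
```

Both maps sum to one, so they can be compared against human annotations with the same distribution-based metrics as learned attention maps.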

Improved localization with textual explanations. We evaluate attention maps using the Earth Mover’s Distance (lower is better) and Rank Correlation (higher is better) on the VQA-X and ACT-X datasets in Table 3. From Table 3, we observe that “Ours” not only outperforms the baselines Random Point and Uniform Map, but also our answering model and [11] on both datasets and on both metrics. The attention maps generated from our answering model and [11] do not receive training signals from the textual annotations as they are only trained to predict the correct answer, whereas the attention maps generated from the PJ-X multimodal explanation model are latently learned through supervision of textual annotations. The experimental results imply that learning to generate textual explanations helps improve the visual pointing task, and further confirm the advantage of having a multimodal explanation system.

Figure 6: VQA-X qualitative results: For each image the PJ-X model provides an answer and a justification, and points to the evidence for that justification. We show pairs of images from VQA v2 complementary pairs.
Figure 7: ACT-X qualitative results: For each image the PJ-X model provides an answer and a justification, and points to the evidence for that justification.

5.4 Qualitative Results

In this section we present our qualitative results on VQA-X and ACT-X datasets demonstrating that our model generates high quality sentences and the attention maps point to relevant locations in the image.

VQA-X. Figure 6 shows qualitative results on our VQA-X dataset. We show pairs of images that form complementary pairs in VQA v2. Our textual justifications are able to both capture common sense and discuss specific image parts important for answering a question. For example, when asked “Is this a zoo?”, the explanation model is able to discuss what the concept of “zoo” represents, i.e. “animals in an enclosure”. When determining whether the water is calm, which requires discussing specific image regions, the textual justification discusses foam on the waves.

Visually, we notice that our attention model is able to point to important visual evidence. For example, in the top row of Figure 6, for the question “Is this a zoo?” the visual explanation focuses on the field in one case, and the fence in another.

ACT-X. Figure 7 shows results on our ACT-X dataset. Textual explanations discuss a variety of visual cues important for correctly classifying activities such as global context, e.g. “a grassy lawn / a mountainous area”, and person-object interaction, e.g. “pushing a lawn mower / riding a bicycle” for mowing lawn and mountain biking, respectively. These explanations require determining which of many multiple cues are appropriate to justify a particular action.

Our model points to visual evidence important for understanding each human activity. For example, to classify “mowing lawn” in the top row of Figure 7, the model focuses both on the person, who is on the grass, and on the lawn mower. Our model can also differentiate between similar activities based on the context, e.g. “mountain biking” or “road biking”.

Figure 8: Qualitative results comparing the insightfulness of visual pointing and textual justification. The left example demonstrates how visual pointing is more informative than textual justification whereas the right example shows the opposite.

Explanation Consistent with Incorrect Prediction. Generating reasonable explanations for correct answers is important, but it is also crucial to see how a system behaves in the face of incorrect predictions. Such analysis provides insight into whether the explanation generation component of the model is consistent with the answer prediction component. In Figure 9, we can see that the explanations are consistent with the incorrectly predicted answer for both VQA-X and ACT-X. For instance, in the bottom-right example, we see that the model attends to a vacuum-like object and textually justifies the prediction “vacuuming”. Such consistency between the answering model and the explanation model is also shown in Table 2, where we see a drop in performance when explanations are conditioned on predictions (bottom rows) instead of the ground-truth answers (top rows).

Figure 9: Visual and textual explanations generated by our model conditioned on incorrect predictions.

5.5 Usefulness of Multimodal Explanations

In this section, we address some of the advantages of generating multimodal explanations. In particular, we look at cases where visual explanations are more informative than the textual explanations, and vice versa. We also investigate how multimodal explanations can help humans diagnose the performance of an AI system.

Complementary Explanations. Multimodal explanations can support different tasks or support each other. Interestingly, in Figure 8, we present some examples where visual pointing is more insightful than textual justification, and vice versa. Looking at the left example in Figure 8, it is rather difficult to explain “leaning” with language and the model resorts to generating a correct, yet uninsightful sentence. However, the concept is easily conveyed when looking at the visual pointing result. In contrast, the right example shows the opposite. Looking at only some patches of the sky presented by the visual pointing result does not necessarily confirm if the scene is cloudy or not, while it is also unclear if attending to the entire region of the sky is a desired behavior. Yet, the textual justification succinctly captures the rationale. These examples clearly demonstrate the value of generating multimodal explanations.

Diagnostic Explanations. We evaluate an auxiliary task where humans have to guess whether the system correctly or incorrectly answered the question. The predicted answer is not shown; only the image, question, correct answer, and textual/visual explanations are shown. The set contains 50% correctly answered questions. We compare our model against the models used for ablations in Table 2. Table 4 indicates that explanations are better than no explanations and that our model is more helpful than models trained on descriptions and models trained to generate textual explanations only.

Explanations           VQA-X   ACT-X
Without explanation    57.5%   51.5%
Ours on Descriptions   66.5%   72.5%
Ours w/o Attention     61.5%   76.5%
Ours                   70.0%   80.5%
Table 4: Accuracy of humans judging if the model predicted correctly.
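
As a reading aid for Table 4, the following minimal sketch shows how the judging accuracy in this diagnostic task could be computed and how large the gains over the no-explanation baseline are. The accuracies are copied from Table 4; the function names `judging_accuracy` and `gain_over_baseline` and the dictionary `TABLE_4` are hypothetical helpers introduced for illustration and are not part of our released code.

```python
# Human judging accuracies (%) copied from Table 4.
TABLE_4 = {
    "Without explanation": {"VQA-X": 57.5, "ACT-X": 51.5},
    "Ours on Descriptions": {"VQA-X": 66.5, "ACT-X": 72.5},
    "Ours w/o Attention": {"VQA-X": 61.5, "ACT-X": 76.5},
    "Ours": {"VQA-X": 70.0, "ACT-X": 80.5},
}


def judging_accuracy(human_guesses, model_was_correct):
    """Percentage of examples where the human's guess ("the model was right")
    matches whether the model's prediction was actually correct.

    Both arguments are equal-length sequences of booleans (hypothetical format).
    """
    if len(human_guesses) != len(model_was_correct):
        raise ValueError("inputs must have the same length")
    if not human_guesses:
        raise ValueError("inputs must be non-empty")
    hits = sum(g == c for g, c in zip(human_guesses, model_was_correct))
    return 100.0 * hits / len(human_guesses)


def gain_over_baseline(table, baseline="Without explanation"):
    """Percentage-point gain of every non-baseline row over the baseline row."""
    base = table[baseline]
    return {
        name: {dataset: round(acc - base[dataset], 1) for dataset, acc in row.items()}
        for name, row in table.items()
        if name != baseline
    }


if __name__ == "__main__":
    for name, gains in gain_over_baseline(TABLE_4).items():
        print(name, gains)
```

Under these numbers, showing the explanations of our full model raises human judging accuracy by 12.5 percentage points on VQA-X and 29.0 percentage points on ACT-X relative to showing no explanation.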

6 Conclusion

As a step towards explainable AI models, we proposed multimodal explanations for real-world tasks. Our model is the first capable of providing natural language justifications of decisions as well as pointing to the evidence in an image. We collected two novel explanation datasets through crowdsourcing, VQA-X for visual question answering and ACT-X for activity recognition. We quantitatively demonstrated that learning to point helps achieve high-quality textual explanations, and that using reference textual explanations to train our model helps achieve better visual pointing. Furthermore, we qualitatively demonstrated that our model is able to point to the evidence as well as to give natural sentence justifications, similar to the ones humans give.


References

  • [1] P. Anderson, B. Fernando, M. Johnson, and S. Gould. Spice: Semantic propositional image caption evaluation. In Proceedings of the European Conference on Computer Vision (ECCV), 2016.
  • [2] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2d human pose estimation: New benchmark and state of the art analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
  • [3] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. Vqa: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.
  • [4] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.
  • [5] S. Banerjee and A. Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, volume 29, pages 65–72, 2005.
  • [6] S. Bell, P. Upchurch, N. Snavely, and K. Bala. Opensurfaces: A richly annotated catalog of surface appearance. In SIGGRAPH, 2013.
  • [7] T. Berg and P. N. Belhumeur. How do you tell a blackbird from a crow? In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2013.
  • [8] O. Biran and K. McKeown. Justification narratives for individual classifications. In Proceedings of the AutoML workshop at ICML, volume 2014, 2014.
  • [9] K. Chen, J. Wang, L.-C. Chen, H. Gao, W. Xu, and R. Nevatia. Abc-cnn: An attention based convolutional neural network for visual question answering. arXiv:1511.05960, 2015.
  • [10] M. G. Core, H. C. Lane, M. Van Lent, D. Gomboc, S. Solomon, and M. Rosenberg. Building explainable artificial intelligence systems. In Proceedings of the national conference on artificial intelligence. Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999, 2006.
  • [11] A. Das, H. Agrawal, C. L. Zitnick, D. Parikh, and D. Batra. Human attention in visual question answering: Do humans and deep networks look at the same regions? CoRR, abs/1606.03556, 2016.
  • [12] C. Doersch, S. Singh, A. Gupta, J. Sivic, and A. Efros. What makes paris look like paris? ACM Transactions on Graphics, 31(4), 2012.
  • [13] V. Escorcia, J. C. Niebles, and B. Ghanem. On the relationship between visual attributes and convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • [14] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2016.
  • [15] G. Gkioxari, R. Girshick, and J. Malik. Contextual action recognition with r* cnn. In Proceedings of the IEEE international conference on computer vision, pages 1080–1088, 2015.
  • [16] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [17] L. A. Hendricks, Z. Akata, M. Rohrbach, J. Donahue, B. Schiele, and T. Darrell. Generating visual explanations. In Proceedings of the European Conference on Computer Vision (ECCV), 2016.
  • [18] J. Kim, K. W. On, J. Kim, J. Ha, and B. Zhang. Hadamard product for low-rank bilinear pooling. CoRR, abs/1610.04325, 2016.
  • [19] H. C. Lane, M. G. Core, M. Van Lent, S. Solomon, and D. Gomboc. Explainable artificial intelligence for training and tutoring. Technical report, DTIC Document, 2005.
  • [20] C.-Y. Lin. Rouge: a package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, 2004.
  • [21] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In Proceedings of the European Conference on Computer Vision (ECCV), 2014.
  • [22] M. Malinowski, M. Rohrbach, and M. Fritz. Ask your neurons: A neural-based approach to answering questions about images. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.
  • [23] A. Mallya and S. Lazebnik. Learning models for actions and person-object interactions with transfer to question answering. In Proceedings of the European Conference on Computer Vision (ECCV), 2016.
  • [24] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 311–318, 2002.
  • [25] O. Pele and M. Werman. Fast and robust earth mover’s distances. In 2009 IEEE 12th International Conference on Computer Vision, pages 460–467. IEEE, September 2009.
  • [26] L. Pishchulin, M. Andriluka, and B. Schiele. Fine-grained activity recognition with holistic and pose based features. In Proceedings of the German Conference on Pattern Recognition (GCPR), pages 678–689. Springer, 2014.
  • [27] S. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee. Learning what and where to draw. In Advances in Neural Information Processing Systems (NIPS), 2016.
  • [28] Y. Rubner, C. Tomasi, and L. J. Guibas. A metric for distributions with applications to image databases. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 1998.
  • [29] K. J. Shih, S. Singh, and D. Hoiem. Where to look: Focus regions for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [30] E. H. Shortliffe and B. G. Buchanan. A model of inexact reasoning in medicine. Mathematical biosciences, 23(3):351–379, 1975.
  • [31] M. Van Lent, W. Fisher, and M. Mancuso. An explainable artificial intelligence system for small-unit tactical behavior. In NCAI, 2004.
  • [32] R. Vedantam, C. Lawrence Zitnick, and D. Parikh. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4566–4575, 2015.
  • [33] M. R. Wick and W. B. Thompson. Reconstructive expert system explanation. Artificial Intelligence, 54(1-2):33–70, 1992.
  • [34] C. Xiong, S. Merity, and R. Socher. Dynamic memory networks for visual and textual question answering. In Proceedings of the International Conference on Machine Learning (ICML), 2016.
  • [35] H. Xu and K. Saenko. Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. In Proceedings of the European Conference on Computer Vision (ECCV), 2016.
  • [36] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. Stacked attention networks for image question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [37] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In Proceedings of the European Conference on Computer Vision (ECCV), 2014.
  • [38] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Object detectors emerge in deep scene cnns. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.
  • [39] Y. Zhu, O. Groth, M. Bernstein, and L. Fei-Fei. Visual7W: Grounded Question Answering in Images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [40] C. L. Zitnick, A. Agrawal, S. Antol, M. Mitchell, D. Batra, and D. Parikh. Measuring machine intelligence through visual question answering. CoRR, abs/1608.08716, 2016.