Modern neural networks are good at localizing objects, predicting object categories and describing scenes with natural language. However, the decision processes of neural networks are often opaque. Therefore, in order to interpret and monitor neural networks, providing explanations of network decisions has gained interest. Here, we focus on providing such explanations via natural language, i.e., textual explanations. A textual explanation system for a classification network ideally both discusses discriminative features of its predicted classes and names image-relevant attributes. However, sometimes these two goals are in opposition. For example, if one discriminative attribute frequently occurs within a class, a model may learn to justify its prediction by mentioning this attribute as a discriminative feature even if the input does not contain the attribute. In this work, we aim at resolving such conflicts and design a framework that automatically generates textual explanations that justify a classification decision and simultaneously ground discriminative object properties both in the explanation and in the image via a novel phrase-critic model. Our model significantly improves the image relevance of explanations in comparison to prior works.
In our framework, a phrase-critic model is first trained specifically to ground phrases, irrespective of linguistic fluency. Fluency is ensured by training an LSTM-based explanation model to generate candidate sentences which discuss class-discriminative features. In other words, since our phrase-critic model does not focus on fluency, it should be more reliable when judging sentence correctness; meanwhile, our explanation generation mechanism ensures the appearance of class-discriminative information as well as fluency. As a result, we obtain more accurate and linguistically satisfying explanations than those generated by only enforcing a discriminative training term, as done in prior work [2]. An important side effect of our method is a visual explanation: a visualization of the grounding, in the image, of the discriminative object parts that are mentioned in the explanation.
2 Related Work
In [8], trust is regarded as a primary reason to explore explainable intelligent systems. We argue that a system which outputs discriminative features of an object class without being image relevant is likely to lose the trust of users. Consequently, we seek to explicitly enforce image relevance with our model.
Similar to [1], we aim at providing justifications that explain which evidence is important for a decision, as opposed to introspective explanations that explain the intermediate activations of neural networks. Recently, Hendricks et al. [2] proposed to generate natural language justifications of a fine-grained object classifier. However, that approach does not ground the relevant object parts in the sentence or the image. In [6], similar explanations are generated for activities and VQA pairs. Although an attention-based explanation system is proposed, there are no constraints to ensure the actual presence of the mentioned attributes or entities in the image. Consequently, these related works [2, 6], albeit generating convincing textual explanations, do not include a process for networks to correct themselves if their textual explanation is not well grounded visually. In contrast, we propose a general process to first check whether explanations are accurately aligned with the image input and then improve textual explanations by selecting a better-aligned candidate.
3 Grounded Visual Explanations
Our model consists of three main components. First, a text generation system (based on [2]) samples many (100 in our experiments) textual explanations. The model of [2] is trained with a discriminative loss to encourage sentences to mention class-specific attributes. Next, a phrase grounding model (based on [3]) grounds phrases in the generated textual explanations. Finally, our proposed phrase-critic model determines which textual explanations are preferred based on how well they are grounded in the image. As shown in Figure 1, our system first generates possible explanations (e.g., “The red bird has a red beak and a black face”), determines whether constituent phrases (e.g., “red bird”, “red beak”, “black face”) are present in the image, and then assigns a score to the explanation.
Phrase-Critic. Our phrase-critic model is the core of our framework. It takes a list of attribute phrase groundings, each a triplet consisting of the attribute phrase, the corresponding region (more precisely, visual features extracted from the region), and the region score, and maps them into a single image relevance score. For a given attribute phrase such as “black beak”, we ground (localize) it into a corresponding image region and obtain its localization score, using an off-the-shelf localization model from [3] pretrained on Visual Genome. It is worth noting that the scores directly produced by the grounding model are not comparable across images and are difficult to combine directly with other metrics, such as sentence fluency, because they are hard to normalize across different images and different visual parts. For example, a correctly grounded phrase “yellow belly” may have a much smaller score than the correctly grounded phrase “yellow eye” because a bird belly is less well defined than a bird eye; similarly, an occluded image tends to score lower with all explanations. Hence, our phrase-critic model plays an essential role in producing normalized, usable and comparable scores. More specifically, given an image, the phrase-critic model processes the list of attribute phrase groundings by first encoding each of them into a fixed-dimensional vector with an LSTM and then applying a two-layer neural network to regress the final score, which reflects the overall image relevance of an explanation. The critic thus consists of an LSTM encoder and a small two-layer network.
We construct a few explanation pairs for each image. Each explanation pair consists of a positive explanation (image-relevant) and a negative explanation (not image-relevant). We then train our explanation critic with a margin-based ranking loss on each pair of positive and negative explanations, which encourages the model to give higher scores to positive explanations than to negative ones. The positive explanation contains matching attribute phrases, whereas the negative explanation contains mismatching attribute phrases, so the loss compares the score of the positive explanation with the score of the negative explanation. We use in our implementation.
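As a minimal sketch of a margin-based ranking loss of the kind described above, the following computes a hinge on the difference between the critic's scores for a positive and a negative explanation. The function names, the use of plain floats for scores, and the default `margin=1.0` are assumptions made for illustration; in particular, the margin is a placeholder and not necessarily the value used in our implementation.

```python
def ranking_loss(score_pos, score_neg, margin=1.0):
    """Hinge-style ranking loss for one (positive, negative) pair.

    The loss is zero once the positive explanation outscores the
    negative one by at least `margin` (an assumed placeholder value).
    """
    return max(0.0, margin + score_neg - score_pos)


def mean_ranking_loss(pairs, margin=1.0):
    """Average the loss over (positive score, negative score) pairs."""
    if not pairs:
        return 0.0
    return sum(ranking_loss(p, n, margin) for p, n in pairs) / len(pairs)
```

A pair that is already ranked correctly by a wide enough gap contributes nothing, so training effort concentrates on pairs the critic still confuses.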
Flipped Attribute Training. The simplest way to sample a negative pair is to consider a mismatching ground truth image and sentence pair. However, due to the fine-grained nature of our dataset, we empirically found that naively sampled out-of-class negative examples risk being visually too different from the image (such as a Cardinal and an American Crow). Inspired by the relative attribute paradigm for recognition and retrieval [5], we create negative examples by flipping attributes corresponding to color and size in attribute phrases. For example, if a ground truth sentence mentions a “yellow belly” and a “red head”, we might change the attribute phrase “red head” to “black head”. This means the negative explanation still mentions some attributes present in the image, but is not completely correct.
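The flipping step can be sketched as follows. This is an illustrative sketch only: the word lists `COLORS` and `SIZES` and the helper names are hypothetical, and the choice to flip exactly one phrase per negative explanation is an assumption rather than a documented detail of our procedure.

```python
import random

# Hypothetical vocabularies, used only for this illustration.
COLORS = ["red", "black", "yellow", "white", "orange", "blue", "brown"]
SIZES = ["small", "large", "tiny", "long", "short"]


def flip_attribute(phrase, rng):
    """Swap the first color or size word in `phrase` for a different one."""
    words = phrase.split()
    for i, word in enumerate(words):
        for vocab in (COLORS, SIZES):
            if word in vocab:
                words[i] = rng.choice([w for w in vocab if w != word])
                return " ".join(words)
    return phrase  # no color or size word, so nothing to flip


def make_negative(phrases, rng):
    """Copy `phrases` and flip one randomly chosen flippable phrase."""
    flippable = [i for i, p in enumerate(phrases)
                 if any(w in COLORS or w in SIZES for w in p.split())]
    negative = list(phrases)
    if flippable:
        idx = rng.choice(flippable)
        negative[idx] = flip_attribute(phrases[idx], rng)
    return negative


rng = random.Random(0)  # seeded for reproducibility
negative = make_negative(["yellow belly", "red head"], rng)
```

Because only one phrase changes, the negative explanation stays close to the positive one, which is the property that makes it a hard example for the critic.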
Ranking Explanations. After generating a set of candidate explanations and computing an explanation score for each with our explanation critic, we choose the best explanation based on these scores. In practice, we find it is important to rank sentences based on both the relevance score and a fluency score (defined as the ). Including the fluency score is important because the explanation scorer will rank “This is a bird with a long neck, long neck, and red beak” high (if a long neck and red beak are present) even though mentioning “long neck” twice is clearly ungrammatical.
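To make the ranking step concrete, here is a minimal sketch that orders candidates by a combination of the two scores. The weighted sum, the parameter `fluency_weight`, the dictionary layout, and all numeric scores below are assumptions for illustration; the text above does not fix how the two scores are combined.

```python
def rank_explanations(candidates, fluency_weight=1.0):
    """Return candidates sorted best-first by relevance + fluency_weight * fluency.

    Each candidate is a dict with keys "sentence", "relevance" and "fluency".
    The weighted sum is an assumed combination rule.
    """
    def combined(candidate):
        return candidate["relevance"] + fluency_weight * candidate["fluency"]

    return sorted(candidates, key=combined, reverse=True)


# Made-up scores: the repetitive sentence is slightly more "relevant"
# but much less fluent.
candidates = [
    {"sentence": "This is a bird with a long neck, long neck, and red beak.",
     "relevance": 0.9, "fluency": -3.0},
    {"sentence": "This is a bird with a long neck and a red beak.",
     "relevance": 0.8, "fluency": -1.0},
]
best = rank_explanations(candidates)[0]
```

With `fluency_weight=0.0` the repetitive sentence would be ranked first, which illustrates why relevance alone is not sufficient.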
Grounding Visual Features. The framework for grounding visual features involves three steps: generating visual explanations, factorizing each sentence into smaller chunks, and localizing each chunk with a grounding model. Visual explanations are produced using the model of [2]. For each explanation, we extract a list of attribute phrases using a rule-based attribute phrase chunker.
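A rule-based chunker of this kind could look roughly like the sketch below. It is not our actual chunker: the part vocabulary `PARTS`, the stop-word list `STOP`, and the rule "collect the modifiers that directly precede a known part noun" are all assumptions chosen to keep the example short.

```python
import re

# Hypothetical vocabularies for illustration only.
PARTS = {"bird", "beak", "bill", "belly", "head", "face", "wing", "wings",
         "neck", "eye", "crown", "breast", "tail"}
STOP = {"a", "an", "the", "and", "has", "with", "is", "this", "that", "of"}


def chunk_attribute_phrases(sentence):
    """Extract "<modifiers> <part>" phrases, e.g. "red beak", from a sentence."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    phrases = []
    modifiers = []
    for token in tokens:
        if token in PARTS:
            # Emit a phrase only if the part noun has at least one modifier.
            if modifiers:
                phrases.append(" ".join(modifiers + [token]))
            modifiers = []
        elif token in STOP:
            modifiers = []  # a stop word ends any pending modifier run
        else:
            modifiers.append(token)
    return phrases


phrases = chunk_attribute_phrases("The red bird has a red beak and a black face")
# phrases == ["red bird", "red beak", "black face"]
```

Note that a repeated phrase such as “long neck, long neck” is extracted twice, so repetition has to be penalized elsewhere, for example through the fluency score.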
Once we have extracted the attribute phrases, we ground each of them to a visual region in the original image using the baseline model proposed in [3], trained on the Visual Genome dataset [4]. For a given attribute phrase, the grounding model localizes the phrase into an image region, returning a bounding box and a score of how likely it is that the returned bounding box matches the phrase. The attribute phrase, the corresponding region, and the region score form an attribute phrase grounding, which is used as an input to our phrase-critic.
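The attribute phrase grounding can be pictured as a small record holding the three items named above. The sketch below is illustrative: the class name, the box convention `(x_min, y_min, x_max, y_max)`, and the `grounder` callable (which is assumed to already have the image bound to it) are hypothetical, and `dummy_grounder` merely stands in for a real grounding model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Assumed box convention: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class AttributePhraseGrounding:
    phrase: str   # e.g. "black beak"
    box: Box      # region returned by the grounding model
    score: float  # how likely the region matches the phrase


def ground_phrases(
    phrases: List[str],
    grounder: Callable[[str], Tuple[Box, float]],
) -> List[AttributePhraseGrounding]:
    """Run the grounding model on every phrase and collect the triplets."""
    groundings = []
    for phrase in phrases:
        box, score = grounder(phrase)
        groundings.append(AttributePhraseGrounding(phrase, box, score))
    return groundings


def dummy_grounder(phrase: str) -> Tuple[Box, float]:
    """Stand-in for a real grounding model: fixed box, fake score."""
    return (0.0, 0.0, 10.0, 10.0), 1.0 / (1 + len(phrase))


groundings = ground_phrases(["red beak", "black face"], dummy_grounder)
```

The resulting list is what the phrase-critic would consume; in a real system the bounding box would be replaced or accompanied by visual features extracted from that region.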
Whereas visual descriptions are encouraged to discuss attributes which are relevant to a specific class, the grounding model is only trained to determine whether a natural language phrase is in an image. Being discriminative rather than generative, the critic model does not have to learn to generate fluent, grammatically correct sentences, and can thus focus on checking whether the mentioned attribute phrases are image-relevant. Consequently, the models are complementary, allowing one model to catch the mistakes of the other.
For our experiments, we use the Caltech UCSD Birds 200-2011 (CUB) dataset [9] and the sentences collected by [7]. We first compare our proposed model with the baseline visual explanation model of [2]. We present results in Figure 2. As a general observation, our critic model (1) grounds attribute phrases both in the image and in the sentence, (2) favors accurate and class-specific attribute phrases, and (3) provides the cumulative score of each explanatory sentence. To further emphasize the importance of grounding attribute phrases in the image and in the sentence when evaluating the accuracy of the visual explanation model, let us examine Figure 2 (left) more closely. We note that the baseline model mentions an “orange beak”. However, the Pigeon Guillemot in the image actually has a black beak, which is properly localized using our proposed method.
Additionally, we compare our phrase-critic ranking method to a ranking method based solely on sentence fluency. We sample 100 random images from the test set and find that the attributes mentioned by our critic model reflect the image more accurately than those mentioned by the baseline (85% image-relevant attributes vs. 79%).
Figure 2 shows sampled sentences and their corresponding scores. We see a precise localization of small regions such as “white eye” for “Brewer Blackbird”. Note that the highest-ranked explanations for “Brewer Blackbird” both correctly mention “white eye”, which is the strongest property distinguishing this bird from other black birds. The third sentence is ranked lower, likely because the phrase “lighter colored eye-coloring” lacks fluency. Additionally, explanations preferred by our phrase-critic model do not blindly mention class-specific attributes that do not appear in the image.
- (1) O. Biran and K. McKeown. Justification narratives for individual classifications. In Proceedings of the AutoML workshop at ICML, volume 2014, 2014.
- (2) L. A. Hendricks, Z. Akata, M. Rohrbach, J. Donahue, B. Schiele, and T. Darrell. Generating visual explanations. In ECCV, 2016.
- (3) R. Hu, M. Rohrbach, J. Andreas, T. Darrell, and K. Saenko. Modeling relationships in referential expressions with compositional modular networks. In CVPR, 2017.
- (4) R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. arXiv preprint arXiv:1602.07332, 2016.
- (5) D. Parikh and K. Grauman. Relative attributes. In ICCV, 2011.
- (6) D. H. Park, L. A. Hendricks, Z. Akata, B. Schiele, T. Darrell, and M. Rohrbach. Attentive explanations: Justifying decisions and pointing to the evidence. arXiv preprint arXiv:1612.04757, 2016.
- (7) S. Reed, Z. Akata, H. Lee, and B. Schiele. Learning deep representations of fine-grained visual descriptions. In CVPR, 2016.
- (8) R. L. Teach and E. H. Shortliffe. An analysis of physician attitudes regarding computer-based clinical consultation systems. Computers and Biomedical Research, 14(6):542–558, 1981.
- (9) P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. Technical report, California Institute of Technology, 2010.