Neural Network Interpretation via Fine Grained Textual Summarization

05/23/2018 ∙ by Pei Guo, et al. ∙ 2

Current visualization-based network interpretation methods suffer from a lack of semantic-level information. In this paper, we introduce the novel task of interpreting classification models using fine-grained textual summarization. Along with the label prediction, the network generates a sentence explaining its decision. Constructing a fully annotated dataset of filter-text pairs is unrealistic because of the complexity of the image-to-filter response function. We instead propose a weakly supervised learning algorithm that leverages off-the-shelf image caption annotations. Central to our algorithm is the filter-level attribute probability density function (PDF), learned as a conditional probability through Bayesian inference with the input image and its feature map as latent variables. We show that our algorithm faithfully reflects the features learned by the model through applications such as attribute-based image retrieval and unsupervised text grounding. We further show that the textual summarization process can help in understanding network failure patterns and can provide clues for further improvements.





Given a convolutional network, we are interested in knowing what features it has learned for making classification decisions. Despite their tremendous success on various computer vision tasks [Krizhevsky, Sutskever, and Hinton2012, Simonyan and Zisserman2015, He et al.2016], deep neural network models are still commonly viewed as black boxes. The difficulty of understanding a neural network lies mainly in the end-to-end learning of the feature extractor sub-network and the classifier sub-network, which often contain millions of parameters. Debugging an over-confident network, which assigns the wrong class label to an image with high probability, can be extremely difficult. This is also true when adversarial noise [Goodfellow, Shlens, and Szegedy2014] is added to deliberately guide the network to a wrong conclusion. It is therefore desirable to have some textual output explaining which features were responsible for triggering the error, just as a compiler reports a syntax error in code. Network interpretation is also crucial for tasks involving humans, like autonomous driving and medical image analysis. It is therefore important to distill the knowledge learned by deep models and represent it in an easy-to-understand way.

Figure 1: Comparison of visualization-based interpretation [Zhou et al.2016] and interpretation by textual-summarization (the proposed approach). The latter has more semantic details useful for analyzing incorrect predictions.

Fine-grained recognition concerns the problem of discriminating between visually similar sub-categories, like different species of gulls or different versions of BMW 3 cars. Humans are usually good at tasks like attribute prediction, keypoint annotation, and image captioning, but we usually find fine-grained recognition to be extremely hard without proper training. Network interpretation is therefore useful for fine-grained recognition, both to find the network’s failure patterns and to teach humans which features the network considers informative. It is worth noting that the proposed algorithm is not constrained to fine-grained recognition. It can equally be applied to general image datasets, given accurate image-level attribute annotations.

Current network interpretation methods are largely visualization-based. Algorithms like CAM [Zhou et al.2016] and Grad-CAM [Selvaraju et al.2017] work by highlighting a region in the image that is important for decision making. However, we show in Figure 1 that visualization is often ineffective at localizing discriminative parts or providing semantic information for tasks like fine-grained recognition. Humans, on the other hand, can justify their conclusions using natural language. For instance, a knowledgeable person looking at a photograph of a bird might say, “I think this is an Anna’s Hummingbird because it has a straight bill, a rose pink throat and crown. It’s not a Broad-tailed Hummingbird because the latter lacks the red crown.” This kind of textual description carries rich semantic information and is easily understandable. Natural language is a logical medium in which to ground the interpretation of deep convolutional models.

In this paper, we propose the novel task of summarizing the decision-making process of deep convolutional models using fine-grained textual descriptions. See Figure 3 for an example. To be specific, we aim to find a list of visual attributes that the network bases its decision on. Our algorithm depends only on image-level visual attributes, which can be obtained through ground truth annotation or image caption decomposition. Central to our algorithm is a method to associate each final convolutional layer filter with the visual attributes that represent its activating patterns. We discover that in our experiments the model filters do not necessarily activate on a narrow beam of patterns. We therefore formulate the relationship of each filter-attribute pair as a conditional multinomial probability distribution. An attribute is more likely to represent a filter if images with this attribute better activate this filter. Attributes are not directly involved in the network, so we introduce images as hidden variables. Based on the filter attribute probability density function (p.d.f.), we rank the attributes by re-weighting each filter attribute p.d.f. with class-specific weights, a step similar to CAM [Zhou et al.2016] or Grad-CAM [Selvaraju et al.2017]. Our final textual explanation is a template sentence with the top attributes as supporting evidence.

In order to demonstrate the accuracy of the proposed algorithm, we devise two by-product tasks as sanity checks. The first task is visual attribute grounding. Specifically, we localize the sub-region in an image that is related to a query attribute. Note that this task is weakly supervised only by category labels. This is achieved by a linear combination of the final layer feature maps according to the filter attribute p.d.f. The second task is visual attribute based image retrieval. For a query attribute, we obtain a list of candidate images containing this attribute. Decent results on these two tasks serve as a strong indicator that the core algorithm works properly.

A direct application of the proposed textual explanation algorithm is to work as a network debugger and generate error messages when the network prediction is wrong. We summarize three major failure patterns for the fine-grained dataset CUB-200-2011. The first and most common failure pattern is that the network fails to identify the truly discriminative features. The network is confused by small inter-class variation and large intra-class variation. The second failure pattern is that the network is not robust to image perturbations, such as color distortion and low image quality. The last failure pattern is caused by incorrect human labels. Roughly 4% of the annotations in CUB are erroneous.

Our main contributions in this paper can be summarized as follows:

  • We propose the novel task of network interpretation using textual summarization. We identify filter attribute p.d.f. as our core problem and propose a Bayesian inference framework to learn it.

  • We devise two tasks for automatic quantitative evaluation of the learned p.d.f.s, demonstrating the accuracy of the proposed algorithm.

  • We employ the proposed framework for network debugging in fine-grained recognition and unveil common failure patterns useful for further improvement.

Figure 2: (A) The filter-attribute association is to find a mapping from the abstract functional space to the semantic attribute space. (B) The composite filter function can be coordinate transformed to be defined on the attribute space. (C) Top activation images of a filter (a yellow head detector).
Figure 3: Textual explanations generated by our method

Related Works

Network Interpretation There are two main approaches to network interpretation in the literature: filter-level interpretation [Erhan et al.2009, Szegedy et al.2013, Mahendran and Vedaldi2015, Nguyen, Yosinski, and Clune2015, Google, Nguyen et al.2016, Nguyen et al.2017, Bau et al.2017, Zhou et al.2014, Yosinski et al.2015, Springenberg et al.2014, Zeiler and Fergus2013] and holistic-level interpretation [Simonyan, Vedaldi, and Zisserman2013, Zhou et al.2016, Selvaraju et al.2017]. The goal of filter-level interpretation is to understand and visualize the features that specific neurons learn. While it is easy to directly visualize the first convolutional layer filter weights to get a sense of the patterns they detect, it makes little sense to directly visualize deeper layer filter weights because they act as complex composite functions of lower layers’ operations. Early examples of filter-level understanding include finding the maximally activated input patches [Zeiler and Fergus2013] and visualizing the guided back-propagation gradients [Springenberg et al.2014]. Some works [Nguyen, Yosinski, and Clune2015] try to synthesize a visually pleasing preferred input image for each neuron through back-propagation into the image space. [Nguyen et al.2016] applies a generator network to generate images conditioned on maximally activating certain last-layer neurons. The Plug and Play paper [Nguyen et al.2017] further extends [Nguyen et al.2016] to introduce a generalized adversarial learning framework for filter-guided image generation. Network dissection [Bau et al.2017] measures the interpretability of each neuron by annotating it with predefined attributes like color, texture, part, etc. [Zhang et al.2017] proposes representing the image content and structure by a knowledge graph.

Attempts at holistic summarization mainly focus on visualizing important image subregions by re-weighting final convolutional layer feature maps. Examples include CAM [Zhou et al.2016] and Grad-CAM [Selvaraju et al.2017]. However, the visualization based method only provides coarse-level information, and it remains hard to intuitively know what feature or pattern the network has learned to detect. More importantly, the holistic heat map representation is sometimes insufficient to justify why the network favors certain classes over others when the attentional maps for different classes overlap heavily. See Figure 1 for example.

VQA and Image Caption

Other tasks that combine text generation and visual explanation include image captioning and visual question answering (VQA). Although it sounds like a similar task, image captioning [Farhadi et al.2010] is fundamentally different from ours. Image captioning is usually done in a fully supervised manner, with the goal of generating a caption that describes the general content of an image. Our textual interpretation task aims to faithfully reflect the knowledge learned by a classification model in an unsupervised way. Visual question answering [Antol et al.2015] is a task that requires understanding the image and answering textual questions. Our task can be viewed as a special case of unsupervised VQA that focuses more specifically on questions such as: “Why does the model think the image belongs to class X?” Text grounding [Rohrbach et al.2015] is a language-to-vision task that tries to locate the object in an image referred to by a given text phrase. We note that [Hendricks et al.2016] defines a task similar to ours, to explain and justify a classification model. Their model is learned in a supervised manner, with explanations generated from an LSTM network which only implicitly depends on the internal feature maps. It is essentially an image captioning task that generates captions with more class-discriminative information. Our method is unsupervised and does not rely on another black-box network to generate descriptions.

Fine Grained Recognition

Fine-grained recognition aims to discriminate between subcategories such as species of birds and dogs, or different makes and models of cars and aircraft. The difficulty of fine-grained recognition lies in the extremely large intra-class variance and small inter-class variance. Representative works include Bilinear Pooling [Lin, RoyChowdhury, and Maji2015], which computes the outer product of the final layer feature maps. Attention-based models [Sermanet, Frome, and Real2015] work by focusing attention on discriminative parts of an image object. Part-based models [Zhang et al.2014] work by decomposing the image into part features that can be readily compared. Fine-grained recognition is special because models usually perform better than non-expert humans. It is therefore interesting to unveil the knowledge a model learns for decision making.

Bayesian Inference Framework

As a fundamental step toward network interpretation, we are interested in representing each network filter by its characteristic activation patterns, expressed in terms of visual attributes. Constructing a paired filter-attribute dataset is unrealistic, because the filter (as a composite function) is not a well-defined concept with concrete examples. Instead, we propose leveraging off-the-shelf image attribute annotations because they contain rich textual references to visual concepts. The intuition behind our filter-attribute association is simple: the model filters can be represented by the images that strongly activate them. The corresponding image attributes should have a high probability of representing an activated filter. The joint consensus of all textual attributes from the whole dataset can serve as a good indicator of the filter pattern, provided the network is properly trained.

More formally, the composite filter function takes an image as input and produces a feature map whose strength indicates the existence of certain patterns. The filter interpretation task aims to find a mapping from the abstract functional space to the semantic visual attribute space (Figure 2). With the help of image-level attributes, we consider a coordinate transformation that changes the input of the filter function from images to image attributes. The filter-attribute relationship follows a multinomial distribution and can be approximated by the attribute probability density function, which is a key component of the proposed algorithm.

Filter Attribute Probability Density Function

We denote $F = \{f_i\}$ as the group of model filters. In this paper, we are only interested in the final convolutional layer filters, as they are the input to the fully connected layer. We denote $X = \{x_j\}$ as the set of input images. The output of filter $f$ on image $x$ is naturally written as $f(x)$, which we call a feature map or filter activation. We consider models [He et al.2017, Huang et al.2017] with a global pooling layer and one fully connected layer. The fully connected layer produces class-label predictions with the weight matrix $W$. A list of textual attributes is attached to each image. We loosely write $a \in x$ if attribute $a$ is contained in image $x$’s attribute list.

We propose a Bayesian inference framework to learn the probability that a visual attribute represents a filter’s pattern. We call $p(a \mid f)$ the filter attribute probability density function (p.d.f.), and formulate it as a posterior probability:

$$p(a \mid f) \propto p(f \mid a)\, p(a)$$

$p(a)$ is the prior probability for visual attribute $a$. We consider the relative importance of attributes because they carry different amounts of information. For example, “small bird” carries less information than “orange beak” because the latter appears less often in the text corpus and corresponds to a more important image feature. We employ the normalized TF/IDF feature as the attribute prior.

$p(f \mid a)$ measures the likelihood of attribute $a$ activating filter $f$. As attributes are not directly involved in the neural network, we introduce input images as latent variables:

$$p(f \mid a) = \sum_{x \in X} p(f \mid x, a)\, p(x \mid a)$$

$p(f \mid x, a)$ measures the likelihood that image $x$ and attribute $a$ are the reason for filter $f$’s activation. We assume $f$ is conditionally independent of $a$ given $x$:

$$p(f \mid x, a) \approx p(f \mid x) = Z\big(G(f(x))\big)$$

where $Z$ is the normalization function and $G(f(x))$ is the global pooling layer output. The strength of the feature map measures how likely an image is to activate a filter. This approximation neglects the fact that when an image activates a filter, the feature map favors certain attributes over others. For example, if the feature map highlights the head area of a bird, attributes related to “head”, “beak” or “eyes” should be assigned higher probabilities than attributes related to “wings” and “feet”. The naive approximation instead assigns equal probability to every visual attribute of the image. It nevertheless works decently, as the joint consensus of all input images highlights the true attributes and suppresses false ones. One way to associate the spatial distribution of the feature map with the corresponding visual attributes is to exploit other forms of annotation, like keypoints or part segmentation. If the feature map overlaps highly with a certain part segmentation, a higher probability is assigned to the corresponding visual attributes. This approach depends on additional forms of human annotation and hinders the generalization of the proposed algorithm, so it is not used in this paper.

$p(x \mid a)$ measures the likelihood that $a$ is an attribute of image $x$. It takes the value 1 when $a$ is in the attribute list of $x$ and 0 otherwise:

$$p(x \mid a) = \begin{cases} 1 & \text{if } a \in x \\ 0 & \text{otherwise} \end{cases}$$
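The Bayesian update described in this subsection can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `filter_attribute_pdf`, the dictionary-based inputs, and the use of sum-to-one normalization for the otherwise unspecified normalization function are all hypothetical choices.

```python
from collections import defaultdict


def filter_attribute_pdf(pooled, image_attrs, prior):
    """Estimate p(attribute | filter) for a single filter.

    pooled:      {image_id: globally pooled, non-negative activation of the filter}
    image_attrs: {image_id: set of attribute phrases attached to that image}
    prior:       {attribute: prior weight, e.g. a normalized TF-IDF score}
    """
    total = sum(pooled.values())
    if total == 0:
        return {}
    # p(filter | image): pooled activation normalized over images
    # (sum-to-one normalization is an assumption).
    p_f_given_x = {x: v / total for x, v in pooled.items()}

    # Marginalize over images. The 0/1 indicator means only images that
    # list the attribute contribute to its likelihood.
    likelihood = defaultdict(float)
    for x, attrs in image_attrs.items():
        for a in attrs:
            likelihood[a] += p_f_given_x.get(x, 0.0)

    # Posterior is proportional to likelihood times prior; renormalize.
    unnorm = {a: likelihood[a] * prior.get(a, 0.0) for a in likelihood}
    z = sum(unnorm.values())
    return {a: v / z for a, v in unnorm.items()} if z > 0 else unnorm
```

In this toy form, an attribute that appears on every image (such as "small bird") accumulates a large likelihood, and it is the prior that keeps it from dominating rarer, more informative attributes.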


Aggregating Filter Attribute p.d.f.s for Holistic Description

With the help of the filter attribute p.d.f., we can determine what features the network has learned for image classification. This problem can be formulated as the probability of visual attributes given that the network produces a certain class label for a certain input image. We introduce the final convolutional layer filters as hidden variables here:

$$p(a \mid x, c) = \sum_{f} p(a \mid f)\, p(f \mid x, c)$$

where $a$ is a visual attribute, $x$ the input image, $c$ the predicted class, and $f$ a final-layer filter. $p(f \mid x, c)$ is the probability that filter $f$ is the reason that the network predicts $x$ as class $c$. We assume $a$ is conditionally independent of $x$ and $c$ given $f$. $p(a \mid f)$ is the filter attribute p.d.f. $p(f \mid x, c)$ measures the importance of a filter in the decision-making process:

$$p(f \mid x, c) = Z\big(w_{f,c}\, G(f(x))\big)$$

where $Z$ is the normalization function, $w_{f,c}$ is the weight from the classifier weight matrix connecting filter $f$ to class prediction $c$, and $G(f(x))$ is the global pooling layer output. We call $p(a \mid x, c)$ the image-class attribute p.d.f.
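The aggregation step can be illustrated as follows. This is a hedged sketch rather than the paper's code: the name `image_class_attribute_pdf` is hypothetical, and two details the text leaves open are filled with assumptions, namely sum-to-one normalization and clipping negative filter contributions to zero before normalizing.

```python
def image_class_attribute_pdf(pooled, class_weights, filter_pdfs):
    """Combine per-filter attribute distributions into one distribution
    for a given image and predicted class.

    pooled:        {filter_id: globally pooled activation for this image}
    class_weights: {filter_id: classifier weight from the filter to the class}
    filter_pdfs:   {filter_id: {attribute: p(attribute | filter)}}
    """
    # Filter importance: class weight times pooled activation.
    # Clipping negatives to zero is an assumption made for this sketch.
    importance = {f: max(pooled[f] * class_weights[f], 0.0) for f in pooled}
    z = sum(importance.values())
    if z == 0:
        return {}
    scores = {}
    for f, w in importance.items():
        for attribute, p in filter_pdfs[f].items():
            scores[attribute] = scores.get(attribute, 0.0) + (w / z) * p
    return scores
```

Because each per-filter distribution sums to one and the importances are normalized, the returned scores also sum to one.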

We generate a natural sentence to describe the network decision-making process using the image-class attribute p.d.f. Although it is popular to employ a recurrent model for sentence generation, our task is to faithfully reflect the internal features learned by the network, and introducing another network could result in more uncertainty. We instead propose a simple template-based method, which has the following form:

“This is a {class name} because it has {attribute 1}, {attribute 2}, …, and {attribute n}.”

We consider only the top 5 attributes to make the sentence shorter and more precise. Steps are taken to merge adjectives related to the same nouns.
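The template step can be sketched as below. The function name `explain` and its signature are illustrative assumptions, and the adjective-merging step mentioned in the text is deliberately left out because the source does not describe how it is done.

```python
def explain(class_name, attribute_pdf, top_k=5):
    """Fill the explanation template with the top-ranked attributes.

    attribute_pdf: {attribute: probability} for one image and class.
    """
    top = sorted(attribute_pdf, key=attribute_pdf.get, reverse=True)[:top_k]
    if not top:
        return f"This is a {class_name}."
    if len(top) == 1:
        listing = top[0]
    elif len(top) == 2:
        listing = f"{top[0]} and {top[1]}"
    else:
        listing = ", ".join(top[:-1]) + f", and {top[-1]}"
    return f"This is a {class_name} because it has {listing}."
```

For example, `explain("Yellow Warbler", {"yellow crown": 0.5, "yellow head": 0.3, "small bird": 0.2}, top_k=2)` yields "This is a Yellow Warbler because it has yellow crown and yellow head."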

Another important aspect of model interpretation is to compare the reasons behind certain choices as opposed to others, i.e. why the network thinks the input belongs to class $c_1$ instead of class $c_2$. We can easily summarize the relation and the difference between two predictions by comparing their image-class attribute p.d.f.s. For example, while both birds have long beaks, class $c_1$ favors a green crown while class $c_2$ tends to have a blue crown. An example is shown in Figure 1.

Explain-Away Mechanism

The filter attribute p.d.f. follows a multinomial distribution. A filter does not necessarily activate on only one narrow beam of features. Instead, it may behave like a multi-modal Gaussian distribution that activates on several totally different features. For example, a single filter may detect both “blue head” and “black head” with high probability. The interpretability of the filter could suffer from this multi-modal characteristic. This is especially true for the image description task because it becomes hard to know exactly which feature activates the filter.

Figure 4: Example of caption annotations on CUB. The extracted visual attributes are highlighted.

However, we observe that other filters can act in a complementary way to help explain away the probability of unrelated patterns. For instance, there could be another filter that activates for “blue head”, but not for “black head”. If both filters activate, then the probability of “blue head” is high. If only the first filter activates, then “black head” is the more probable pattern. The joint consensus of all the filters makes the generated explanation reasonable.
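A small numerical example may make the explain-away effect concrete. The two filter distributions and the helper `mix` below are invented for illustration only; the mixture uses activation strengths as weights, which is a simplification of the aggregation used for explanations.

```python
def mix(filter_pdfs, activations):
    """Activation-weighted mixture of per-filter attribute distributions."""
    total = sum(activations.values())
    out = {}
    for f, act in activations.items():
        for attribute, p in filter_pdfs[f].items():
            out[attribute] = out.get(attribute, 0.0) + (act / total) * p
    return out


# Hypothetical filters: one is multi-modal, the other responds to blue heads only.
filter_pdfs = {
    "multi_modal": {"blue head": 0.45, "black head": 0.55},
    "blue_only": {"blue head": 0.90, "long beak": 0.10},
}

# Both filters fire: "blue head" dominates.
both = mix(filter_pdfs, {"multi_modal": 1.0, "blue_only": 1.0})
# Only the multi-modal filter fires: "black head" is more probable.
alone = mix(filter_pdfs, {"multi_modal": 1.0, "blue_only": 0.0})
```

When the second filter is silent, it contributes no mass to "blue head", so the ambiguity of the first filter resolves toward "black head".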

Class Level Description Given the filter attribute p.d.f.s, we are interested in knowing which features are important for each class. This task can be formulated as the probability of a visual attribute $a$ given that the network predicts an image as class $c$:

$$p(a \mid c) = \sum_{f} p(a \mid f)\, p(f \mid c)$$

where $p(a \mid f)$ is the filter attribute p.d.f. of filter $f$, and we again assume $a$ is conditionally independent of $c$ given $f$. $p(f \mid c)$ measures the importance of a filter for class $c$ and is simply:

$$p(f \mid c) = Z(w_{f,c})$$

where $Z$ is the normalization function and $w_{f,c}$ is the weight from the classifier weight matrix connecting filter $f$ to class prediction $c$.

Unlike the image-class attribute p.d.f., the class-level description weights attributes based only on the classifier weights and the filter attribute p.d.f.s. For difficult tasks like fine-grained recognition, deep models often perform better than non-expert users. The knowledge distilled from class-level descriptions could potentially be used to teach users how to discriminate in challenging domains.

Applications for Textual Summarization

In order to demonstrate the accuracy of the learned p.d.f.s, we devise two by-product tasks, namely visual attribute grounding and attribute-based image retrieval, as sanity checks. Success in these tasks would serve as a strong indicator of the effectiveness of our method. One direct application of the proposed textual summarization algorithm is to understand the network’s failure patterns and provide suggestions for future improvement. We validate the proposed tasks with experiments and provide qualitative and quantitative analysis in the experiment section. Other potential applications are left to future work.

Figure 5: Unnormalized filter attribute p.d.f.s along with the top activation images.

Visual Attribute Grounding Given a query visual attribute, we would like to know which image region it refers to. We show how the filter attribute p.d.f. can help with this task. Suppose $a$ is a visual attribute associated with image $x$. We formulate the image region of interest (ROI) as a linear combination of the final convolutional layer feature maps:

$$\mathrm{ROI}(a, x) = Z\Big(\sum_{f} w_f\, f(x)\Big)$$

where $Z$ is the normalization function, $f(x)$ is the feature map of filter $f$, and $w_f = p(a \mid f)$ is the filter attribute p.d.f. of filter $f$ evaluated at $a$. Intuitively, we re-weight the filter responses according to the filter attribute p.d.f. This task is weakly supervised by image labels, with no ground truth ROI-phrase pairs to learn from. If the algorithm fails to learn an accurate filter attribute p.d.f., we would expect the grounded region to be less accurate.
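The re-weighting of feature maps can be sketched with plain nested lists. The names `ground_attribute` and `argmax_location` are hypothetical, and scaling the heatmap by its maximum is an assumed choice for the normalization function, which the text does not pin down.

```python
def ground_attribute(feature_maps, filter_pdfs, attribute):
    """Weighted sum of final-layer feature maps for one image.

    feature_maps: {filter_id: H x W nested list of activations}
    filter_pdfs:  {filter_id: {attribute: p(attribute | filter)}}
    Returns an H x W heatmap scaled to [0, 1] (max scaling is an assumption).
    """
    first = next(iter(feature_maps.values()))
    height, width = len(first), len(first[0])
    heat = [[0.0] * width for _ in range(height)]
    for f, fmap in feature_maps.items():
        # Filters that never represent the attribute contribute nothing.
        weight = filter_pdfs.get(f, {}).get(attribute, 0.0)
        for i in range(height):
            for j in range(width):
                heat[i][j] += weight * fmap[i][j]
    peak = max(max(row) for row in heat)
    if peak > 0:
        heat = [[v / peak for v in row] for row in heat]
    return heat


def argmax_location(heat):
    """Row and column index of the strongest heatmap response."""
    cells = ((i, j) for i in range(len(heat)) for j in range(len(heat[0])))
    return max(cells, key=lambda ij: heat[ij[0]][ij[1]])
```

The location returned by `argmax_location` is the kind of point estimate that can be compared against a keypoint annotation when evaluating grounding quality.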

Attribute Based Image Retrieval We would like to be able to search a database of images using textual attribute queries and return the images that match. For example, we would like to find all images of birds with “white head” and “black throat”. The image-class attribute p.d.f. provides a simple method to rank images based on the probability of containing the desired attributes. Given a query visual attribute, we simply return all the images that contain the query in the generated textual explanation sentence.

Network Debugging When the network produces a wrong prediction, we would like to understand the reason. For the fine-grained dataset CUB-200-2011, we generate textual summarizations for all failure cases to explain why the network favors the wrong prediction over the ground truth. We unveil common failure patterns of the network that are helpful for network improvement.
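The retrieval rule above can be sketched as follows. The function name `retrieve` and the input layout are assumptions for illustration; each image is represented by its image-class attribute distribution, and an image matches when the query is among the top attributes that would appear in its explanation sentence.

```python
def retrieve(query, image_pdfs, top_k=5):
    """Return ids of images whose top-k explanation attributes include the
    query, ranked by the probability assigned to the query.

    image_pdfs: {image_id: {attribute: probability}}
    """
    hits = []
    for image_id, pdf in image_pdfs.items():
        top = sorted(pdf, key=pdf.get, reverse=True)[:top_k]
        if query in top:
            hits.append((pdf[query], image_id))
    # Highest probability first.
    return [image_id for _, image_id in sorted(hits, reverse=True)]
```

A multi-attribute query such as "white head" and "black throat" could be handled by intersecting the result lists of the individual queries, although the source does not spell out this case.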


We demonstrate the effectiveness of our algorithm on the fine-grained dataset CUB-200-2011 [Wah et al.2011] with 5994 training images and 5794 testing images. Image-level visual attributes can be obtained directly from binary attribute annotations or by decomposing image captions [Reed et al.2016]. We choose the second route because the image captions contain rich and diverse visual attributes better suited for our purpose. One example is shown in Figure 4. We use as our convolutional model a ResNet-50 which is trained on ImageNet [Deng et al.2009] and fine-tuned on CUB. We use bounding-box cropped images to reduce background noise.

Figure 6: Attribute Based Image Retrieval. Each row shows an attribute query on the left, followed by the top-ranked results, in terms of probability that the image contains the query attributes.

Visual Attribute Extraction

We first extract visual attributes from the image captions. We follow the pipeline of word tokenization, part-of-speech tagging, and noun-phrase chunking. For simplicity, we only consider adjective-noun type attributes with no recursion. We end up with 9649 independent attributes. The Term Frequency (TF) of phrase $a$ is computed as the number of occurrences of $a$ in the same captioning file. For CUB, each image has a caption file with 5 different captions. The Inverse Document Frequency (IDF) is $\log(N / N_a)$, where $N$ is the total number of files and $N_a$ is the number of files containing phrase $a$.
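The TF and IDF counts described above can be computed with the standard library alone. This sketch assumes the attribute phrases have already been extracted from each caption file, and it assumes the standard logarithmic form of IDF, log(N / n), where N is the number of files and n the number of files containing the phrase. The function name `tf_idf` is illustrative.

```python
import math
from collections import Counter


def tf_idf(files):
    """Compute TF-IDF scores per caption file.

    files: {file_id: list of attribute phrases extracted from its captions}
    Returns {file_id: {phrase: tf * idf}}.
    """
    n_files = len(files)
    # Document frequency: number of files in which each phrase occurs.
    doc_freq = Counter()
    for phrases in files.values():
        doc_freq.update(set(phrases))

    scores = {}
    for file_id, phrases in files.items():
        term_freq = Counter(phrases)
        scores[file_id] = {
            phrase: count * math.log(n_files / doc_freq[phrase])
            for phrase, count in term_freq.items()
        }
    return scores
```

A phrase that occurs in every file, such as a generic "small bird", receives a score of zero, which matches the intuition that common attributes should carry little prior weight.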

Filter Attribute P.D.F. and Textual Summarization

We show examples of filter attribute p.d.f.s in Figure 5. We see a clear connection between the top activated images and the top ranked attributes. This validates our idea of using textual visual attributes to represent the filter pattern. We show examples of generated textual explanations for image classification in Figure 3. We can see that the generated explanations capture the class-discriminative information present in the images.

Visual Attribute Grounding In Figure 7, we show examples of query attributes and the generated raw heatmaps indicating what part of the image the visual attribute refers to. Each column denotes a different visual attribute and the heatmap (shown as a transparency map) indicates the region of highest activation within the image. We can see qualitatively that the proposed approach is reasonably good at highlighting regions of interest.

As the visual attributes are highly correlated with keypoints, we quantitatively measure the performance of the proposed visual attribute grounding algorithm by comparing the location of the maximum value of the generated heatmap with ground truth keypoint locations. We present the PCK (percentage of correct keypoints) score for the top 50 most frequent visual attributes with corresponding keypoint annotations. The $\alpha$ in PCK@$\alpha$ is the distance threshold: a prediction counts as correct when it lies within distance $\alpha$ of the ground-truth keypoint. Note that visual attribute grounding is neither supervised by keypoints nor optimized for keypoint detection. We compare with two baseline methods. One is to randomly assign a location for the attribute. The other baseline is similar to the proposed method except that the filter attribute p.d.f. is constant for all attributes. We show in Table 1 that our learned p.d.f. performs better than both baseline methods, demonstrating the accuracy of the proposed algorithm.

Method             PCK@0.1   PCK@0.2   PCK@0.3
Random             3.1%      12.6%     28.3%
Constant p.d.f.    8.5%      28.1%     47.5%
Proposed           12.2%     38.7%     60.9%
Table 1: PCK@α for attribute grounding
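For reference, a PCK score of this kind can be computed as in the sketch below. The function name `pck` is hypothetical, and the `scale` argument (the length by which the threshold is normalized, for example a bounding-box size) is an assumption, since the normalizer is not stated in the text.

```python
import math


def pck(predictions, keypoints, alpha, scale):
    """Fraction of predicted locations within alpha * scale of the keypoint.

    predictions, keypoints: equal-length lists of (x, y) pairs.
    scale: normalizing length for the threshold (assumed, e.g. box size).
    """
    correct = sum(
        1
        for (px, py), (kx, ky) in zip(predictions, keypoints)
        if math.hypot(px - kx, py - ky) <= alpha * scale
    )
    return correct / len(predictions)
```

In the grounding evaluation, each prediction would be the heatmap maximum for one attribute and each keypoint the annotated location of the corresponding part.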

Attribute-based Image Retrieval In Figure 6, three examples of attribute-based image search using text-based attributes are shown. Images are ranked from high to low using the probability that the image contains the query attributes. The results are very encouraging – each image clearly contains the query attributes.

We measure the performance of attribute-based image retrieval by comparing it with ground-truth caption based retrieval for the top 50 attributes, as seen in Table 2. Note that these numbers are only approximations, as the ground-truth captions do not necessarily contain every true attribute in the image. The fact that our method performs better than random retrieval demonstrates the accuracy of the underlying image-class attribute p.d.f.
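Two of the reported quantities, recall and the true negative rate, follow directly from set comparisons between the retrieved images and the images whose captions mention the attribute. The sketch below assumes those standard definitions; the function name `retrieval_metrics` is illustrative, and the third reported column is not reproduced because its definition is not given in the text.

```python
def retrieval_metrics(retrieved, relevant, all_images):
    """Recall and true negative rate for one attribute query.

    retrieved:  image ids returned by the retrieval method
    relevant:   image ids whose ground-truth captions contain the attribute
    all_images: every image id in the evaluation set
    """
    retrieved, relevant, all_images = set(retrieved), set(relevant), set(all_images)
    true_pos = len(retrieved & relevant)
    false_neg = len(relevant - retrieved)
    false_pos = len(retrieved - relevant)
    true_neg = len(all_images - retrieved - relevant)
    recall = true_pos / (true_pos + false_neg) if relevant else 0.0
    negatives = true_neg + false_pos
    true_negative_rate = true_neg / negatives if negatives else 0.0
    return recall, true_negative_rate
```

Averaging these values over the queried attributes gives dataset-level figures.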

Figure 7: Examples of text grounding. Each column represents a different attribute and examples are shown as heatmaps indicating the region where the attribute is most present.
Recall    True Negative    Accuracy
30.9%     92.9%            27.9%
Table 2: Image retrieval measurements
Figure 8: Analysis of Network Failures (for Network Debugging). Each row represents a network failure – an incorrectly predicted class label. From left to right, each row shows the query image, canonical images for the ground-truth and incorrectly predicted classes, and explanations for each of these classes. The box below the first row provides background on differences between Tree Sparrows and Chipping Sparrows.

Network Debugging In Figure 8, we show three major patterns of network failure through textual summarization. In the first example, a Tree Sparrow is incorrectly recognized as a Chipping Sparrow because the network mistakenly thinks “long tail” is a discriminative feature. Failing to identify effective features for discrimination is the most common source of errors across the dataset. In fine-grained classification, the main challenge is to identify discriminative features for visually similar classes, whose differences are often subtle and localized to small parts.

The second example shows a Seaside Sparrow that has mistakenly been recognized as a Blue Grosbeak. From the textual explanations we ascertain that the low image quality mistakenly activates filters that correspond to blue head and blue crown. The underlying source of this error is complex – the generalization ability of the network is limited such that small perturbations in the image can result in unwanted filter responses. Such failures imply the critical importance of improving network robustness.

In the third case, the network predicts the image as a Yellow Warbler, whereas the ground-truth label is Yellow-bellied Flycatcher. According to a bird expert, the network got this correct – the ground-truth label is an error. The network correctly identifies the yellow crown and yellow head, both obvious features of the Yellow Warbler. Errors like this are not surprising because, according to [Van Horn et al.2015], roughly 4% of the class labels in the CUB dataset are incorrect. The mistake shown in Figure 1 could also be a false negative, and it indicates that the classifier may not learn to assign correct weights to discriminative features.

To quantitatively measure the accuracy of generated explanations, we compute their sentence BLEU scores, which measure the similarity of the generated sentence to the ground truth caption annotations. We show in Table 3 that, generally, explanations are more accurate for correctly classified images. Our textual explanation is not directly optimized to mimic the image caption annotations, but we would expect the explanations for incorrectly classified images to contain noisy features and thus be less accurate.

Correct   Wrong   Overall
0.415     0.381   0.409
Table 3: BLEU score
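A sentence-level BLEU score can be sketched as follows. The BLEU variant behind the reported numbers is not specified, so this version is an assumption: unsmoothed BLEU with clipped n-gram precisions up to 4-grams, whitespace tokenization, and the usual brevity penalty. The names `ngrams` and `sentence_bleu` are illustrative.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, references, max_n=4):
    """Unsmoothed sentence BLEU of a candidate against reference sentences."""
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        if not cand_counts:
            return 0.0
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        if clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / sum(cand_counts.values())))
    # Brevity penalty against the reference closest in length.
    ref_len = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    penalty = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    return penalty * math.exp(sum(log_precisions) / max_n)
```

Because the version above is unsmoothed, any candidate with no matching 4-gram scores zero; short template sentences would in practice call for a smoothed variant or a lower `max_n`.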


In this paper, we propose a novel task for network interpretation that generates textual summarizations justifying the network decision. We use publicly available captioning annotations to learn the filter-attribute relationships in an unsupervised manner. The approach builds on the intuition that filter responses are strongly correlated with specific semantic patterns. Leveraging a joint consensus of attributes across the top-activated images, we can generate the filter-level attribute p.d.f. This further enables holistic-level explanations by combining visual attributes into a natural sentence. We demonstrate the accuracy of the proposed algorithm by visual attribute grounding and attribute-based image retrieval. We employ the textual explanation as a network debugging tool and summarize common failure patterns for fine-grained recognition.

Figure 9: Top 100 most frequent noun phrases.

Future work includes experiments on additional models and datasets. The algorithm can also be generalized to learning from weaker class-level caption annotations. Word embedding methods such as word2vec [Mikolov et al.2013] can be utilized for learning to embed and group semantically similar words together. Keypoint-based annotations can be used to assign different weights for attributes according to the spatial distribution of the feature map. Potential applications include explaining adversarial examples and attribute-based zero-shot learning.


  • [Antol et al.2015] Antol, S.; Agrawal, A.; Lu, J.; Mitchell, M.; Batra, D.; Lawrence Zitnick, C.; and Parikh, D. 2015. VQA: Visual question answering. In The IEEE International Conference on Computer Vision (ICCV).
  • [Bau et al.2017] Bau, D.; Zhou, B.; Khosla, A.; Oliva, A.; and Torralba, A. 2017. Network dissection: Quantifying interpretability of deep visual representations. In Computer Vision and Pattern Recognition.
  • [Deng et al.2009] Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, 248–255. IEEE.
  • [Erhan et al.2009] Erhan, D.; Bengio, Y.; Courville, A.; and Vincent, P. 2009. Visualizing higher-layer features of a deep network. Technical Report 1341, University of Montreal.
  • [Farhadi et al.2010] Farhadi, A.; Hejrati, M.; Sadeghi, M. A.; Young, P.; Rashtchian, C.; Hockenmaier, J.; and Forsyth, D. 2010. Every picture tells a story: Generating sentences from images. In Computer Vision – ECCV 2010, 15–29. Berlin, Heidelberg: Springer Berlin Heidelberg.
  • [Goodfellow, Shlens, and Szegedy2014] Goodfellow, I. J.; Shlens, J.; and Szegedy, C. 2014. Explaining and Harnessing Adversarial Examples. ArXiv e-prints.
  • [Google] Google. Inceptionism: Going deeper into neural networks.
  • [He et al.2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep Residual Learning for Image Recognition. In CVPR.
  • [He et al.2017] He, K.; Gkioxari, G.; Dollar, P.; and Girshick, R. 2017. Mask R-CNN. In The IEEE International Conference on Computer Vision (ICCV).
  • [Hendricks et al.2016] Hendricks, L. A.; Akata, Z.; Rohrbach, M.; Donahue, J.; Schiele, B.; and Darrell, T. 2016. Generating visual explanations. In European Conference on Computer Vision, 3–19. Springer.
  • [Huang et al.2017] Huang, G.; Liu, Z.; van der Maaten, L.; and Weinberger, K. Q. 2017. Densely connected convolutional networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [Krizhevsky, Sutskever, and Hinton2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In NIPS.
  • [Lin, RoyChowdhury, and Maji2015] Lin, T.-Y.; RoyChowdhury, A.; and Maji, S. 2015. Bilinear CNN Models for Fine-Grained Visual Recognition. In ICCV.
  • [Mahendran and Vedaldi2015] Mahendran, A., and Vedaldi, A. 2015. Understanding deep image representations by inverting them. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [Mikolov et al.2013] Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
  • [Nguyen et al.2016] Nguyen, A.; Dosovitskiy, A.; Yosinski, J.; Brox, T.; and Clune, J. 2016. Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. In Advances in Neural Information Processing Systems 29.
  • [Nguyen et al.2017] Nguyen, A.; Clune, J.; Bengio, Y.; Dosovitskiy, A.; and Yosinski, J. 2017. Plug & play generative networks: Conditional iterative generation of images in latent space. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [Nguyen, Yosinski, and Clune2015] Nguyen, A.; Yosinski, J.; and Clune, J. 2015. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [Reed et al.2016] Reed, S.; Akata, Z.; Yan, X.; Logeswaran, L.; Schiele, B.; and Lee, H. 2016. Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396.
  • [Rohrbach et al.2015] Rohrbach, A.; Rohrbach, M.; Hu, R.; Darrell, T.; and Schiele, B. 2015. Grounding of Textual Phrases in Images by Reconstruction. ArXiv e-prints.
  • [Selvaraju et al.2017] Selvaraju, R. R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; and Batra, D. 2017. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In The IEEE International Conference on Computer Vision (ICCV).
  • [Sermanet, Frome, and Real2015] Sermanet, P.; Frome, A.; and Real, E. 2015. Attention for Fine-Grained Categorization. In ICLR.
  • [Simonyan and Zisserman2015] Simonyan, K., and Zisserman, A. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. In ICLR.
  • [Simonyan, Vedaldi, and Zisserman2013] Simonyan, K.; Vedaldi, A.; and Zisserman, A. 2013. Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. ArXiv e-prints.
  • [Springenberg et al.2014] Springenberg, J. T.; Dosovitskiy, A.; Brox, T.; and Riedmiller, M. 2014. Striving for Simplicity: The All Convolutional Net. ArXiv e-prints.
  • [Szegedy et al.2013] Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I.; and Fergus, R. 2013. Intriguing properties of neural networks. ArXiv e-prints.
  • [Van Horn et al.2015] Van Horn, G.; Branson, S.; Farrell, R.; Haber, S.; Barry, J.; Ipeirotis, P.; Perona, P.; and Belongie, S. 2015. Building a bird recognition app and large scale dataset with citizen scientists: The fine print in fine-grained dataset collection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 595–604.
  • [Wah et al.2011] Wah, C.; Branson, S.; Welinder, P.; Perona, P.; and Belongie, S. 2011. The caltech-ucsd birds-200-2011 dataset. Technical Report CNS-TR-2011-001.
  • [Yosinski et al.2015] Yosinski, J.; Clune, J.; Nguyen, A.; Fuchs, T.; and Lipson, H. 2015. Understanding Neural Networks Through Deep Visualization. ArXiv e-prints.
  • [Zeiler and Fergus2013] Zeiler, M. D., and Fergus, R. 2013. Visualizing and Understanding Convolutional Networks. ECCV.
  • [Zhang et al.2014] Zhang, N.; Donahue, J.; Girshick, R.; and Darrell, T. 2014. Part-Based R-CNNs for Fine-Grained Category Detection. In ECCV.
  • [Zhang et al.2017] Zhang, Q.; Cao, R.; Shi, F.; Nian Wu, Y.; and Zhu, S.-C. 2017. Interpreting CNN Knowledge via an Explanatory Graph. AAAI.
  • [Zhou et al.2014] Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; and Torralba, A. 2014. Object Detectors Emerge in Deep Scene CNNs. ICLR.
  • [Zhou et al.2016] Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; and Torralba, A. 2016. Learning deep features for discriminative localization. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).