
Visual Explanation by Interpretation: Improving Visual Feedback Capabilities of Deep Neural Networks

by   Jose Oramas, et al.

Learning-based representations have become the de facto means to address computer vision tasks. Despite their massive adoption, the amount of work aiming at understanding the internal representations learned by these models is rather limited. Existing methods aimed at model interpretation either require exhaustive manual inspection of visualizations, or link internal network activations with external "possibly useful" annotated concepts. We propose an intermediate scheme in which, given a pretrained model, we automatically identify internal features relevant for the set of classes considered by the model, without requiring additional annotations. We interpret the model through average visualizations of these features. Then, at test time, we explain the network prediction by accompanying the predicted class label with supporting heatmap visualizations derived from the identified relevant features. In addition, we propose a method to address the artifacts introduced by strided operations in deconvnet-based visualizations. Our evaluation on the MNIST, ILSVRC'12 and Fashion 144k datasets quantitatively shows that the proposed method is able to identify relevant internal features for the classes of interest while improving the quality of the produced visualizations.





1 Introduction

Figure 1: Visual explanations generated by our method. Predicted class labels are enriched with heatmaps indicating the pixel locations, associated to the features, that contributed to the prediction. Note these features may come from the object itself as well as from its context. On top of each heatmap we indicate the number of the layer where the features come from. The layer type is color-coded (green for convolutional and pink for fully connected).

In recent years, methods for learning-based representations based on deep neural networks (DNNs) have achieved impressive results in several computer vision tasks, e.g. image classification [4, 10, 13], object detection [19, 29] and image generation [3, 7]. This, combined with the general trend in the computer vision community of developing methods with a focus on high quantitative performance, has motivated the massive adoption of DNN-based methods, which produce quantitatively accurate predictions despite their black-box characteristics. In this work, we aim for more visually descriptive predictions and propose means to improve the quality of the visual feedback capabilities of DNN-based methods. Our goal is to bridge the gap between methods aiming at model interpretation, i.e. understanding what a given pre-trained model has actually learned, and methods aiming at model explanation, i.e. justifying the decisions made by a model.

Model interpretation of DNNs is commonly achieved in two ways: either by a) manually inspecting heatmap visualizations of every single filter from every layer of the network [27, 28], or, more recently, by b) exhaustively matching the internal activations produced by a given model w.r.t. a dataset with pixel-wise annotations of possibly relevant concepts [2]. These two paths have provided useful insights on the internal representations learned by DNNs. However, they have their own weaknesses. For the first case, the manual inspection of filter responses introduces a subjective bias, as was recently evidenced by [8]. In addition, the inspection of every filter on every layer becomes a cognitively expensive practice for deeper models, which makes it prone to noise. For the second case, as stated in [2], the interpretation capabilities over the network are limited by the concepts for which annotation is available. Moreover, the cost of adding annotations for new concepts is quite high due to their pixel-wise nature. A third weakness, shared by both cases, is inherited from the way in which they generate spatial filter-wise responses, i.e. either through deconvolution-based heatmaps [22, 23, 28] or by up-scaling the activation maps at a given layer/filter to the image space [2, 30]. On the one hand, deconvolution-based methods are able to produce heatmaps with a high level of detail from any filter in the network. However, as can be seen in Fig. 2, they suffer from artifacts introduced by strided operations in the back-propagation process. On the other hand, up-scaled activation maps can lose significant detail when displaying the response of filters with a large receptive field from deeper layers. Moreover, they can only be computed from convolutional layers.

In order to alleviate these issues, we start from the hypothesis, empirically supported by previous work [2, 27], that some of the internal filters of a network encode features that are important for the task that the network addresses. Based on that assumption, we propose a method in which, given a trained DNN model, we automatically identify a set of relevant internal filters whose encoded features serve as an indicator for the class of interest to be predicted. These filters can originate from any type of internal layer of the network, i.e. convolutional, fully connected, etc. This is formulated as a μ-Lasso optimization problem in which a sparse set of filter-wise responses is linearly combined in order to predict the class of interest. At test time, given a test image, a set of identified relevant filters, and a class prediction, we accompany the predicted class label with heatmap visualizations of the top-responding relevant filters for the predicted class, see Fig. 1. In addition, by improving the resampling operations within deconvnet-based methods [23, 28], we are able to address the artifacts introduced in the back-propagation process. The code and models used to generate our visual explanations will be released upon the publication of this manuscript.

Figure 2: Visualization comparison. Note how our heatmaps attenuate the grid-like artifacts introduced by deconvnet-based methods at lower layers. Likewise, our method is able to produce a more detailed visual feedback than upsampled activation maps.

Overall, the proposed method removes the requirement of additional expensive pixel-wise annotation, by relying on the same annotation used to train the initial model. Moreover, by using our own variant of a deconvolution-based method, our method is able to consider the spatial response from any filter at any layer while still providing visually pleasant feedback. This allows our method to reach some level of explanation by interpretation.

From a practical point of view, algorithms capable of communicating, as part of their output, the train of thought taken to reach a particular prediction are more likely to be trusted and adopted by end users than systems that operate in a black-box fashion. Moreover, findings and techniques developed in this direction may assist the overhaul required for current machine learning technology in order to meet recently approved legislation, i.e. EU General Data Protection Regulation (art. 13-15 & 22). This legislation requires automated decision-making processes to provide the logic involved, i.e. an explanation, followed to reach a particular decision.

The main contributions of this work are:

  • An automatic non-exhaustive method based on relevant feature selection to identify the network-encoded features that are important for the prediction of a given class. This alleviates the requirement of manual inspection and additional expensive pixel-wise annotations required by existing methods.

  • The proposed method is able to provide visual feedback with a higher level of detail than up-scaled raw activation maps [2, 30], and with improved quality over recent deconvolution-based methods with guided back-propagation [23, 28].

  • The proposed method is general enough to be applied to any type of network, independently of the type of layers that compose it.

This paper is organized as follows: in Section 2 we position our work with respect to existing work. Section 3 presents the pipeline and inner-workings of the proposed method. In Section 4, we conduct a series of experiments evaluating different aspects of the proposed method and its performance on an application case while discussing the observations and findings made throughout our evaluation. Finally, in Section 5, we conclude this paper.

2 Related Work

This work lies at the intersection between model interpretation and explanation for DNNs. These two groups of work constitute the axes along which we position our work.

Interpretation. Input modification methods [9, 28, 31] were among the earlier recent attempts at DNN interpretation. These methods see the network as a black box. They are designed to visualize properties of the function this black box represents by systematically covering (part of) the input image and measuring the difference in activations. Their assumption is that occlusion of important parts of the input will lead to a significant drop in performance. This procedure can be applied at test time to identify the regions of the image that are important for classification. However, the precision of the explanation depends on the size of the occluded region: smaller regions give a more precise explanation but increase the computation cost. We apply the assumption made by this group of work in our evaluation in order to verify the relevance of the internal network features selected by our method.
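The occlusion procedure described above can be sketched in a few lines. This is only an illustrative sketch, not the implementation of any of the cited methods: `score_fn` is a hypothetical stand-in for a network that returns the score of the class of interest, and the patch size, stride and fill value are arbitrary choices.

```python
import numpy as np

def occlusion_map(image, score_fn, patch=4, stride=4, fill=0.0):
    """Average score drop caused by covering each patch of a 2-D image."""
    base = score_fn(image)
    h, w = image.shape[:2]
    heat = np.zeros((h, w))
    counts = np.zeros((h, w))
    for top in range(0, h - patch + 1, stride):
        for left in range(0, w - patch + 1, stride):
            occluded = image.copy()
            occluded[top:top + patch, left:left + patch] = fill
            drop = base - score_fn(occluded)          # large drop = important region
            heat[top:top + patch, left:left + patch] += drop
            counts[top:top + patch, left:left + patch] += 1
    return heat / np.maximum(counts, 1)

# Hypothetical toy "network": the class score is the mean of the top-left 4x4 block.
def toy_score(img):
    return float(img[:4, :4].mean())

image = np.ones((8, 8))
heat = occlusion_map(image, toy_score)
```

Each patch position costs one forward pass, so halving the patch size and stride roughly quadruples the number of passes. This is the trade-off between precision and computation mentioned above.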

A more recent group of works focuses on linking internal DNN activations with semantic concepts. [6] proposed a feature selection method in which the neuron activations of a DNN trained with object categories are linearly combined to predict object attribute classes. Following a similar idea, [2] proposed to exhaustively match the internal activations of every filter from the convolutional layers against a dataset with pixel-wise annotated concepts. While both methods provide important insights on the semantic concepts encoded by the filters of the network, they are both limited by the concepts for which annotation is available. Similar to [6], we discover relevant internal filters through a feature selection method. Different from [6], we link internal network activations directly to the same annotations used to train the initial model. This effectively removes the expensive requirement of additional annotations.

A third line of works aims at discovering frequent mid-level visual patterns [5, 15, 18] occurring in image collections. The constraints enforced during the mining of these mid-level elements endow them with high semantic coverage, making them effective for training classifiers to address computer vision problems or as a means for summarization. We adopt the idea of using visualizations of (internal) mid-level elements as a means to reveal the relevant features encoded, internally, by a DNN. More precisely, we use the average visualizations used by these works in order to interpret, visually, what the network has actually learned.

Explanation. For the sake of brevity, we ignore methods which generate explanations via bounding boxes [12, 17] or text [11], and focus on methods capable of generating visualizations with pixel-level precision. Works belonging to this group map the activity at a given layer/filter back to the input image space. [28] proposed a multi-layered deconvolutional network (Deconvnet) which uses activations from a given top layer and reverses the path of excitatory stimuli to reveal which visual patterns from the input image are responsible for the observed activations. In parallel, [22] proposed a variation of the deconvnet where information from the lower layers and the input image is used to estimate which image regions are responsible for the activations seen at the top layers. In a similar direction, [1] aims at decomposing the classification decision into pixel-wise contributions while enforcing a layer-wise conservation principle in which the propagated quantity should be preserved between neurons from adjacent layers. Later, [23] extended these by introducing “guided back-propagation”, a technique that removes the effect of units with negative activations during the forward pass and with negative contributions in the backward pass. This resulted in improvements in sharpness and clarity of the generated visualizations. Despite the terminology used, the main difference between [22, 23, 28] lies in the way in which the ReLU operation is handled during the backward pass in order to introduce the desired effect.
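To make the last point concrete, the sketch below contrasts three ways of passing a signal backwards through a ReLU, as they are commonly described for plain back-propagation, the deconvnet, and guided back-propagation. The function name and the `mode` strings are illustrative assumptions, and the text above does not spell out which exact rule each cited implementation uses.

```python
import numpy as np

def relu_backward(forward_input, grad_out, mode):
    """Backward signal through a ReLU under three visualization rules."""
    if mode == "backprop":       # gate on the sign of the forward input
        return grad_out * (forward_input > 0)
    if mode == "deconvnet":      # gate on the sign of the backward signal
        return grad_out * (grad_out > 0)
    if mode == "guided":         # apply both gates at once
        return grad_out * (forward_input > 0) * (grad_out > 0)
    raise ValueError(f"unknown mode: {mode}")

x = np.array([1.0, -1.0, 2.0, -2.0])    # values entering the ReLU in the forward pass
g = np.array([0.5, 0.5, -0.5, -0.5])    # signal arriving from the layer above
results = {m: relu_backward(x, g, m) for m in ("backprop", "deconvnet", "guided")}
```

In this toy case only the first unit survives under the guided rule: it is the only one with both a positive forward activation and a positive backward contribution, which matches the description of guided back-propagation given above.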

[30] proposes to ignore fully connected layers and uses a Global Average Pooling (GAP) operation at the end of the last convolutional layer. This GAP operation is a weighted sum over the spatial locations of the activations of the filters of the last convolutional layer, which results in a class activation map. Finally, a heatmap is generated by upsampling the class activation map to the size of the input image.

Here, we take the deconvnet-based methods as starting point given their maturity and their ability to produce visual feedback with pixel-level precision. In addition, we replace internal operations in the backward pass with the goal of reducing visual artifacts introduced by strided operations while maintaining the network structure.

3 Proposed Method

The proposed method consists of two parts. At training time, a set of relevant layer/filter pairs is identified for every class of interest $j$, producing a relevance weight $w_j$, associated to class $j$, for every filter-wise response $x$ computed internally by the network. At test time, an image $I$ is pushed through the network, producing the class prediction $\hat{j}$. Then, taking into account the internal responses $x$ and the relevance weights $w_{\hat{j}}$ for the predicted class $\hat{j}$, we generate visualizations indicating the image regions that contributed to the prediction.

3.1 Identifying Relevant Features

One of the strengths of deep models is their ability to learn abstract concepts from simpler ones. That is, when an example is pushed into the model, a conclusion concerning a specific task can be reached as a function of the results (activations) of intermediate operations at different levels (layers) of the model. These intermediate results may hint at the “semantic” concepts that the model is taking into account when making a decision. From this observation, we make the assumption that some of the internal filters of a network encode features that are important for the task that the network addresses. To this end, we follow a procedure similar to [6], aiming to reconstruct each class $j$ by a linear combination $w_j$ of the internal activations $x$.

As an initial step, we extract the image-wise response $x_i$ by computing the $L_2$ norm of each channel (filter response) within each layer and producing a 1-dimensional descriptor by concatenating the responses from the different channels. This layer-specific descriptor is $L_1$-normalized in order to compensate for the difference in length among different layers. Finally, $x_i$ is produced by concatenating all the layer-specific descriptors. We do not consider the last layers, whose output is directly related to the classes of interest.
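The descriptor construction can be sketched as follows. This is a minimal illustration under two assumptions, namely that the per-channel norm is the L2 norm and the per-layer normalization is the L1 norm; the function name and the toy activations are hypothetical.

```python
import numpy as np

def image_descriptor(layer_activations):
    """Concatenate per-layer descriptors of filter-wise response norms.

    layer_activations: list of arrays, one per layer, with the channel
    (filter) axis first, e.g. (channels, height, width) or (units,).
    """
    parts = []
    for act in layer_activations:
        flat = act.reshape(act.shape[0], -1)
        norms = np.linalg.norm(flat, axis=1)          # assumed: L2 norm of each channel
        total = norms.sum()
        parts.append(norms / total if total > 0 else norms)   # assumed: L1-normalize per layer
    return np.concatenate(parts)

rng = np.random.default_rng(0)
conv = rng.random((3, 4, 4))    # toy convolutional layer with 3 filters
fc = rng.random(5)              # toy fully connected layer with 5 units
x = image_descriptor([conv, fc])
```

Because each layer-specific block sums to one, a layer with many filters does not dominate the descriptor simply by being longer.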

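Following the procedure from above, we construct the matrix $X \in \mathbb{R}^{m \times N}$ by passing each of the $N$ training images through the network and collecting the internal responses $x_i$, where $m$ is the total number of filter-wise responses. As such, the $i$-th image of the dataset is represented by a vector $x_i \in \mathbb{R}^{m}$ defined by the filter-wise responses at different layers. Furthermore, the possible classes that the $i$-th image belongs to are organized in a binary vector $l_i \in \{0,1\}^{C}$, where $C$ is the total number of classes of interest. Putting the annotations from all the images together produces the binary label matrix $L = [l_1, l_2, \dots, l_N]$, with $L \in \{0,1\}^{C \times N}$. With these terms in place, we resort to solving the equation:

$$W^{*} = \underset{W}{\arg\min}\; \lVert X^{T} W - L^{T} \rVert_{F}^{2} \quad \text{subject to} \quad \lVert w_j \rVert_{1} \le \mu, \;\; \forall j = 1, \dots, C \tag{1}$$

with $\mu$ a parameter that allows controlling the sparsity. This is the matrix form of the $\mu$-Lasso problem. This problem can be efficiently solved using the Spectral Gradient Projection method [16, 24]. After solving the $\mu$-Lasso problem, we have a matrix $W = [w_1, w_2, \dots, w_C]$, with $W \in \mathbb{R}^{m \times C}$. Sparsity is imposed on $W$ through the constraint on the $L_1$ norm of each $w_j$, i.e. $\lVert w_j \rVert_{1} \le \mu$. As a result, each non-zero element in $W$ represents a relevant pair of network layer $p$ and filter index $q$ (within the layer).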
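As a rough illustration of this feature selection step, the sketch below solves an L1-constrained least-squares problem of the form min ||XᵀW − Lᵀ||² subject to ||w_j||₁ ≤ μ for every class j. It uses plain projected gradient descent as a simpler stand-in for the Spectral Gradient Projection solver named above. The symbols `X` (responses, m × N), `L` (binary labels, C × N), `W` (weights, m × C) and `mu`, as well as the toy data, are assumptions made for the example.

```python
import numpy as np

def project_l1_ball(v, mu):
    """Euclidean projection of v onto the set {u : ||u||_1 <= mu}."""
    if np.abs(v).sum() <= mu:
        return v
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    k = np.arange(1, len(u) + 1)
    rho = np.nonzero(u * k > css - mu)[0][-1]
    theta = (css[rho] - mu) / (rho + 1)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def mu_lasso(X, L, mu, iters=500):
    """Projected gradient descent on ||X^T W - L^T||_F^2 with per-column L1 constraints."""
    m, C = X.shape[0], L.shape[0]
    W = np.zeros((m, C))
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)    # 1 / Lipschitz constant of the gradient
    for _ in range(iters):
        grad = 2.0 * X @ (X.T @ W - L.T)
        W = W - step * grad
        W = np.column_stack([project_l1_ball(W[:, j], mu) for j in range(C)])
    return W

# Toy data: 20 filter responses for 200 images; class 0 depends on filter 3, class 1 on filter 7.
rng = np.random.default_rng(0)
X = rng.random((20, 200))
L = np.stack([X[3] > 0.5, X[7] > 0.5]).astype(float)
W = mu_lasso(X, L, mu=1.0)
```

On this toy problem the largest weight of each column lands on the filter that actually drives the class, and lowering `mu` shrinks the remaining weights further towards zero.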

Figure 3: An example of the visual feedback provided by our method. For each input image (left) we accompany the predicted class label with heatmaps indicating the pixel locations, associated to the relevant features, that contributed to the prediction. Note that the relevant features come from the object itself as well as from its context. On top of each heatmap we indicate the number of the layer where the features come from. The layer type is color-coded, i.e. convolutional (green) and fully connected (pink).

3.2 Generating Visual Feedback

During training time (Section 3.1), we identified a set of features with relevance weights $W$ for the classes of interest. At test time, we generate the feedback visualizations by taking into account the effect of the relevant features on the content of the tested images.

Towards this goal, we push an image $I$ through the network, producing the class prediction $\hat{j}$. In parallel, we compute the internal filter-wise response vector $x$ following the procedure presented above. Then we compute the weighted response $r = w_{\hat{j}} \circ x$, where $\circ$ represents the element-wise product between two vectors. The features, i.e. layer/filter pairs $(p, q)$, with the strongest contribution to the prediction $\hat{j}$ are selected as those with maximum response in $r$. Finally, we feed this information to the deconvnet-based method with guided backpropagation [23] from [9] to visualize the important features as defined by the layer/filter pairs $(p, q)$. Following the visualization method from [9], given a filter $q$ from layer $p$ and an input image $I$, we first push the input image forward through the network, storing the activations from each filter at each layer, until reaching layer $p$. Then, we backpropagate the activations from filter $q$ at layer $p$ with inverse operations until reaching back to the input image space. As a result, we get as part of the output a set of heatmaps, associated to the relevant features defined by $(p, q)$, indicating the relative influence of the pixels that contributed to the prediction. See Fig. 3 for an example of the visual feedback provided by our method. Please refer to [9, 23, 28] for further details regarding deconvnet-based methods.
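The selection of the top-responding relevant features at test time can be sketched as below. The names are hypothetical: `W` holds one column of relevance weights per class, `x` is the filter-wise response vector of the test image, and `index_to_pair` maps each descriptor entry back to its (layer, filter) pair. The number of features kept (`top_k`) is an arbitrary choice for the example.

```python
import numpy as np

def top_relevant_features(x, W, predicted_class, index_to_pair, top_k=3):
    """Rank layer/filter pairs by the weighted response r = w * x (element-wise)."""
    r = W[:, predicted_class] * x
    order = np.argsort(r)[::-1][:top_k]           # strongest responses first
    return [(index_to_pair[i], float(r[i])) for i in order]

index_to_pair = [(1, 0), (1, 1), (2, 0), (2, 1), (3, 0)]   # (layer, filter) per descriptor entry
x = np.array([0.2, 0.5, 0.1, 0.9, 0.4])
W = np.array([[0.0, 1.0],
              [2.0, 0.0],
              [0.0, 0.0],
              [1.0, 0.0],
              [0.0, 3.0]])
selected = top_relevant_features(x, W, predicted_class=0,
                                 index_to_pair=index_to_pair, top_k=2)
```

Since the weight vector of each class is sparse, the element-wise product only needs to be evaluated on its non-zero entries, so this step adds little cost on top of the forward pass. The selected pairs are then handed to the deconvnet-based visualization.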

Figure 4: Heatmap visualization at lower layers of VGG-F. Note how our method attenuates the grid-like artifacts introduced by existing deconvnet-based (Deconv+GB) methods [23, 28].

3.3 Improving Visual Feedback Quality

Deep neural networks addressing computer vision tasks commonly push the input visual data through a sequence of operations. A common trend of this sequential processing is that the input data is internally resampled until reaching the desired prediction space. As mentioned in Sec. 2, methods aiming at interpretation/explanation start from an internal point in the network and go backwards until reaching the input space, producing a heatmap. However, due to the resampling process, heatmaps generated by the backward process tend to display grid-like artifacts. More precisely, we find that this grid effect is caused by the internal resampling introduced by network operations with stride larger than one ($s > 1$). To alleviate this effect, in the backward pass, we set the stride $s = 1$ and compensate for this change by modifying the input accordingly. As a result, the backward process can be executed while maintaining the network structure.

More formally, given a network operation block defined by a convolution mask of size $k \times k$, stride $s$, and padding $p$, the relationship between the size $h$ of its input and the size $h'$ of its output (see Fig. 5) is characterized by the following equation:

$$h' = \frac{h - k + 2p}{s} + 1 \tag{2}$$

from where,

$$h = s\,(h' - 1) + k - 2p \tag{3}$$

Our method starts from the input ($h'$), which encodes the contributions from the input image, carried by the higher layer in the deconvnet backward pass. In order to enforce a “cleaner” resampling when $s > 1$, during the backward pass, the size of the input ($h'$) of the operation block should be the same as that of the feature map produced by the forward pass if the stride were equal to one, i.e. $s = 1$. According to Eq. 3, if $s = 1$, then $h'$ should be resampled to $h' = h - k + 2p + 1$. We do this resampling via the nearest-neighbor interpolation algorithm, given its fast computation time, which makes it well suited for real-time processing. Moreover, since in this setting the changes in scale are relatively minor, its known weaknesses are avoided. By introducing this step, the network will perform the backward pass with stride $s = 1$ and the grid effect will disappear. See Fig. 4 for some examples of the improvements introduced by our method.
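The resampling step can be illustrated with a small numerical sketch. It assumes the usual relation between input size `h`, mask size `k`, padding `p` and stride `s` for a strided operation, with integer division standing in for the rounding a real framework applies. The layer parameters below (input 224, mask 11, stride 4, no padding) are illustrative values, not figures taken from the experiments.

```python
import numpy as np

def stride_one_size(h, k, p):
    """Output size of a k-wide operation with padding p and stride 1 on an input of size h."""
    return h - k + 2 * p + 1

def nearest_resize(fmap, out_h, out_w):
    """Nearest-neighbour resampling of a 2-D map."""
    rows = np.arange(out_h) * fmap.shape[0] // out_h
    cols = np.arange(out_w) * fmap.shape[1] // out_w
    return fmap[np.ix_(rows, cols)]

h, k, p, s = 224, 11, 0, 4
forward_size = (h - k + 2 * p) // s + 1     # size produced by the strided forward pass
target_size = stride_one_size(h, k, p)      # size required to run the backward pass with stride 1
signal = np.arange(forward_size ** 2, dtype=float).reshape(forward_size, forward_size)
resampled = nearest_resize(signal, target_size, target_size)
```

Here the backward signal of size 54 × 54 is resampled to 214 × 214, after which the block can be inverted with stride one and still map back to the original 224 × 224 input.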

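Figure 5: To attenuate artifacts produced by strided operations, during the backward pass we set the stride to 1 ($s = 1$) and compensate by resampling the input so that $h' = h - k + 2p + 1$, where $h$ is the size of the block input in the forward pass, $k$ the mask size and $p$ the padding.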

4 Evaluation

We conduct three sets of experiments. In the first experiment (Sec. 4.1), we verify the importance of the identified relevant features in the task addressed by the network. In the second experiment (Sec. 4.2), we qualitatively evaluate the improvements on visual quality provided by our method. Finally, in the third experiment (Sec. 4.3), we verify the potential of the proposed method on an application case.

Evaluation Protocol. We evaluate the proposed method on an image recognition task. Towards this goal, we conduct experiments on two standard image recognition datasets, i.e. MNIST [14] and ImageNet (ILSVRC'12) [20]. Additionally, we conduct experiments on a subset of cat images from ImageNet (ImageNet-Cats). MNIST contains 10 classes: the hand-written digits from 0 to 9. It is composed of 70k images in total. Of these, 60k are used for training/validation and the remaining 10k for testing. The ImageNet dataset is composed of 1k classes. Following standard practice, we measure performance on its validation set. Each class contains 50 validation images. The ImageNet-Cats subset consists of 13 different cat classes, containing both domestic and wild cats. It is composed of 17,550 images. Each class contains 1.3k images for training and 50 images for testing.

Implementation Details. In our experiments we use the pre-trained models provided as part of the MatConvNet framework [25] for both the MNIST and ImageNet datasets. For MNIST, we employ a network composed of 8 layers in total: five convolutional layers, two fully connected layers, and a final softmax layer. For the full ImageNet set, we employ a VGG-F [4] model, which is composed of 21 layers: 15 convolutional layers followed by five fully connected layers and a final softmax layer. Finally, for the ImageNet-Cats subset, we finetune the VGG-F model trained on the full ImageNet set.

Figure 6: Changes in mean classification accuracy (mCA) as a percentage of identified relevant filters is ablated.
Figure 7: Average Images from the identified relevant filters for the ImageNet-Cats subset (top) and some selected classes from the full imageNet (bottom).

Figure 8: Generated visual explanations. We accompany the predicted class label with our heatmaps indicating the pixel locations, associated to the features, that contributed to the prediction. Note that the features may come from the object itself as well as from its context. See how for the MNIST examples, some features support the existence of gaps, as to avoid confusion with another class. On top of each heatmap we indicate the number of the layer where the features come from. The layer type is color-coded, i.e. convolutional (green) and fully connected (pink).

4.1 Importance of Identified Relevant Features

In this experiment we verify the importance of the “relevant” features identified by our method at training time (see Sec. 3.1). Given a set of identified features, we evaluate the influence they have on the network by measuring changes in classification performance. To this end, we iteratively remove specific features from the network by setting the response of their corresponding layer/filter to zero. We perform this removal in decreasing order of relevance. The expected behavior is that a set of features with higher relevance will produce a stronger drop in performance when ablated. In Fig. 6, we show the changes in classification performance for the tested datasets. We report the performance of three sets of features: a) All, selected by considering the whole internal network architecture, b) OnlyConv, selected by considering only the convolutional layers of the network, and c) a Random selection of features in the network. Note that the OnlyConv method makes the assumption that relevant features are only present in the convolutional layers. This is similar to the assumption made by state-of-the-art methods [2, 30]. When performing feature selection (Sec. 3.1), we use one value of the sparsity parameter $\mu$ for MNIST and ImageNet-Cats, and another for the complete ImageNet. This produces subsets of 46/28, 92/101 and 9985/7760 relevant features for the All/OnlyConv methods on the respective datasets. Differences in the number of selected features can be attributed to possibly redundant or missing predictive information between the initial pools of filter responses used to select the All and OnlyConv features. For Random, we randomly select a number of filter responses equal to that of All and report the average performance over 10 trials.

A quick inspection of Fig. 6 shows that indeed classification performance drops as we remove the identified features, All and OnlyConv. Moreover, it is noticeable that a random removal of features has minimal effects on classification accuracy. This demonstrates the relevance of the identified features for the classes of interest.

In addition, it is visible that the method that considers the complete internal structure, i.e. All, suffers a stronger drop in performance compared to the OnlyConv which only considers features produced by the convolutional layers. This suggests that there is indeed important information encoded in the fully connected layers, and while convolutional layers are a good source for features, focussing on them does not reveal the full story.
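The ablation protocol used in this experiment can be sketched on a toy problem. Everything here is a simplified stand-in: the "network" is a hypothetical function `classify` acting on a flat array of filter responses, and `relevance` is an assumed per-feature score (for instance the largest absolute weight of the feature over all classes), since the exact ranking rule is not detailed in the text.

```python
import numpy as np

def ablation_curve(activations, labels, classify, relevance, fractions):
    """Accuracy after zeroing the most relevant features first.

    activations: (N, m) filter responses; classify: maps (N, m) to predicted labels;
    relevance: (m,) score per feature, zero for features that were not selected.
    """
    order = np.argsort(relevance)[::-1]                   # decreasing relevance
    n_relevant = int(np.count_nonzero(relevance))
    accuracies = []
    for frac in fractions:
        ablated = activations.copy()
        n_removed = int(round(frac * n_relevant))
        ablated[:, order[:n_removed]] = 0.0               # set the removed responses to zero
        accuracies.append(float(np.mean(classify(ablated) == labels)))
    return accuracies

rng = np.random.default_rng(0)
acts = rng.random((100, 6))
labels = (acts[:, 2] > acts[:, 4]).astype(int)            # toy task decided by features 2 and 4
classify = lambda a: (a[:, 2] > a[:, 4]).astype(int)      # stand-in for the rest of the network
relevance = np.array([0.0, 0.0, 0.9, 0.0, 0.7, 0.0])
curve = ablation_curve(acts, labels, classify, relevance, fractions=[0.0, 0.5, 1.0])
```

In an actual network the zeroed responses also change everything computed downstream, so the drop has to be measured by re-running the forward pass, as done for Fig. 6. The toy classifier above only mimics that dependence.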

But… does it make sense? In order to get a qualitative insight into the type of information that these features encode, we compute an average visualization by considering the top 100 image patches where such features have a high response. Towards this goal, given the set of identified relevant features, for every class we select the images with the highest responses. Then, we take the input image at the location with maximum response for a given filter and crop it by considering the receptive field of the corresponding layer/filter of interest. Selected examples of average images, with rich semantic representation, are presented in Fig. 7 for the full ImageNet and the ImageNet-Cats subset.
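The average visualization can be sketched as follows. The sketch assumes that the location of the peak response of a filter has already been mapped to image coordinates and that the receptive field is square; `average_patch` and its arguments are hypothetical names, and zero-padding at the image border is an arbitrary choice.

```python
import numpy as np

def average_patch(images, responses, centers, rf_size, top_n=100):
    """Mean of the patches cropped around the strongest responses of one filter.

    images: (N, H, W) array; responses: (N,) peak response per image;
    centers: (N, 2) integer (row, col) of each peak in image coordinates;
    rf_size: side of the square receptive field, in pixels.
    """
    half = rf_size // 2
    # Zero-pad so that crops centred near the border stay inside the array.
    padded = np.pad(images, ((0, 0), (half, half), (half, half)))
    best = np.argsort(responses)[::-1][:top_n]
    patches = [padded[i, r:r + rf_size, c:c + rf_size]
               for i, (r, c) in zip(best, centers[best])]
    return np.mean(patches, axis=0)

images = np.zeros((5, 8, 8))
centers = np.array([[2, 3], [0, 0], [7, 7], [4, 4], [5, 1]])
for i, (r, c) in enumerate(centers):
    images[i, r, c] = 1.0                 # each toy image has one bright pixel at its peak
responses = np.array([0.9, 0.8, 0.7, 0.2, 0.1])
avg = average_patch(images, responses, centers, rf_size=3, top_n=3)
```

Because every crop is centred on the peak, structures that consistently trigger the filter line up across patches and remain visible in the mean, while unrelated content averages out.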

We can notice that for ImageNet-Cats, the identified features cover descriptive characteristics of the considered cat classes, for example the dark head of a Siamese cat, the nose/mouth of a cougar, or the fluffy white body shape of a Persian cat. Likewise, the method effectively identifies the descriptive fur patterns of the jaguar, leopard and tiger classes, as well as colors which are related to the background. We see a similar effect on a selection of other objects from the rest of the ImageNet dataset. For instance, for scene-type classes, i.e. coast, castle and church, the identified features focus on the outline of such scenes. Similarly, we notice different viewpoints for animal-type classes, i.e. golden-retriever, hen, robin and magpie.

Finally, in Fig. 8 we show some examples of the visual explanations produced by our method. We aggregate the predicted class label with our heatmap visualizations indicating the pixel locations, associated to the relevant features, that contributed to the prediction. For the case of the ILSVRC ’12 examples, we notice that the relevant features come from the object itself as well as from its context. For the case of the MNIST examples, in addition to the features firing on the object, there are features that support the existence of a gap (background), as to emphasize that the object is not filled there and avoid confusion with another class. For example, see for class how it speaks against and for how it goes against .

4.2 Visual Feedback Quality

In this section we assess the quality of the visualizations generated by the proposed method as part of the prediction feedback. Towards this goal, we compare our visualizations with upsampled activation maps from internal layers [2, 30] and with the output of deconvnet combined with guided backpropagation [23]. In Fig. 9 we present qualitative examples showing the heatmap visualizations for the different methods. We show these visualizations for different layer/filter locations throughout the network.

A quick inspection reveals that our method to attenuate the grid-like artifacts introduced by deconvnet methods (see Sec. 3.3) indeed produces noticeable improvements for lower layers. See Fig. 4 for additional examples presenting this difference at lower layers. Likewise, for the case of higher layers (Fig. 9), the proposed method provides a more precise visualization when compared to upsampled activation maps. In fact, the rough output produced by the activation maps at higher layers has a saliency-like behavior that gives the impression that the network is focusing on a larger region of the image. This could be one reason why, in earlier works [31], manual inspection of network activations suggested that the network was focusing on “semantic” parts. Please see [8] for an in-depth discussion of this observation. Finally, for the case of FC layers, using upsampled activation maps is not applicable. Please refer to the supplementary material for additional examples.

Figure 9: Pixel effect visualization for different methods. Note how for lower layers (8/21), our method attenuates the grid-like artifacts introduced by deconvnet methods. Likewise, for higher layers (15/21), our method provides a more precise visualization when compared to upsampled activation maps. For the case of FC layers (20/21), using upsampled activation maps is not applicable.

4.3 Application Case: Visual Geolocation

In this section we test the capabilities of the proposed method in a more realistic application setting. Towards this goal, we focus on the image-based geolocation task. Similar to [26], we approach this task as a classification problem where given an image, the goal is to predict the location class where the image was taken.

We test our method on a subset of the Fashion 144k dataset [21] composed of 12k images covering 12 city classes. A notable characteristic of this dataset is that it consists of images whose content is mostly focused on a single individual. Moreover, many of these images are taken indoors, thus providing reduced contextual information. This makes this application different from previous geolocation works, which focus on streetview-like images.

To obtain a reference for the difficulty of the geolocation task, we conducted a survey asking people to determine where a given photo was taken. Each time, one image is presented to the participant, who is asked to select one city from the list of 12 possibilities. There were 123 participants in the survey. We conduct experiments using a finetuned VGG - [4] model pretrained on ImageNet (ILSVRC ’12) [20]. We perform feature identification under a sparsity constraint, which produced 120 relevant features. Similar to Sec. 4.1, in Fig. 10 we show quantitative results regarding the effect that ablating the identified relevant features from the network has on classification performance. In addition, we add the absolute mean performance achieved by the participants of our survey (Human). Fig. 11 shows qualitative results of our experiments.
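The ablation protocol can be illustrated with a small, self-contained sketch. This is not the authors' code: it assumes a toy linear classifier in place of the finetuned network, synthetic features in place of filter responses, and an already known relevance ranking. The helper names `mean_class_accuracy`, `ablate` and `predict` are hypothetical. It zeroes out the responses of the top-ranked features and measures how mean class accuracy (mCA) changes.

```python
import numpy as np

def mean_class_accuracy(y_true, y_pred, n_classes):
    """Average of per-class accuracies over the classes present in y_true."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    per_class = [
        np.mean(y_pred[y_true == c] == c)
        for c in range(n_classes)
        if np.any(y_true == c)
    ]
    return float(np.mean(per_class))

def ablate(features, feature_ids):
    """Return a copy of the feature matrix with the chosen columns zeroed."""
    out = np.array(features, dtype=float, copy=True)
    idx = np.asarray(list(feature_ids), dtype=int)
    out[:, idx] = 0.0
    return out

def predict(features, weights):
    """Hypothetical linear classifier standing in for the network's top layers."""
    return np.argmax(features @ weights, axis=1)

# Toy data (assumed): 3 classes, 6 features; features 0..2 each indicate one class.
rng = np.random.default_rng(0)
y = np.repeat(np.arange(3), 20)
X = 0.01 * rng.random((60, 6))
X[np.arange(60), y] += 1.0
W = np.zeros((6, 3))
W[[0, 1, 2], [0, 1, 2]] = 1.0

relevant = [0, 1, 2]  # assumed relevance ranking, most relevant first
curve = []
for k in range(len(relevant) + 1):
    mca = mean_class_accuracy(y, predict(ablate(X, relevant[:k]), W), 3)
    curve.append(mca)
    print(f"ablated {k} relevant features: mCA = {mca:.2f}")
```

Plotting `curve` against the fraction of ablated features gives a toy counterpart of the kind of ablation curve reported for the real network. Averaging per-class accuracies, rather than pooling all samples, keeps large classes from dominating the score.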

Discussion. A quick glance at Fig. 10 reveals that the DNN methods outperform Human by a very wide margin. Moreover, Human achieves a very low performance, only slightly above random guess. This further motivates us to look deeper and find out what cues the DNNs used that may be overlooked by humans. We can also see in Fig. 10 the same two trends seen in previous experiments: i) classification accuracy drops significantly as we ablate the identified features, and ii) in this setting, there are important information gains from taking into account the FC layers in addition to the convolutional layers. This shows the relevance of features from all layers (All) beyond those of the convolutional layers alone (OnlyConv).
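As a rough reference for the random-guess level mentioned above, consider a guesser that picks one of the 12 cities uniformly at random, independently of the image. This is an assumption made here for illustration, and the symbols $p_c$ (fraction of test images from class $c$) and $\hat{y}$ (the guess) are not from the original text. The expected accuracy is then:

```latex
\mathbb{E}[\mathrm{acc}]
  = \sum_{c=1}^{12} p_c \, P(\hat{y} = c)
  = \sum_{c=1}^{12} p_c \cdot \frac{1}{12}
  = \frac{1}{12} \approx 8.3\%
```

Since $\sum_c p_c = 1$, this value does not depend on how the images are distributed over the 12 classes.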

From the average visualizations (Fig. 11) we can notice that for the ‘Los Angeles’ and ‘Miami’ classes, features related to uncovered legs and skin color are of importance. Similarly, for other classes, e.g. ‘Melbourne’, vegetation seems to be quite common, hence the high correlation with the color green. For ‘London’ and ‘Paris’, some of the relevant features seem to focus on covered legs, while others focus on the upper body. For the ‘North Europe region’, the color white seems to be a strong feature, present on walls and in natural landscapes. For this same region, dressing in dark colors also seems to be a descriptive feature. Interestingly, some of these average images depict upper body parts, while others focus on persons with dark long hair, short hair, or light hair, thus describing some geographic trends.

Our average visualization provides a clear answer to the question of “what has the network actually learned?”. It shows that the model effectively exploits human-related features (leg clothing, hair length/color, clothing color) as well as background-related features (mainly covered by color/gradients and texture patterns) that may have been considered irrelevant by the surveyed participants. Our visual explanations (Fig. 11 (bottom)) show that the model effectively uses these types of features to reach its decision. This may explain why the model outperformed the surveyed participants by such a margin.

Figure 10: Changes in mean classification accuracy (mCA) as a percentage of relevant filters is ablated.

Figure 11: Average Images from the identified relevant filters (top) and generated visual explanations justifying the prediction made by the network (bottom).

5 Conclusion

We propose a method to enrich the prediction made by deep neural networks by indicating the features that contributed to such prediction. This gives our method visual explanation capabilities. Our method identifies features internally encoded by the network that are relevant for the task originally addressed by the network. Moreover, it allows interpretation of these features through the generation of average feature-wise visualizations. In addition, we have proposed a method to attenuate the artifacts introduced by strided operations in visualizations made by deconvnet-based methods. This provides richer visual feedback with pixel-level precision without requiring additional annotations for supervision. Future work will focus on linking the identified relevant features with text-based representations, with the goal of enriching our visual explanations with human-readable text explanations.

Acknowledgments: This work was partially supported by the KU Leuven PDM Grant PDM/16/131, and an NVIDIA Academic Hardware Grant.


  • [1] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE, 10(7):e0130140, 07 2015.
  • [2] D. Bau, B. Zhou, A. Khosla, A. Oliva, and A. Torralba. Network dissection: Quantifying interpretability of deep visual representations. In Computer Vision and Pattern Recognition (CVPR), 2017.
  • [3] B. D. Brabandere, X. Jia, T. Tuytelaars, and L. V. Gool. Dynamic filter networks. In Neural Information Processing Systems (NIPS), 2016.
  • [4] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: delving deep into convolutional nets. In British Machine Vision Conference (BMVC), 2014.
  • [5] C. Doersch, S. Singh, A. Gupta, J. Sivic, and A. A. Efros. What makes paris look like paris? Communications of the ACM, 58(12):103–110, 2015.
  • [6] V. Escorcia, J. C. Niebles, and B. Ghanem. On the relationship between visual attributes and convolutional networks. In Computer Vision and Pattern Recognition (CVPR), 2015.
  • [7] A. Ghosh, V. Kulharia, V. P. Namboodiri, P. H. S. Torr, and P. K. Dokania. Multi-agent diverse generative adversarial networks. In arXiv 1704.02906, 2017.
  • [8] A. Gonzalez-Garcia, D. Modolo, and V. Ferrari. Do semantic parts emerge in convolutional neural networks? International Journal of Computer Vision (IJCV), pages 1–19, 2017.
  • [9] F. Grün, C. Rupprecht, N. Navab, and F. Tombari. A taxonomy and library for visualizing learned features in convolutional neural networks. In International Conference on Machine Learning (ICML) Workshops, 2016.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition (CVPR), 2016.
  • [11] L. A. Hendricks, Z. Akata, M. Rohrbach, J. Donahue, B. Schiele, and T. Darrell. Generating visual explanations. In B. Leibe, J. Matas, N. Sebe, and M. Welling, editors, European Conference on Computer Vision (ECCV), 2016.
  • [12] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4):664–676, April 2017.
  • [13] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Neural Information Processing Systems (NIPS), 2012.
  • [14] Y. LeCun and C. Cortes. MNIST handwritten digit database. 2010.
  • [15] Y. Li, L. Liu, C. Shen, and A. v. d. Hengel. Mining mid-level visual patterns with deep cnn activations. International Journal of Computer Vision (IJCV), 121(3):344–364, 2017.
  • [16] J. Mairal, F. Bach, and J. Ponce. Sparse modeling for image and vision processing. Found. Trends. Comput. Graph. Vis., 8(2-3):85–283, 2014.
  • [17] J. Oramas M. and T. Tuytelaars. Modeling visual compatibility through hierarchical mid-level elements. CoRR, abs/1604.00036, 2016.
  • [18] K. Rematas, B. Fernando, F. Dellaert, and T. Tuytelaars. Dataset fingerprints: Exploring image collections through data mining. In Computer Vision and Pattern Recognition (CVPR), 2015.
  • [19] S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN: towards real-time object detection with region proposal networks. In Neural Information Processing Systems (NIPS), 2015.
  • [20] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
  • [21] E. Simo-Serra, S. Fidler, F. Moreno-Noguer, and R. Urtasun. Neuroaesthetics in fashion: modeling the perception of fashionability. In Computer Vision and Pattern Recognition (CVPR), 2015.
  • [22] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. In International Conference on Learning Representations (ICLR) Workshops, 2014.
  • [23] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. A. Riedmiller. Striving for simplicity: the all convolutional net. In International Conference on Learning Representations (ICLR) Workshops, 2015.
  • [24] E. van den Berg and M. P. Friedlander. Probing the pareto frontier for basis pursuit solutions. SIAM J. Sci. Comput., 31(2):890–912, 2008.
  • [25] A. Vedaldi and K. Lenc. Matconvnet: Convolutional neural networks for matlab. In MM, 2015.
  • [26] T. Weyand, I. Kostrikov, and J. Philbin. Planet - photo geolocation with convolutional neural networks. In European Conference on Computer Vision (ECCV), 2016.
  • [27] J. Yosinski, J. Clune, A. M. Nguyen, T. J. Fuchs, and H. Lipson. Understanding neural networks through deep visualization. 2015.
  • [28] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision (ECCV), 2014.
  • [29] X. Zeng, W. Ouyang, J. Yan, H. Li, T. Xiao, K. Wang, Y. Liu, Y. Zhou, B. Yang, Z. Wang, H. Zhou, and X. Wang. Crafting gbd-net for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, PP(99):1–1, 2017.
  • [30] B. Zhou, A. Khosla, À. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In Computer Vision and Pattern Recognition (CVPR), 2016.
  • [31] B. Zhou, A. Khosla, À. Lapedriza, A. Oliva, and A. Torralba. Object detectors emerge in deep scene cnns. In International Conference on Learning Representations (ICLR), 2015.