Exemplary Natural Images Explain CNN Activations Better than Feature Visualizations

10/23/2020 ∙ by Judy Borowski, et al.

Feature visualizations such as synthetic maximally activating images are a widely used explanation method to better understand the information processing of convolutional neural networks (CNNs). At the same time, there are concerns that these visualizations might not accurately represent CNNs' inner workings. Here, we measure how much extremely activating images help humans to predict CNN activations. Using a well-controlled psychophysical paradigm, we compare the informativeness of synthetic images (Olah et al., 2017) with a simple baseline visualization, namely exemplary natural images that also strongly activate a specific feature map. Given either synthetic or natural reference images, human participants choose which of two query images leads to strong positive activation. The experiment is designed to maximize participants' performance, and is the first to probe intermediate instead of final layer representations. We find that synthetic images indeed provide helpful information about feature map activations (82% accuracy). However, natural images, originally intended to be a baseline, outperform synthetic images by a wide margin (92% accuracy). Additionally, participants are faster and more confident for natural images, whereas subjective impressions about the interpretability of feature visualizations are mixed. The higher informativeness of natural images holds across most layers, for both expert and lay participants as well as for hand- and randomly-picked feature visualizations. Even if only a single reference image is given, synthetic images provide less information than natural images (65% accuracy for synthetic references). In summary, popular synthetic images from feature visualizations are significantly less informative for assessing CNN activations than natural images. We argue that future visualization methods should improve over this simple baseline.




1 Introduction

As Deep Learning methods are being deployed across society, academia and industry, the need to understand their decisions becomes ever more pressing. A “right to explanation” is even required by law in the European Union (GDPR, 2016; Goodman and Flaxman, 2017). Fortunately, the field of interpretability or explainable artificial intelligence (XAI) is also growing: not only are discussions on goals and definitions of interpretability advancing (Doshi-Velez and Kim, 2017; Lipton, 2018; Gilpin et al., 2018; Murdoch et al., 2019; Miller, 2019; Samek et al., 2020), but the number of explanation methods is rising, their maturity is evolving (Zeiler and Fergus, 2014; Ribeiro et al., 2016; Selvaraju et al., 2017; Kim et al., 2018), and they are tested and used in real-world scenarios like medicine (Cai et al., 2019; Kröll et al., 2020) and meteorology (Ebert-Uphoff and Hilburn, 2020).

We here focus on the popular post-hoc explanation method (or interpretability method) of feature visualizations via activation maximization (also known as input maximization or maximally exciting images). First introduced by Erhan et al. (2009) and subsequently improved by many others (Mahendran and Vedaldi, 2015; Nguyen et al., 2015; Mordvintsev et al., 2015; Nguyen et al., 2016a, 2017), these synthetic, maximally activating images seek to visualize features that a specific network unit, feature map or a combination thereof is selective for. However, feature visualizations are surrounded by great controversy: how accurately do they represent a CNN’s inner workings, or in short, how useful are they? This is the guiding question of our study.
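The core idea of activation maximization can be sketched in a few lines: starting from some input, one performs gradient ascent on the input so as to increase a chosen unit's activation. The following toy sketch (all names are ours, purely illustrative) replaces the CNN with a single linear unit so the gradient is available in closed form; real feature visualization methods obtain the gradient by automatic differentiation and add regularizers on top of this loop.

```python
import numpy as np

# Toy sketch of activation maximization: the "network response" is a
# single linear unit w @ x, whose gradient w.r.t. the input x is w.
rng = np.random.default_rng(0)
w = rng.normal(size=64)        # hypothetical unit weights
x = np.zeros(64)               # start from a blank "image"

def activation(x, w):
    return float(w @ x)

lr = 0.1
for _ in range(100):
    # gradient ascent step, then clip to keep the input in a valid range
    x = np.clip(x + lr * w, -1.0, 1.0)

# the optimized input now activates the unit far more than the start point
```

The many regularization choices discussed below (which make synthetic images look less noisy) are exactly what this bare-bones sketch omits.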

On the one hand, many researchers are convinced that feature visualizations are interpretable (Graetz, 2019) and that “features can be rigorously studied and understood” (Olah et al., 2020b). Other applications from computer vision and natural language processing also support the view that features are meaningful (Mikolov et al., 2013; Karpathy et al., 2015; Radford et al., 2017; Zhou et al., 2014; Bau et al., 2017, 2020). As such, a particular milestone for CNNs was the understanding that features are formed in a hierarchical fashion (LeCun et al., 2015; Güçlü and van Gerven, 2015; Goodfellow et al., 2016). Over the past few years, extensive investigations to better understand CNNs have been based on feature visualizations (Olah et al., 2020b, a; Cammarata et al., 2020; Cadena et al., 2018), and the technique is being combined with other explanation methods (Olah et al., 2018; Carter et al., 2019; Addepalli et al., 2020; Hohman et al., 2019).

On the other hand, feature visualizations can be as much art and engineering as they are science: vanilla methods look noisy, so human-defined regularization mechanisms are introduced. But do the resulting beautiful visualizations accurately show what a CNN is selective for? How representative are the seemingly well-interpretable, “hand-picked” (Olah et al., 2017) synthetic images in publications for the entirety of all units in a network, a concern raised by e.g. Kriegeskorte (2015)? What if the features that a CNN is truly sensitive to are imperceptible instead, as might be suggested by the existence of adversarial examples (Szegedy et al., 2013; Ilyas et al., 2019)? Morcos et al. (2018) even suggest that units with easily understandable features play a less important role in a network. Another criticism of synthetic maximally activating images is that they only visualize extreme features, while potentially leaving undetected other features that elicit only a fraction of the maximal activation. Also, polysemantic units (Olah et al., 2020b), i.e. units that are highly activated by different semantic concepts, as well as the importance of combinations of units (Olah et al., 2017, 2018; Fong and Vedaldi, 2018), already hint at the complexity of how concepts are encoded.

One way to advance this debate is to measure the utility of feature visualizations in terms of their helpfulness for humans. In this study, we therefore design well-controlled psychophysical experiments that aim to quantify the informativeness of the popular visualization method by Olah et al. (2017). Specifically, participants choose which of two natural images would elicit a higher activation in a CNN given a set of reference images that visualize the network selectivities. We use natural query images because real-world applications of XAI require understanding model decisions to natural inputs. To the best of our knowledge, our study is the first to probe how well humans can predict intermediate CNN activations. Our data shows that:

  • Synthetic images provide humans with helpful information about feature map activations.

  • Exemplary natural images are even more helpful.

  • The superiority of natural images holds across the network and various conditions.

  • Subjective impressions of the interpretability of the synthetic visualizations vary greatly between participants.

2 Related Work

Significant progress has been made in recent years towards understanding CNNs for image data. Here, we mention a few selected methods as examples of the plethora of approaches to understanding CNN decision-making: saliency maps show the importance of each pixel for the classification decision (Springenberg et al., 2014; Bach et al., 2015; Smilkov et al., 2017; Zintgraf et al., 2017), concept activation vectors show a model’s sensitivity to human-defined concepts (Kim et al., 2018), and other methods, among them feature visualizations, focus on explaining individual units (Bau et al., 2020). Some tools integrate interactive, software-like aspects (Hohman et al., 2019; Wang et al., 2020; Carter et al., 2019; Collaris and van Wijk, 2020; OpenAI, 2020), combine more than one explanation method (Shi et al., 2020; Addepalli et al., 2020) or make progress towards automated explanation methods (Lapuschkin et al., 2019; Ghorbani et al., 2019b). As overviews, we recommend Zhang and Zhu (2018); Montavon et al. (2018) and Samek et al. (2020).

Despite their great insights, challenges for explanation methods remain. Oftentimes, these techniques are criticized as being over-engineered; regarding feature visualizations, this concerns the loss function and the techniques used to make the synthetic images look interpretable. For purely decision-level explanation methods like saliency maps (which do not aim to explain intermediate representations), studies have shown that the putative explanations can be similar for trained and randomized networks (Adebayo et al., 2018), that they can be affected by meaningless data pre-processing steps (Kindermans et al., 2017), and that they are vulnerable to adversarial attacks (Ghorbani et al., 2019a).

In order to further advance XAI, scientists advocate different directions. Besides the focus on developing additional methods, some researchers (e.g. Olah et al. (2020b)) promote the “natural science” approach, i.e. studying a neural network extensively and making empirical claims until falsification. Yet another direction is to quantitatively evaluate explanation methods. So far, only decision-level explanation methods have been studied in this regard. Quantitative evaluations can either be realized with humans directly or with mathematically-grounded models as an approximation for human perception. Many of the latter approaches show great insights (e.g. Hooker et al. (2019); Nguyen and Martínez (2020); Fel and Vigouroux (2020); Lin et al. (2020); Tritscher et al. (2020); Tjoa and Guan (2020)). However, a recent study demonstrates that metrics of the explanation quality computed without human judgment are inconclusive and do not correspond to the human rankings (Biessmann and Refiano, 2019). Additionally, Miller (2019) emphasizes that XAI should build on existing research in philosophy, cognitive science and social psychology.

Figure 2: Example trial in the psychophysical experiments. A participant is shown minimally and maximally activating reference images for a certain feature map on the sides and is asked to select the image from the center that also strongly activates that feature map. The answer is given by clicking on a number according to the participant’s confidence level (1: not confident, 2: somewhat confident, 3: very confident). After each trial, the participant receives feedback on which image was indeed the maximally activating one. For screenshots of each step in the task, see Appendix Fig. 7.

The body of literature on human evaluations of explanation methods is growing: Various combinations of data types (tabular, text, static images), task set-ups and participant pools (experts vs. laypeople, on-site vs. crowd-sourcing) are being explored. However, these studies all aim to investigate final model decisions and do not probe intermediate activations like our experiments do. For a detailed table of related studies, see Appendix Sec. A.3. A commonly employed task paradigm is the “forward simulation / prediction” task, first introduced by Doshi-Velez and Kim (2017): Participants guess the model’s computation based on an input and an explanation. As there is no absolute metric for the goodness of explanation methods (yet), comparisons are always performed within studies, typically against baselines. The same holds for additional data collected for confidence or trust ratings. Unfortunately, how differences between conditions should be interpreted is often unclear. According to the current literature, studies reporting positive effects of explanations (e.g. Kumarakulasinghe et al. (2020)) slightly outweigh those reporting inconclusive (e.g. Alufaisan et al. (2020); Chu et al. (2020)) or even negative effects (e.g. Shen and Huang (2020)).

To our knowledge, no study has yet evaluated the popular explanation method of feature visualizations and how it improves human understanding of intermediate network activations. This study therefore closes an important gap: by presenting data for a forward prediction task on a CNN, we provide a quantitative estimate of the informativeness of maximally activating images. Furthermore, our data is unique in that it probes for the first time how well humans can predict intermediate model activations.

3 Methods

We perform two human psychophysical studies (code is available at https://github.com/bethgelab/testing_visualizations) with different foci (Experiment I and Experiment II). In both studies, the task is to choose the one image out of two natural query images (two-alternative forced choice paradigm) that the participant considers to be a maximally activating image, given some reference images (see Fig. 2). Apart from image choices, we record participants’ confidence levels and their reaction times. Specifically, responses are given by clicking on the confidence levels belonging to either query image. In order to gain insights into how intuitive participants find feature visualizations, their subjective judgments are collected in a separate task and a dynamic conversation after the experiment (for details, see Appendix Sec. A.1.1 and Appendix Sec. A.2.5).
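The logic of a single trial can be summarized in a small sketch (function and variable names are ours, not from the released code): the query image with the higher true feature-map activation is the correct answer, and the participant's click encodes both the chosen image and a confidence level.

```python
# Hypothetical scoring of one two-alternative forced choice trial.
def score_trial(act_query_a, act_query_b, chosen_image, confidence):
    """Return whether the participant picked the truly maximally
    activating query image, together with their confidence rating."""
    correct_image = "a" if act_query_a > act_query_b else "b"
    return chosen_image == correct_image, confidence

# e.g. a participant confidently picks image "a", which indeed has
# the higher feature-map activation:
is_correct, conf = score_trial(3.2, 1.1, "a", confidence=3)
```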

All design choices are made with two main goals: (1) allowing participants to achieve the best performance possible, to approximate an upper bound on the helpfulness of the explanation method, and (2) gaining a general impression of the helpfulness of the examined method. As an example, we choose the natural query images from among those with the lowest and highest activations (→ best possible performance) and test many different feature maps across the network (→ generality). For more details on the human experiment beyond the ones below, see Appendix Sec. A.1.

In Experiment I, we focus on comparing the performance of synthetic images to two baseline conditions: natural reference images and no reference images. In Experiment II, we compare lay vs. expert participants as well as different presentation schemes of reference images. Expert participants qualify by being familiar with CNNs. Regarding presentation schemes, we vary whether only maximally or both maximally and minimally activating images are shown, as well as how many example images of each of these are presented (1 or 9).

Following the existing work on feature visualization (Olah et al., 2017, 2018, 2020b, 2020a), we use an Inception V1 network (also known as GoogLeNet; Szegedy et al., 2015) trained on ImageNet (Deng et al., 2009; Russakovsky et al., 2015). The synthetic images throughout this study are the optimization results of the feature visualization method by Olah et al. (2017). The natural stimuli are selected from the validation set of the ImageNet ILSVRC 2012 dataset (Russakovsky et al., 2015) according to their activations for the feature map of interest.
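Selecting natural reference images then amounts to ranking the dataset by per-image activation. A minimal sketch (assuming `acts` already holds one scalar per image, e.g. the spatial mean of the feature map's response; obtaining it would require a forward pass of the CNN over the validation set):

```python
import numpy as np

# Stand-in for per-image activations of one feature map over a dataset;
# in the real pipeline these would come from the Inception V1 forward pass.
rng = np.random.default_rng(1)
acts = rng.normal(size=500)

order = np.argsort(acts)       # indices sorted by ascending activation
min_refs = order[:9]           # 9 minimally activating images
max_refs = order[-9:]          # 9 maximally activating images
```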

4 Results

In this section, all figures show data from Experiment I except Fig. 5A+C (Experiment II). All figures for Experiment II, which replicate the findings of Experiment I, as well as additional figures for Experiment I, can be found in Appendix Sec. A.2. Note that, unless explicitly noted otherwise, error bars denote two standard errors of the mean (SEM) of the participant-average metric.
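This error-bar convention is a one-line computation; a sketch with made-up accuracies (one value per participant):

```python
import numpy as np

acc = np.array([0.85, 0.90, 0.78, 0.88, 0.95, 0.82])  # per-participant accuracy
mean = acc.mean()
sem = acc.std(ddof=1) / np.sqrt(len(acc))  # standard error of the mean
halfwidth = 2 * sem                        # error bars span two SEMs
```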

4.1 Participants are better, more confident and faster with natural images

Synthetic images can be helpful: given synthetic reference images generated via feature visualization (Olah et al., 2017), participants are able to predict whether a certain network feature map prefers one query image over the other with an accuracy of 82%, which is well above chance level (50%) (see Fig. 3A). However, performance is even higher in what we intended to be the baseline condition: natural reference images (92%). Additionally, for correct answers, participants much more frequently report being highly certain on natural relative to synthetic trials (see Fig. 3B), and their average reaction time is shorter when seeing natural rather than synthetic reference images (see Fig. 3C). Taken together, these findings indicate that in our setup, participants are not just better overall, but also more confident and substantially faster on natural images.

Figure 3: Participants are better, more confident and faster at judging which of two query images causes higher unit activation with natural than with synthetic reference images. A: Performance. Given synthetic reference images, participants are well above chance (proportion correct: 82%), but even better for natural reference images (92%). Without reference images (baseline condition “None”), participants are close to chance. B: Confidence. Participants are much more confident (higher rating = more confident) for natural than for synthetic images on correctly answered trials. C: Reaction time. For correctly answered trials, participants are on average faster when presented with natural than with synthetic reference images. We show additional plots on confidence and reaction time for incorrectly answered trials and all trials in the Appendix (Fig. 15); for Experiment II, see Fig. 16. The p-values in A and C correspond to Wilcoxon signed-rank tests.
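The kind of paired comparison referenced in the caption can be sketched with SciPy. The accuracies below are made-up per-participant values for the two conditions, not the study's actual data:

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-participant accuracies under the two conditions
natural   = np.array([0.950, 0.900, 0.930, 0.880, 0.940, 0.910])
synthetic = np.array([0.850, 0.805, 0.832, 0.794, 0.861, 0.823])

# Wilcoxon signed-rank test on the paired per-participant differences
stat, p = wilcoxon(natural, synthetic)
```

Because every participant here is more accurate with natural references, the signed-rank statistic is zero; with only six participants the smallest achievable two-sided p-value is 1/32.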

4.2 Natural images are more helpful across a broad range of layers

Next, we take a more fine-grained look at performance across different layers and branches of the Inception modules (see Fig. 4). Generally, feature map visualizations from lower layers show low-level features such as striped patterns, color or texture, whereas feature map visualizations from higher layers tend to show more high-level concepts like (parts of) objects (LeCun et al., 2015; Güçlü and van Gerven, 2015; Goodfellow et al., 2016). We find performance to be reasonably high across most layers and branches: participants are able to match both low-level and high-level patterns (despite not being explicitly told which layer a feature map belonged to). Again, natural images are mostly more helpful than synthetic images.

Figure 4: Performance is high across (A) a broad range of layers and (B) all branches of the Inception modules. The latter differ in their kernel sizes (1×1, 3×3, 5×5, pool). Again, natural images are (mostly) more helpful than synthetic images. Additional plots for the none condition as well as for Experiment II can be found in the Appendix in Fig. 17 and Fig. 18.

4.3 For expert and lay participants alike: natural images are more helpful

Explanation methods seek to explain aspects of algorithmic decision-making. Importantly, an explanation should not just be amenable to experts but to anyone affected by an algorithm’s decision. We here test whether the explanation method of feature visualization is equally applicable to expert and lay participants (see Fig. 5A). Contrary to our prior expectation, we find no significant differences between expert and lay performance (RM ANOVA; for details see Appendix Sec. A.2.2). Hence, extensive experience with CNNs is not necessary to perform well in this forward simulation task. In line with the previous main finding, experts and lay participants are both better in the natural than in the synthetic condition.

4.4 Even for hand-picked feature visualizations, performance is higher on natural images

Often, explanation methods are presented using carefully selected network units, raising the question whether author-chosen units are representative of the interpretability method as a whole. Olah et al. (2017) identify a number of particularly interpretable feature maps in Inception V1 in their appendix overview. We present either these hand-picked visualizations (all our hand-picked feature maps are taken from the pooling branch of the Inception module; as the appendix overview in Olah et al. (2017) does not contain one feature map for each of these, we select interpretable feature maps for the missing layers mixed5a and mixed5b ourselves) or randomly selected ones. Performance for hand-picked feature maps improves slightly (Fig. 5B); however, this performance difference is small and not significant for both natural (Wilcoxon test) and synthetic (Wilcoxon test) reference images (see Appendix Sec. A.2.3 for further analysis). Consistent with the findings reported above, performance is higher for natural than for synthetic reference images even on carefully selected hand-picked feature maps.

4.5 Additional information boosts performance, especially for natural images

Publications on feature visualizations vary in terms of how optimized images are presented: rarely, a single maximally activating image is shown (e.g. Erhan et al. (2009); Carter et al. (2019); Olah et al. (2018)); sometimes a few images are shown simultaneously (e.g. Yosinski et al. (2015); Nguyen et al. (2016b)); and on occasion both maximally and minimally activating images are shown in unison (Olah et al., 2017). Naturally, the question arises as to what influence (if any) these choices have, and whether there is an optimal way of presenting activating images. For this reason, we systematically compare approaches along two dimensions: the number of reference images (1 vs. 9) and the availability of minimally activating images (only Max vs. Min+Max). The results can be found in Fig. 5C. When just a single maximally activating image is presented (condition Max 1), natural images already outperform synthetic images, which reach only 65% accuracy. With additional information along either dimension, performance improves both for natural and for synthetic images. The stronger boost in performance, however, is observed for natural reference images. In fact, performance is higher for natural than for synthetic reference images in all four conditions. In the Min+Max 9 condition, a replication of the result from Experiment I shown in Fig. 3A, natural images now outperform synthetic images by an even larger margin.

Figure 5: We found no evidence for large effects of expert level or feature map selection. However, performance does improve with additional information. A: Expert level. Experts and lay participants perform equally well (RM ANOVA), and both are consistently better on natural than on synthetic images. B: Selection mode. There is no significant performance difference between hand-picked feature maps selected for interpretability and randomly selected ones (Wilcoxon tests for synthetic and for natural reference images). C: Presentation scheme. Presenting both maximally and minimally activating images simultaneously (Min+Max) and presenting nine instead of one single reference image tend to improve performance, especially for natural reference images. “ns” highlights non-significant differences.

4.6 Subjectively, interpretability of feature visualizations varies greatly

While our data suggests that feature visualizations are indeed helpful for humans in predicting CNN activations, we want to emphasize again that our design choices aim at an upper bound on their informativeness. Another important aspect of evaluating an explanation method is the subjective impression. Besides recording confidence ratings and reaction times, we collect judgments in intuitiveness trials (see Appendix Fig. 13) and oral impressions after the experiments. The former ask for ratings of how intuitive feature visualizations appear as explanations for natural images. As Fig. 6A+B show, participants perceive the intuitiveness of synthetic feature visualizations for strongly activating natural dataset images very differently. Further, the comparison of intuitiveness judgments before and after the main experiments reveals only a small significant average improvement for one out of three feature maps (see Fig. 6B+C; Wilcoxon test). The interactive conversations paint a similar picture: some synthetic feature visualizations are perceived as intuitive while others do not correspond to understandable concepts. Nonetheless, four participants report that their first “gut feeling” when interpreting these reference images (as one participant phrased it) is more reliable. A few participants point out that the synthetic visualizations are exhausting to understand. Three participants additionally emphasize that the minimally activating reference images played an important role in their decision-making.

Figure 6: The subjective intuitiveness of feature visualizations varies greatly (see A for the ratings from the beginning of Experiment I and B for the ratings at the beginning and end of Experiment II). The means over all subjects yield a neutral result, i.e. the visualizations are neither un- nor intuitive, and the improvement in subjective intuitiveness between before and after the experiment is only significant for one feature map (mixed4b). C: On average, participants found feature visualizations slightly more intuitive after doing the experiment, as indicated by differences larger than zero. In all three subfigures, gray dots and lines show data per participant.

5 Discussion & Conclusion

Feature visualizations such as synthetic maximally activating images are a widely used explanation method, but it is unclear whether they indeed help humans to understand CNNs. Using well-controlled psychophysical experiments with both expert and lay participants, we here conduct the very first investigation of intermediate synthetic feature visualizations: Can participants predict which of two query images leads to a strong activation in a feature map, given extremely activating visualizations? Specifically, we shed light on the following questions:

(1.) How much more informative are synthetic feature visualizations compared to a natural image baseline? We find above-chance performance given synthetic feature visualizations, but to our own surprise, synthetic feature visualizations are systematically less informative than the simple baseline of natural strongly activating images. Interestingly, many feature visualization methods employ regularization mechanisms to introduce more “natural structure” (Olah et al., 2017), sometimes even called a “natural image prior” (Mahendran and Vedaldi, 2015; Offert and Bell, 2020). This raises the question: are natural images maybe all you need? One might posit that highly-activating natural (reference) images simply appear more similar to other highly-activating natural (query) images. While that might be true, feature visualizations are meant to explain feature map activations for natural images, and this is ultimately what real-world applications of XAI are concerned with.

(2.) Do you need to be a CNN expert in order to understand feature visualizations? To the best of our knowledge, our study is the first to compare the performances of expert and lay people when evaluating explanation methods. Previously, publications either focused on only expert groups (Hase and Bansal, 2020; Kumarakulasinghe et al., 2020) or only laypeople (Schmidt and Biessmann, 2019; Alufaisan et al., 2020). Our experiment shows no significant difference between expert and lay participants in our task—both perform similarly well, and even better on natural images: a replication of our main finding. While a few caveats remain when moving an experiment from the well-controlled lab to a crowdsourcing platform (Haghiri et al., 2019), this suggests that future studies may not have to rely on selected expert participants, but may leverage larger lay participant pools.

(3.) Are hand-picked synthetic feature visualizations representative? An open question was whether the visualizations shown in publications represent the general interpretability of feature visualizations (a concern voiced by e.g. Kriegeskorte, 2015), even though they are hand-picked (Olah et al., 2017). Our finding that there is no large difference in performance between hand- and randomly-picked feature visualizations suggests that this aspect is minor.

(4.) What is the best way of presenting images? Existing work suggested that more than one example (Offert, 2017) and particularly negative examples (Kim et al., 2016) enhance human understanding of data distributions. Our systematic exploration of presentation schemes provides evidence that increasing the number of reference images as well as presenting both minimally and maximally activating reference images (as opposed to only maximally activating ones) improve human performance. This finding might be of interest to future studies aiming at peak performance or for developing software for understanding CNNs.

(5.) How do humans subjectively perceive feature visualizations? Apart from the high informativeness of explanations, another relevant question is how much trust humans have in them. In our experiment, we find that subjective impressions of how reasonable synthetic feature visualizations are for explaining responses to natural images vary greatly. This finding is in line with Hase and Bansal (2020) who evaluated explanation methods on text and tabular data.

Caveats. Despite our best intentions, a few caveats remain: the forward simulation paradigm is only one specific way to measure the informativeness of explanation methods. Further, we emphasize that all experimental design choices were made with the goal of measuring the best possible performance. As a consequence, our finding that synthetic reference images help humans predict a network’s strongly activating image may not necessarily be representative of a less optimal experimental set-up with e.g. query images corresponding to less extreme feature map activations. Knobs to further decrease or increase participant performance remain (e.g. hyper-parameter choices could be tuned per layer). Finally, while we explored one particular method in depth (Olah et al., 2017), it remains an open question whether the results can be replicated for other feature visualization methods.

Future directions. We see many promising future directions. For one, the current study uses query images from extreme opposite ends of a feature map’s activation spectrum. For a more fine-grained measure of informativeness, we will study query images that elicit more similar activations. Additionally, future participants could be provided with even more information, such as, for example, where a feature map is located in the network. Furthermore, it has been suggested that the combination of synthetic and natural reference images might provide synergistic information to participants (Olah et al., 2017), which could again be studied in our experimental paradigm. Finally, further studies could explore single neuron-centered feature visualizations, combinations of units as well as different network architectures.

Taken together, our results highlight the need for thorough human quantitative evaluations of feature visualizations and suggest that example natural images provide a surprisingly challenging baseline for understanding CNN activations.

Author Contributions

The initiative of investigating human predictability of CNN activations came from WB. JB, WB, MB and TSAW jointly combined it with the idea of investigating human interpretability of feature visualizations. JB led the project. JB, RSZ and JS jointly designed and implemented the experiments (with advice and feedback from RG, TSAW, MB and WB). The data analysis was performed by JB and RSZ (with advice and feedback from RG, TSAW, MB and WB). JB designed, and JB and JS implemented the pilot study. JB conducted the experiments (with help from JS). RSZ performed the statistical significance tests (with advice from TSAW and feedback from JB and RG). MB helped shape the bigger picture and initiated intuitiveness trials. WB provided day-to-day supervision. JB, RSZ and RG wrote the initial version of the manuscript. All authors contributed to the final version of the manuscript.


We thank Felix A. Wichmann and Isabel Valera for helpful discussions. We further thank Alexander Böttcher and Stefan Sietzen for support as well as helpful discussions on technical details. Additionally, we thank Chris Olah for clarifications via http://slack.distill.pub/. From our lab, we thank Matthias Kümmerer, Matthias Tangemann, Evgenia Rusak and Ori Press for helping to pilot our experiments, as well as Evgenia Rusak, Claudio Michaelis, Dylan Paiton and Matthias Kümmerer for feedback. And finally, we thank all our participants for taking part in our experiments.

We thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting JB, RZ and RG. We acknowledge support from the German Federal Ministry of Education and Research (BMBF) through the Competence Center for Machine Learning (TUE.AI, FKZ 01IS18039A) and the Bernstein Computational Neuroscience Program Tübingen (FKZ: 01GQ1002), the Cluster of Excellence Machine Learning: New Perspectives for Sciences (EXC2064/1), and the German Research Foundation (DFG; SFB 1233, Robust Vision: Inference Principles and Neural Mechanisms, TP3, project number 276693517).


  • M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng (2015) TensorFlow: large-scale machine learning on heterogeneous systems. Note: Software available from tensorflow.org External Links: Link Cited by: §A.1.2.
  • S. Addepalli, D. Tamboli, R. V. Babu, and B. Banerjee (2020) Saliency-driven class impressions for feature visualization of deep neural networks. arXiv preprint arXiv:2007.15861. Cited by: §1, §2.
  • J. Adebayo, J. Gilmer, M. Muelly, I. Goodfellow, M. Hardt, and B. Kim (2018) Sanity checks for saliency maps. In Advances in Neural Information Processing Systems, pp. 9505–9515. Cited by: §2.
  • Y. Alufaisan, L. R. Marusich, J. Z. Bakdash, Y. Zhou, and M. Kantarcioglu (2020) Does explainable artificial intelligence improve human decision-making?. arXiv preprint arXiv:2006.11194. Cited by: §2, §5.
  • S. Bach, A. Binder, G. Montavon, F. Klauschen, K. Müller, and W. Samek (2015) On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one 10 (7), pp. e0130140. Cited by: §2.
  • D. Bau, B. Zhou, A. Khosla, A. Oliva, and A. Torralba (2017) Network dissection: quantifying interpretability of deep visual representations. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6541–6549. Cited by: §1, footnote 6.
  • D. Bau, J. Zhu, H. Strobelt, A. Lapedriza, B. Zhou, and A. Torralba (2020) Understanding the role of individual units in a deep neural network. Proceedings of the National Academy of Sciences. Cited by: §1, §2.
  • F. Biessmann and D. I. Refiano (2019) A psychophysics approach for quantitative comparison of interpretable computer vision models. arXiv preprint arXiv:1912.05011. Cited by: §2.
  • S. A. Cadena, M. A. Weis, L. A. Gatys, M. Bethge, and A. S. Ecker (2018) Diverse feature visualizations reveal invariances in early layers of deep neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 217–232. Cited by: §1.
  • C. J. Cai, E. Reif, N. Hegde, J. Hipp, B. Kim, D. Smilkov, M. Wattenberg, F. Viegas, G. S. Corrado, M. C. Stumpe, et al. (2019) Human-centered tools for coping with imperfect algorithms during medical decision-making. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pp. 1–14. Cited by: §1.
  • N. Cammarata, G. Goh, S. Carter, L. Schubert, M. Petrov, and C. Olah (2020) Curve detectors. Distill. Note: https://distill.pub/2020/circuits/curve-detectors External Links: Document Cited by: §1.
  • S. Carter, Z. Armstrong, L. Schubert, I. Johnson, and C. Olah (2019) Activation atlas. Distill. Note: https://distill.pub/2019/activation-atlas External Links: Document Cited by: §1, §2, §4.5.
  • E. Chu, D. Roy, and J. Andreas (2020) Are visual explanations useful? a case study in model-in-the-loop prediction. arXiv preprint arXiv:2007.12248. Cited by: §2.
  • D. Collaris and J. J. van Wijk (2020) ExplainExplore: visual exploration of machine learning explanations. In 2020 IEEE Pacific Visualization Symposium (PacificVis), pp. 26–35. Cited by: §2.
  • J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei (2009) ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, Cited by: §A.1.2, §3.
  • F. Doshi-Velez and B. Kim (2017) Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608. Cited by: §A.1.1, §1, §2.
  • I. Ebert-Uphoff and K. Hilburn (2020) Evaluation, tuning and interpretation of neural networks for working with images in meteorological applications. Bulletin of the American Meteorological Society, pp. 1–49. Cited by: §1.
  • D. Erhan, Y. Bengio, A. Courville, and P. Vincent (2009) Visualizing higher-layer features of a deep network. University of Montreal 1341 (3), pp. 1. Cited by: §1, §4.5.
  • T. Fel and D. Vigouroux (2020) Representativity and consistency measures for deep neural network explanations. arXiv preprint arXiv:2009.04521. Cited by: §2.
  • R. Fong and A. Vedaldi (2018) Net2vec: quantifying and explaining how concepts are encoded by filters in deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8730–8738. Cited by: §1.
  • GDPR (2016) Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46. Official Journal of the European Union (OJ) 59 (1-88), pp. 294. Cited by: §1.
  • A. Ghorbani, A. Abid, and J. Zou (2019a) Interpretation of neural networks is fragile. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 3681–3688. Cited by: §2.
  • A. Ghorbani, J. Wexler, J. Y. Zou, and B. Kim (2019b) Towards automatic concept-based explanations. In Advances in Neural Information Processing Systems, pp. 9277–9286. Cited by: §2.
  • L. H. Gilpin, D. Bau, B. Z. Yuan, A. Bajwa, M. Specter, and L. Kagal (2018) Explaining explanations: an overview of interpretability of machine learning. In 2018 IEEE 5th International Conference on data science and advanced analytics (DSAA), pp. 80–89. Cited by: §1.
  • I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep learning. MIT Press. Note: http://www.deeplearningbook.org Cited by: §1, §4.2.
  • B. Goodman and S. Flaxman (2017) European Union regulations on algorithmic decision-making and a “right to explanation”. AI magazine 38 (3), pp. 50–57. Cited by: §1.
  • F. M. Graetz (2019) How to visualize convolutional features in 40 lines of code. External Links: Link Cited by: §1.
  • U. Güçlü and M. A. van Gerven (2015) Deep neural networks reveal a gradient in the complexity of neural representations across the ventral stream. Journal of Neuroscience 35 (27), pp. 10005–10014. Cited by: §1, §4.2.
  • S. Haghiri, P. Rubisch, R. Geirhos, F. Wichmann, and U. von Luxburg (2019) Comparison-based framework for psychophysics: lab versus crowdsourcing. arXiv preprint arXiv:1905.07234. Cited by: §5.
  • P. Hase and M. Bansal (2020) Evaluating explainable ai: which algorithmic explanations help users predict model behavior?. arXiv preprint arXiv:2005.01831. Cited by: §5, §5.
  • F. Hohman, H. Park, C. Robinson, and D. H. P. Chau (2019) Summit: scaling deep learning interpretability by visualizing activation and attribution summarizations. IEEE transactions on visualization and computer graphics 26 (1), pp. 1096–1106. Cited by: §1, §2.
  • S. Hooker, D. Erhan, P. Kindermans, and B. Kim (2019) A benchmark for interpretability methods in deep neural networks. In Advances in Neural Information Processing Systems, pp. 9737–9748. Cited by: §2.
  • A. Ilyas, S. Santurkar, D. Tsipras, L. Engstrom, B. Tran, and A. Madry (2019) Adversarial examples are not bugs, they are features. In Advances in Neural Information Processing Systems, pp. 125–136. Cited by: §1.
  • JASP Team (2020) JASP (Version 0.13.1). External Links: Link Cited by: §A.1.3.
  • A. Karpathy, J. Johnson, and L. Fei-Fei (2015) Visualizing and understanding recurrent networks. arXiv preprint arXiv:1506.02078. Cited by: §1.
  • B. Kim, R. Khanna, and O. O. Koyejo (2016) Examples are not enough, learn to criticize! criticism for interpretability. In Advances in neural information processing systems, pp. 2280–2288. Cited by: §5.
  • B. Kim, M. Wattenberg, J. Gilmer, C. Cai, J. Wexler, F. Viegas, et al. (2018) Interpretability beyond feature attribution: quantitative testing with concept activation vectors (tcav). In International conference on machine learning, pp. 2668–2677. Cited by: §1, §2.
  • P. Kindermans, S. Hooker, J. Adebayo, M. Alber, K. T. Schütt, S. Dähne, D. Erhan, and B. Kim (2017) The (un) reliability of saliency methods. arXiv preprint arXiv:1711.00867. Cited by: §2.
  • N. Kriegeskorte (2015) Deep neural networks: a new framework for modeling biological vision and brain information processing. Annual review of vision science 1, pp. 417–446. Cited by: §1, §5.
  • J. Kröll, S. B. Eickhoff, F. Hoffstaedter, and K. R. Patil (2020) Evolving complex yet interpretable representations: application to alzheimer’s diagnosis and prognosis. In 2020 IEEE Congress on Evolutionary Computation (CEC), pp. 1–8. Cited by: §1.
  • N. B. Kumarakulasinghe, T. Blomberg, J. Liu, A. S. Leao, and P. Papapetrou (2020) Evaluating local interpretable model-agnostic explanations on clinical machine learning classification models. In 2020 IEEE 33rd International Symposium on Computer-Based Medical Systems (CBMS), pp. 7–12. Cited by: §2, §5.
  • S. Lapuschkin, S. Wäldchen, A. Binder, G. Montavon, W. Samek, and K. Müller (2019) Unmasking clever hans predictors and assessing what machines really learn. Nature communications 10 (1), pp. 1–8. Cited by: §2.
  • Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. nature 521 (7553), pp. 436–444. Cited by: §1, §4.2.
  • Y. Lin, W. Lee, and Z. B. Celik (2020) What do you see? evaluation of explainable artificial intelligence (xai) interpretability through neural backdoors. arXiv preprint arXiv:2009.10639. Cited by: §2.
  • Z. C. Lipton (2018) The mythos of model interpretability. Queue 16 (3), pp. 31–57. Cited by: §1.
  • A. Mahendran and A. Vedaldi (2015) Understanding deep image representations by inverting them. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5188–5196. Cited by: §1, §5.
  • T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119. Cited by: §1.
  • T. Miller (2019) Explanation in artificial intelligence: insights from the social sciences. Artificial Intelligence 267, pp. 1–38. Cited by: §1, §2.
  • G. Montavon, W. Samek, and K. Müller (2018) Methods for interpreting and understanding deep neural networks. Digital Signal Processing 73, pp. 1–15. Cited by: §2.
  • A. S. Morcos, D. G. Barrett, N. C. Rabinowitz, and M. Botvinick (2018) On the importance of single directions for generalization. arXiv preprint arXiv:1803.06959. Cited by: §1.
  • A. Mordvintsev, C. Olah, and M. Tyka (2015) Inceptionism: going deeper into neural networks. Cited by: §1.
  • W. J. Murdoch, C. Singh, K. Kumbier, R. Abbasi-Asl, and B. Yu (2019) Interpretable machine learning: definitions, methods, and applications. arXiv preprint arXiv:1901.04592. Cited by: §1.
  • A. Nguyen and M. R. Martínez (2020) On quantitative aspects of model interpretability. arXiv preprint arXiv:2007.07584. Cited by: §2.
  • A. Nguyen, J. Clune, Y. Bengio, A. Dosovitskiy, and J. Yosinski (2017) Plug & play generative networks: conditional iterative generation of images in latent space. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4467–4477. Cited by: §1.
  • A. Nguyen, A. Dosovitskiy, J. Yosinski, T. Brox, and J. Clune (2016a) Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. In Advances in neural information processing systems, pp. 3387–3395. Cited by: §1.
  • A. Nguyen, J. Yosinski, and J. Clune (2015) Deep neural networks are easily fooled: high confidence predictions for unrecognizable images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 427–436. Cited by: §1.
  • A. Nguyen, J. Yosinski, and J. Clune (2016b) Multifaceted feature visualization: uncovering the different types of features learned by each neuron in deep neural networks. arXiv preprint arXiv:1602.03616. Cited by: §4.5.
  • F. Offert and P. Bell (2020) Perceptual bias and technical metapictures: critical machine vision as a humanities challenge. AI & SOCIETY, pp. 1–12. Cited by: §5.
  • F. Offert (2017) “I know it when I see it”. Visualization and intuitive interpretability. arXiv preprint arXiv:1711.08042. Cited by: §5.
  • C. Olah, N. Cammarata, L. Schubert, G. Goh, M. Petrov, and S. Carter (2020a) An overview of early vision in inceptionv1. Distill. Note: https://distill.pub/2020/circuits/early-vision External Links: Document Cited by: §A.1.2, §1, §3.
  • C. Olah, N. Cammarata, L. Schubert, G. Goh, M. Petrov, and S. Carter (2020b) Zoom in: an introduction to circuits. Distill. Note: https://distill.pub/2020/circuits/zoom-in External Links: Document Cited by: §A.1.2, §1, §1, §2, §3, footnote 6.
  • C. Olah, A. Mordvintsev, and L. Schubert (2017) Feature visualization. Distill 2 (11), pp. e7. Cited by: §A.1.1, §A.1.1, §A.1.2, §A.1.2, Exemplary natural images explain CNN activations better than feature visualizations , Figure 1, §1, §1, §3, §4.1, §4.4, §4.5, §5, §5, §5, §5, footnote 4.
  • C. Olah, A. Satyanarayan, I. Johnson, S. Carter, L. Schubert, K. Ye, and A. Mordvintsev (2018) The building blocks of interpretability. Distill 3 (3), pp. e10. Cited by: §A.1.2, §1, §1, §3, §4.5, footnote 6.
  • OpenAI (2020) OpenAI Microscope. Note: https://microscope.openai.com/models(Accessed on 09/12/2020) Cited by: §2, footnote 6.
  • J. Peirce, J. R. Gray, S. Simpson, M. MacAskill, R. Höchenberger, H. Sogo, E. Kastman, and J. K. Lindeløv (2019) PsychoPy2: experiments in behavior made easy. Behavior research methods 51 (1), pp. 195–203. Cited by: §A.1.1.
  • A. Radford, R. Jozefowicz, and I. Sutskever (2017) Learning to generate reviews and discovering sentiment. arXiv preprint arXiv:1704.01444. Cited by: §1.
  • M. T. Ribeiro, S. Singh, and C. Guestrin (2016) ”Why should i trust you?” explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 1135–1144. Cited by: §1.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252. External Links: Document Cited by: §A.1.2, §A.1.2, §3.
  • W. Samek, G. Montavon, S. Lapuschkin, C. J. Anders, and K. Müller (2020) Toward interpretable machine learning: transparent deep neural networks and beyond. arXiv preprint arXiv:2003.07631. Cited by: §1, §2.
  • P. Schmidt and F. Biessmann (2019) Quantifying interpretability and trust in machine learning systems. arXiv preprint arXiv:1901.08558. Cited by: §5.
  • R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2017) Grad-cam: visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pp. 618–626. Cited by: §1.
  • H. Shen and T. Huang (2020) How useful are the machine-generated interpretations to general users? a human evaluation on guessing the incorrectly predicted labels. arXiv preprint arXiv:2008.11721. Cited by: §2.
  • R. Shi, T. Li, and Y. Yamaguchi (2020) Group visualization of class-discriminative features. Neural Networks. Cited by: §2.
  • D. Smilkov, N. Thorat, B. Kim, F. Viégas, and M. Wattenberg (2017) Smoothgrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825. Cited by: §2.
  • J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller (2014) Striving for simplicity: the all convolutional net. arXiv preprint arXiv:1412.6806. Cited by: §2.
  • C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9. Cited by: §A.1.2, §3.
  • C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2013) Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. Cited by: §1.
  • E. Tjoa and C. Guan (2020) Quantifying explainability of saliency methods in deep neural networks. arXiv preprint arXiv:2009.02899. Cited by: §2.
  • J. Tritscher, M. Ring, D. Schlr, L. Hettinger, and A. Hotho (2020) Evaluation of post-hoc xai approaches through synthetic tabular data. In International Symposium on Methodologies for Intelligent Systems, pp. 422–430. Cited by: §2.
  • Z. J. Wang, R. Turko, O. Shaikh, H. Park, N. Das, F. Hohman, M. Kahng, and D. H. Chau (2020) CNN explainer: learning convolutional neural networks with interactive visualization. arXiv preprint arXiv:2004.15004. Cited by: §2.
  • J. Yosinski, J. Clune, A. Nguyen, T. Fuchs, and H. Lipson (2015) Understanding neural networks through deep visualization. arXiv preprint arXiv:1506.06579. Cited by: §4.5.
  • M. D. Zeiler and R. Fergus (2014) Visualizing and understanding convolutional networks. In European conference on computer vision, pp. 818–833. Cited by: §1.
  • Q. Zhang and S. Zhu (2018) Visual interpretability for deep learning: a survey. Frontiers of Information Technology & Electronic Engineering 19 (1), pp. 27–39. Cited by: §2.
  • B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba (2014) Object detectors emerge in deep scene cnns. arXiv preprint arXiv:1412.6856. Cited by: §1.
  • L. M. Zintgraf, T. S. Cohen, T. Adel, and M. Welling (2017) Visualizing deep neural network decisions: prediction difference analysis. arXiv preprint arXiv:1702.04595. Cited by: §2.

Appendix A Appendix

a.1 Details on methods

a.1.1 Human Experiments

In our two human psychophysical studies, we ask humans to predict a feature map’s maximally activating image (“forward simulation task” (Doshi-Velez and Kim, 2017)). Answers to the two-alternative forced choice paradigm are recorded together with the participants’ confidence level (1: not confident, 2: somewhat confident, 3: very confident, see Fig. 7). Time per trial is unlimited and we record reaction time. After each trial, feedback is given (see Fig. 7). A progress bar at the bottom of the screen indicates how many trials of a block are already completed. As reference images, either synthetic, natural or no reference images are given. Trials of different reference image types are arranged in blocks. Synthetic and natural reference images are alternated, and, in the case of Experiment I, framed by trials without reference images (see Fig. 8A, B). The order of the reference image types is counter-balanced across subjects.

(a) Screen at the beginning of a trial. The question is which of the two natural images at the center of the screen also strongly activates the CNN feature map given the reference images on the sides.
(b) Screen including a subject’s answer visualized by black boxes around the image and the confidence level. A subject indicates which natural image in the center would also be a strongly activating image by clicking on the number corresponding to his/her confidence level (1: not confident, 2: somewhat confident, 3: confident). The time until a participant selects an answer is recorded (“reaction time”).
(c) Screen including a subject’s answer (black boxes) and feedback on which image is indeed also a strongly activating image (green box).
Figure 7: Forward Simulation Task. The progress bar at the bottom of the screen indicates the progress within one block of trials.

The main trials in the experiments are complemented by practice, catch and intuitiveness trials. To avoid learning effects, we use different feature maps for each trial type per participant. Specifically, practice trials give participants time to familiarize themselves with the task. In order to monitor the attention of participants, catch trials appear throughout blocks of main trials. Here, the query images are a copy of one of the reference images, i.e., there is an obvious correct answer (see Fig. 14). This control mechanism allows us to decide whether trial blocks should be excluded from the analysis due to, e.g., fatigue. To obtain the participant’s subjective impression of the helpfulness of maximally activating images, the experiments are preceded (and, in the case of Experiment II, also followed) by three intuitiveness trials (see Fig. 13). Here, participants judge in a slightly different task design how intuitive they consider the synthetic stimuli for the natural stimuli. For more details on the intuitiveness trials, see below.

At the end of the experiment, all expert participants in Experiment I and all lay (but not expert) participants in Experiment II are asked about their strategy and whether it changed over time. The information gained from the first group allows us to understand the variety of cues used and paves the way to identifying interesting directions for follow-up experiments. The information gained from the second group allows comparisons to the experts’ impressions reported in Experiment I.

Experiment I

The first experiment focuses on comparing performance of synthetic images to two baselines: natural reference images and no reference images (see Fig. 8A). Screenshots of trials are shown in Fig. 11. In total, 45 feature maps are tested: 36 of these are uniformly sampled from the feature maps of each of the four branches of each of the nine Inception modules. The other nine feature maps are hand-picked for interpretability (one per Inception module) from the modules’ pooling branch, based on the appendix overview selection provided by Olah et al. (2017) or based on our own choices. In the spirit of a general statement about the explainability method, different participants see different natural reference and query images, and each participant sees different natural query images for the same feature maps in different reference conditions. To check the consistency of participants’ responses, we repeat six randomly chosen main trials for each of the three tested reference image types at the end of the experiment.

Experiment II

The second experiment (see Fig. 8B) tests expert vs. lay participants and compares different presentation schemes (Max 1, Min+Max 1, Max 9 and Min+Max 9, see Fig. 8E). (In pilot experiments, we learned that participants preferred 9 over 4 reference images, hence the “default” choice of 9 in Experiment I.) Screenshots of trials are shown in Fig. 12. In total, 80 feature maps are tested: They are uniformly sampled from every second layer with an Inception module of the network (hence a total of 5 instead of 9 layers), and from all four branches of the Inception modules. Given the focus on four different presentation schemes in this experiment, we repeat the sampling method four times without overlap. In terms of reference image types, only synthetic and natural images are tested. As in Experiment I, different participants see different natural reference and query images. However, expert and lay participants see the same images. For details on the counter-balancing of all conditions, please refer to Tab. 1.

Figure 8: Detailed structure of the two experiments with different foci. A: Experiment I. Here, the focus is on comparing performance of synthetic and natural reference images to the most simple baseline: no reference images (“None”). To counter-balance conditions, the order of natural and synthetic blocks is alternated across participants. For each of the three reference image types (synthetic, natural and none), 45 relevant trials are used plus additional catch, practice and repeated trials. B: Experiment II. Here, the focus is on testing expert and lay participants as well as comparing different presentation schemes (Max 1, Min+Max 1, Max 9 and Min+Max 9, see E for illustrations). Both the order of natural and synthetic blocks as well as the four presentation conditions are counter-balanced across subjects. To maintain a reasonable experiment length for each participant, only 20 relevant trials are used per reference image type and presentation scheme, plus additional catch and practice trials. C: Legend. D: Number of trials per block type (i.e. reference image type and main vs. practice trial) and experiment. E: Illustration of presentation schemes. In Experiment II, all four schemes are tested, in Experiment I only Min+Max 9 is tested.
Intuitiveness Trials

In order to obtain the participants’ subjective impression of the helpfulness of maximally activating images, we add trials at the beginning of the experiments (and also at the end of Experiment II). The task set-up is slightly different (see Fig. 13): Only maximally activating (i.e. no minimally activating) images are shown. We ask participants to rate how intuitive they find the explanation of the entirety of the synthetic images for the entirety of the natural images. Again, all images presented in one trial are specific to one feature map. By moving a slider to the right (left), participants judge the explanation method as intuitive (not intuitive). The ratings are recorded on a continuous scale from “not intuitive” to “intuitive”. All participants see the same three trials in a randomized order. The trials are again taken from the hand-picked (i.e. interpretable) feature maps of the appendix overview in Olah et al. (2017). In theory, this again allows for the highest intuitiveness ratings possible. The specific feature maps are from a low, intermediate and high layer: feature map 43 of mixed3a, feature map 504 of mixed4b and feature map 17 of mixed5b.


Our two experiments are within-subject studies, meaning that every participant answers trials for all conditions. This design choice allows us to test fewer participants. In Experiment I, expert participants take part ( male, female, age: years, SD = ). In Experiment II, participants take part (of which are experts; male, female, age: years, SD = ). Expert participants qualify by familiarity with convolutional neural networks. All subjects are naive with respect to the aim of the study. Expert (lay) participants are paid 15€ (10€) per hour for participation. Before the experiment, all subjects give written informed consent to participate. All subjects have normal or corrected-to-normal vision. All procedures conform to Standard 8 of the American Psychological Association’s “Ethical Principles of Psychologists and Code of Conduct” (2016). Before the experiment, the first author explains the task to each participant and ensures complete understanding. For lay participants, the explanation is simplified: Maximally (minimally) activating images are called “favorite images” (“non-favorite images”) of a “computer program”, and the question is explained as which of the two query images would also be a “favorite” image to the computer program.


Stimuli are displayed on a VIEWPixx 3D LCD (VPIXX Technologies; spatial resolution px, temporal resolution ). Outside the stimulus image, the monitor is set to mean gray. Participants view the display from (maintained via a chinrest) in a darkened chamber. At this distance, pixels subtend approximately degrees on average ( per degree of visual angle). Stimulus presentation and data collection is controlled via a desktop computer (Intel Core i5-4460 CPU, AMD Radeon R9 380 GPU) running Ubuntu Linux (16.04 LTS), using PsychoPy (Peirce et al., 2019, version 3.0) under Python 3.6.
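The stated pixels-per-degree figure follows from standard viewing geometry, although the exact viewing distance and display dimensions are not preserved in this text. The sketch below therefore uses assumed example values (60 cm viewing distance, a 52 cm wide display at 1920 px); `pixels_per_degree` is a hypothetical helper for illustration, not code from the study.

```python
import math

def pixels_per_degree(viewing_distance_cm, pixel_pitch_cm):
    """Pixels per degree of visual angle for a flat screen viewed head-on.

    One degree of visual angle spans 2 * d * tan(0.5 deg) centimetres
    on the screen at viewing distance d.
    """
    cm_per_degree = 2.0 * viewing_distance_cm * math.tan(math.radians(0.5))
    return cm_per_degree / pixel_pitch_cm

# Assumed example values (not from the paper): 60 cm viewing distance,
# 52 cm wide panel at 1920 px -> pixel pitch of roughly 0.027 cm.
ppd = pixels_per_degree(60.0, 52.0 / 1920.0)
```

With these assumed values the display resolves roughly 39 pixels per degree; doubling the viewing distance doubles the pixels-per-degree figure.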

a.1.2 Stimuli Selection


Following the existing work on feature visualization (Olah et al., 2017, 2018, 2020b, 2020a), we use an Inception V1 network (Szegedy et al., 2015) trained on ImageNet (Deng et al., 2009; Russakovsky et al., 2015). (This network is considered very interpretable (Olah et al., 2018), yet other work also finds deeper networks more interpretable (Bau et al., 2017). More recent work, again, suggests that “analogous features […] form across models […],” i.e. that interpretable feature visualizations appear “universally” for different CNNs (Olah et al., 2020b; OpenAI, 2020).) Note that the Inception V1 network used in previous work slightly deviates from the original network architecture: The branch of Inception module mixed4a only holds instead of feature maps. To stay as close as possible to the aforementioned work, we also use their implementation and trained weights of the network (github.com/tensorflow/lucid/tree/v0.3.8/lucid). We investigate feature visualizations for all branches (i.e. kernel sizes) of the Inception modules and sample from layers mixed3a to mixed5b before the ReLU non-linearity.

Synthetic Images from Feature Visualization

The synthetic images throughout this study are the optimization results of the feature visualization method from Olah et al. (2017). We use the channel objective to find synthetic stimuli that maximally (minimally) activate the spatial mean of a given feature map of the network. We perform the optimization using lucid 0.3.8 and TensorFlow 1.15.0 (Abadi et al., 2015) and use the hyperparameters as specified in Olah et al. (2017). For the experimental conditions with more than one minimally/maximally activating reference image, we add a diversity regularization across the samples. In hindsight, we realized that we generated synthetic images in Experiment I, even though we only needed and used per feature map.
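Conceptually, this optimization is gradient ascent on the input pixels to maximize a feature map’s spatial mean. The toy sketch below illustrates only that principle, using a hand-coded edge filter and finite-difference gradients; it is a minimal stand-in, not the lucid/TensorFlow channel-objective optimization used in the study, and the names `FILTER`, `activation` and `maximize` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for one CNN feature map: mean ReLU response of a fixed
# 3x3 vertical-edge filter slid over a small grayscale image.
FILTER = np.array([[-1., 0., 1.],
                   [-2., 0., 2.],
                   [-1., 0., 1.]])

def activation(img):
    """Spatial mean of the ReLU'd filter responses (one scalar)."""
    h, w = img.shape
    responses = [max(0.0, float(np.sum(img[i:i + 3, j:j + 3] * FILTER)))
                 for i in range(h - 2) for j in range(w - 2)]
    return float(np.mean(responses))

def maximize(img, steps=80, lr=0.5, eps=1e-4):
    """Gradient ascent on the pixels via finite-difference gradients.
    Descending instead would yield minimally activating images."""
    img = img.copy()
    for _ in range(steps):
        grad = np.zeros_like(img)
        for idx in np.ndindex(img.shape):
            bump = np.zeros_like(img)
            bump[idx] = eps
            grad[idx] = (activation(img + bump) - activation(img - bump)) / (2 * eps)
        img = np.clip(img + lr * grad, 0.0, 1.0)  # keep pixels in [0, 1]
    return img

start = rng.uniform(0.4, 0.6, size=(6, 6))
synth = maximize(start)
```

The resulting `synth` drives the toy feature map much more strongly than the random start, mirroring how maximally activating reference images are produced at scale.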

Selection of natural images

The natural stimuli are selected from the validation set of the ImageNet ILSVRC 2012 (Russakovsky et al., 2015) dataset. To choose the maximally (minimally) activating natural stimuli for a given feature map, we perform three steps, which are illustrated in Fig. 9 and explained in the following: First, we calculate the activation of said feature map for all pre-processed images (resizing to pixels, cropping centrally to pixels and normalizing) and take the spatial average to get a scalar representing the excitability of the given feature map caused by the crop. Second, we order the stimuli according to the collected activation values and select the maximally (respectively minimally) activating images. Here, corresponds to the number of reference images used (either 1 or 9, see Fig. 8E) and determines the maximum number of participants we can test with our setup. Third, we distribute the selected stimuli into bins. Lastly, we create batches of data by randomly choosing one image from each of the bins for every batch.

The reasons for creating several batches of extremely activating natural images are two-fold: (1) We want to get a general impression of the interpretability method and would like to reduce the dependence on single images, and (2) in Experiment I, a participant has to see different query images in the three different reference conditions. A downside of this design choice is an increase in variability. The precise allocation was done as follows: In Experiment I, the natural query images of the none condition were always allocated the batch with , the query and reference images of the natural condition were allocated the batch with , and the natural query images of the synthetic condition were allocated the batch with . The allocation scheme in Experiment II can be found in Table 1.

Figure 9: Sampling of natural images. A: Distribution of activations. For an example channel (mixed3a, kernel size , feature map ), the smoothed distribution of activations for all ImageNet validation images is plotted. The natural stimuli for the experiment are taken from the tails of the distribution (shaded background). B: Zoomed-in tail of the activation distribution. In the presentation schemes with 9 reference images, 10 bins are created (10 because of 9 reference plus 1 query image). C: In order to obtain batches with 10 images each, the images from one bin are randomly distributed to the batches. This guarantees that each batch contains a fair selection of extremely activating images. The query images are always sampled from the most extreme bins in order to give the best signal possible. In the case of the presentation schemes with 1 reference image, the number of bins in B is reduced to 2 and the number of images per batch in C is also reduced to 2.
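The bin-and-batch procedure can be sketched as follows. This is a simplified illustration on made-up activation values, assuming 10 bins and 4 batches; `select_batches` is a hypothetical helper, not the authors' code.

```python
import numpy as np

def select_batches(activations, n_bins, n_batches, rng):
    """Pick the n_bins * n_batches most strongly activating images and
    distribute them into batches so every batch spans the activation range.

    activations: 1-D array of spatially averaged feature-map activations,
    one scalar per dataset image (illustrative stand-in for ImageNet).
    """
    order = np.argsort(activations)[::-1]        # most activating first
    top = order[: n_bins * n_batches]
    bins = top.reshape(n_bins, n_batches)        # bin 0 = most extreme images
    batches = np.empty((n_batches, n_bins), dtype=int)
    for b, images in enumerate(bins):
        perm = rng.permutation(n_batches)
        batches[perm, b] = images                # one image from each bin per batch
    return batches

rng = np.random.default_rng(0)
acts = rng.normal(size=1000)                     # made-up activations
batches = select_batches(acts, n_bins=10, n_batches=4, rng=rng)
```

Each resulting batch contains one image per bin, so no batch is dominated by slightly-less-extreme images; in the paper's scheme, query images are additionally drawn only from the most extreme bins.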
Subject   Order of presentation schemes (0-3)   Batches of natural and   Order of synthetic and
          and batch blocks (A-D)                synthetic images         natural (practice - main)
1         0 (A)  1 (B)  2 (C)  3 (D)            natural: 1               natural - synthetic
2         0 (B)  2 (D)  1 (C)  3 (A)            synthetic: 2
3         3 (B)  1 (D)  2 (A)  0 (C)
4         3 (C)  2 (B)  1 (A)  0 (D)
5         see subjects 1-4                      natural: 3               synthetic - natural
                                                synthetic: 4
9         see subjects 1-4                      natural: 5               natural - synthetic
                                                synthetic: 6
          see subjects 1-4                      natural: 7               synthetic - natural
                                                synthetic: 8
Table 1: Counter-balancing of conditions in Experiment II. In total, 13 naive and 10 lay participants are tested. Each batch block contains 20 feature maps (sampled from five layers and all Inception module branches). The batch numbers indicate which batch the natural query (and reference) images are taken from.
Selection of Feature Maps

The selection of feature maps used in Experiment I is shown in Table 2; the selection of feature maps used in Experiment II is shown in Table 3.

Layer     Branch   Feature Map
mixed3a            25
          Pool     227
          Pool     230
mixed3b            64
          Pool     430
          Pool     462
mixed4a            68
          Pool     486
          Pool     501
mixed4b            45
          Pool     491
          Pool     465
mixed4c            94
          Pool     496
          Pool     449
mixed4d            95
          Pool     483
          Pool     516
mixed4e            231
          Pool     816
          Pool     809
mixed5a            229
          Pool     743
          Pool     720
mixed5b            119
          Pool     1007
          Pool     946
Table 2: Feature maps analyzed in Experiment I. For each of the 9 layers with an Inception module, one randomly chosen feature map per branch (1×1, 3×3, 5×5 and pool) and one additional hand-picked feature map are used.
Layer     Branch   Feature map per batch block
                   A      B      C      D
mixed3a   1×1      25     14     12     53
          3×3      189    97     171    106
          5×5      197    203    212    204
          Pool     227    238    232    247
mixed4a   1×1      68     33     45     17
          3×3      257    355    321    200
          5×5      427    425    429    423
          Pool     486    497    478    506
mixed4c   1×1      94     53     59     95
          3×3      247    237    357    209
          5×5      432    402    400    416
          Pool     496    498    473    497
mixed4e   1×1      231    83     6      89
          3×3      524    323    401    373
          5×5      656    624    642    620
          Pool     816    755    724    783
mixed5b   1×1      119    14     266    300
          3×3      684    592    657    481
          5×5      844    829    839    875
          Pool     1007   913    927    903
Table 3: Feature maps analyzed in Experiment II. Four sets of feature maps (batch blocks A to D) are sampled: For every second layer with an Inception module (5 layers in total), one feature map is randomly selected per branch of the Inception module (1×1, 3×3, 5×5 and pool). For the practice, catch and intuitiveness trials, additional randomly chosen feature maps are used.
Different activation magnitudes.

We note that the activations elicited by synthetic images are almost always about one order of magnitude larger than the activations elicited by natural images (see Fig. 10). This constitutes an inherent difference between the synthetic and the natural reference image condition.

Figure 10: Mean activations and standard deviations (not two standard errors of the mean!) of the minimally and maximally activating synthetic and natural images used in Experiment I. Note that the number of synthetic images differs from the number of natural images, as the natural images are spread over several batches. Please also note that the standard deviations for the selected natural images are so small that they are invisible.

a.1.3 Data Analysis

Significance Tests

All significance tests are performed with JASP (JASP Team, 2020, version 0.13.1). For the analysis of the distribution of confidence ratings (see Fig. 3B), we use contingency tables with χ²-tests. For testing pairwise effects in accuracy, confidence, reaction time and intuitiveness data, we report Wilcoxon signed-rank tests with uncorrected p-values (Bonferroni-corrected critical alpha values for the family-wise alpha level are reported in all figures where relevant). These non-parametric tests are preferred for these data because they do not make distributional assumptions such as normally distributed errors (as, e.g., paired t-tests do). For testing marginal effects (main effects of one factor marginalizing over another), we report results from repeated-measures ANOVA (RM ANOVA), which does assume normality.
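As an illustration of this testing procedure, the sketch below runs a Wilcoxon signed-rank test with SciPy rather than JASP. The per-participant accuracies, the number of pairwise comparisons m, and the family-wise alpha of 0.05 are made-up values for illustration only:

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-participant accuracies in two paired conditions
# (values invented for illustration, not taken from the paper).
natural   = np.array([0.95, 0.90, 0.93, 0.88, 0.95, 0.93, 0.89, 0.91, 0.91, 0.94])
synthetic = np.array([0.84, 0.80, 0.85, 0.79, 0.88, 0.81, 0.83, 0.78, 0.86, 0.80])

# Paired, non-parametric test: no assumption of normally distributed errors.
stat, p = wilcoxon(natural, synthetic)

# Bonferroni correction: with m pairwise comparisons, compare the
# uncorrected p-value against alpha / m (family-wise alpha 0.05 assumed).
m = 3
alpha_corrected = 0.05 / m
significant = p < alpha_corrected
```

Marginal effects would instead be tested with a repeated-measures ANOVA, which does assume normality.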

Figure 11: Experiment I: Example trials of the three reference image conditions: synthetic reference images (first row), natural reference images (second row) or no reference images (third row). The query images in the center are always natural images.
Figure 12: Experiment II: Example trials of the four presentation schemes: Max 1, Min+max 1, Max 9, Min+Max 9. The left column contains synthetic reference images, the right column contains natural reference images.
Figure 13: Trials for intuitiveness judgment. The tested feature maps are from layers mixed3a (channel 43), mixed4b (channel 504) and mixed5b (channel 17). They are the same in Experiment I and Experiment II.
Figure 14: Catch trials. One of the reference images is copied as a query image, which makes the correct answer obvious. The purpose of these trials is to integrate a mechanism into the experiment that allows us to check post-hoc whether a participant was still paying attention.

a.2 Details on results

a.2.1 Complementing figures for main results

Figures 15-19 complement the results and figures presented in Section 4; here, all experimental conditions are shown.

(a) Performance.
(b) Confidence ratings on correctly answered trials.
(c) Confidence ratings on incorrectly answered trials.
(d) Confidence ratings on all trials.    
(e) Reaction time on correctly answered trials.
(f) Reaction time on incorrectly answered trials.
(g) Reaction time on all trials.    
Figure 15: Task performance (a), distribution of confidence ratings (b-d) and reaction times (e-g) of Experiment I. The p-values are calculated with Wilcoxon signed-rank tests. Note that unlike in the main paper, these figures consistently include the “None” condition. For explanations, see Sec. 4.1.
(a) Performance.
(b) Confidence ratings on correctly answered trials.
(c) Confidence ratings on incorrectly answered trials.
(d) Confidence ratings on all trials.
(e) Reaction time on correctly answered trials.
(f) Reaction time on incorrectly answered trials.
(g) Reaction time on all trials.
Figure 16: Task performance (a), distribution of confidence ratings (b-d) and reaction times (e-g) of Experiment II, averaged over expert level and presentation schemes. The p-values are calculated with Wilcoxon signed-rank tests. The results replicate our findings of Experiment I; for explanations of the latter, see Sec. 4.1.
(a) Performance across layers.
(b) Performance across branches.
Figure 17: High performance across (a) layers and (b) branches of the Inception modules in Experiment I. Note that unlike in the main paper, these figures consistently include the “None” condition. For explanations, see Sec. 4.2.
(a) Performance across layers.
(b) Performance across branches in Inception module.
Figure 18: High performance across (a) layers and (b) branches of the Inception modules in Experiment II. Note that only every second layer is tested here (unlike in Experiment I). The results replicate our findings of Experiment I. For explanations, see Sec. 4.2.

a.2.2 Details on performance of expert and lay participants

As reported in the main body of the paper, a mixed-effects ANOVA revealed no significant main effect of expert level (between-subjects effect). Further, there is no significant interaction between expert level and the reference image type, and both expert and lay participants show a significant main effect of the reference image type.

a.2.3 Details on performance of hand- and randomly-picked feature maps

As described in the main body of the paper, pairwise Wilcoxon signed-rank tests reveal no significant differences between hand-picked and randomly selected feature maps within each reference image type (neither for natural nor for synthetic reference images). However, marginalizing over reference image type using a repeated-measures ANOVA reveals a significant main effect of the feature map selection mode. Therefore, while hand-picking feature maps may have an effect, our data indicate that this effect, if present, is small.

a.2.4 Repeated trials

To check the consistency of participants’ responses, we repeat six main trials for each of the three tested reference image types at the end of the experiment. Specifically, the six trials correspond to the three highest and three lowest absolute confidence ratings. Results are shown in Fig. 19. Consistency is high for both the synthetic and natural reference image types, and moderate for no reference images (see Fig. 19A). In absolute terms, the largest increase in performance upon repetition occurs for the none condition; for natural reference images, there is also a small increase; for synthetic reference images, there is a slight decrease (see Fig. 19B and C). In the question session after the experiment, many participants reported remembering the repeated trials from their first presentation.

(a) Proportion of trials that were answered the same upon repetition.
(b) Performance for repeated trials upon repetition.
(c) Performance for repeated trials when first shown.
Figure 19: Repeated trials in Experiment I.
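The quantities plotted in Fig. 19 can be computed as in the following sketch (the function name and the toy answer vectors are ours, added for illustration):

```python
import numpy as np

def repeat_consistency(first, repeat, correct):
    """Summary measures for repeated trials (cf. Fig. 19).

    first / repeat: a participant's answers on the first and the
    repeated presentation of the same trials; correct: ground truth.
    Returns the proportion of identical answers (Fig. 19A) and the
    accuracy on the repeated and first presentations (Fig. 19B, C).
    """
    first, repeat, correct = map(np.asarray, (first, repeat, correct))
    same = float(np.mean(first == repeat))
    acc_repeat = float(np.mean(repeat == correct))
    acc_first = float(np.mean(first == correct))
    return same, acc_repeat, acc_first

# Toy data: 6 repeated trials, answers coded as the chosen query image (0/1).
same, acc_repeat, acc_first = repeat_consistency(
    first=[1, 0, 1, 1, 0, 0],
    repeat=[1, 0, 1, 0, 0, 0],
    correct=[1, 0, 1, 0, 1, 0])
```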

a.2.5 Qualitative Findings

In a qualitative interview conducted after completion of the experiment, participants reported using a large variety of strategies. Colors, edges, repeated patterns, orientations, small local structures and (small) objects were commonly mentioned. Most but not all participants reported having adapted their decision strategy throughout the experiment. Lay participants from Experiment II in particular emphasized that the trial-by-trial feedback helped them learn new strategies. As already described in the main text, participants reported that the task difficulty varied greatly: while some trials were simple, others were challenging. A few participants highlighted that the comparison between minimally and maximally activating images was a crucial clue and allowed them to employ an exclusion criterion: if the minimally activating query image was easily identifiable, the choice of the maximally activating query image was trivial. This observation motivated us to conduct an additional experiment in which the presentation scheme was varied (Experiment II).

a.2.6 High quality data as shown by high performance on catch trials

We integrate a mechanism to probe the quality of our data: In catch trials, the correct answer is trivial, and hence incorrect answers might suggest the exclusion of specific trial blocks (for details, see Sec. A.1.1). Fortunately, very few trials are missed: In Experiment I, only two (out of ten) participants miss one trial each; in Experiment II, five participants miss one trial and four participants miss two trials. As this indicates that our data is of high quality, we do not repeat the analysis with these trials excluded, as we expect the same results.
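A minimal sketch of such a post-hoc attention check follows; the exclusion threshold and participant identifiers are hypothetical (in the paper itself, no participant is excluded):

```python
def flag_inattentive(missed_catch_trials, threshold=3):
    """Return participants whose number of missed catch trials reaches
    a (hypothetical) exclusion threshold.

    missed_catch_trials: {participant_id: number of missed catch trials}.
    """
    return sorted(p for p, n in missed_catch_trials.items() if n >= threshold)

# Missed-trial counts roughly mirroring Experiment II as reported above.
missed = {"P1": 1, "P2": 1, "P3": 1, "P4": 1, "P5": 1,
          "P6": 2, "P7": 2, "P8": 2, "P9": 2, "P10": 0}
flagged = flag_inattentive(missed)  # empty: nobody reaches the threshold
```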

a.3 Details on Related Work

Table 4: Overview of publications that evaluate explanation methods in human experiments. For each publication, the table lists the evaluated explanation methods and baseline conditions (e.g., feature visualization, guided backpropagation, Extremal Perturbations, Anchor or LIME explanations versus natural images, random explanations or no explanation), the main findings (e.g., whether explanations increase participants' confidence or trust), and the experimental setup: dataset, task, participants and collected data (e.g., accuracy, confidence ratings, reaction times, post-hoc evaluations).