1 Introduction
Throughout the last decade, (black box) machine learning (ML) techniques have made impressive performance leaps on even the most complex tasks [1, 2, 3, 4]
, especially in the form of Deep Neural Networks (DNN)
[5]. These models are typically trained (or pretrained) on very large datasets, e.g., ImageNet [6], with millions of samples. Recently, it was discovered that biases, spurious correlations, as well as errors in the training dataset may have a detrimental effect on the training resulting in “Clever Hans” predictors [7], which only superficially solve the task they have been trained for. Unfortunately, due to the immense size of today’s datasets, a direct manual inspection and removal of artifactual samples can be regarded hopeless. However, analyzing the biases and artifacts in the model instead, may provide insights about the biases and artifacts in the training data indirectly. For that purpose we would, however, need to inspect the learning models and operate them beyond black box mode.Only recently methods of explainable AI (XAI) (c.f. [8]
for an overview) were developed. They provide deeper insights into how an ML classifier arrives at its decisions and potentially help to unmask Clever Hans predictors. XAI methods can be roughly categorized into two groups: methods providing
local explanations and those providing global explanations [9]. Here, local explanations increase transparency on individual predictions of the model and assess the importance of input features w.r.t. specific samples. Local explanations are commonly presented in the form of attribution or heatmaps aligned to the input space, which can be computed e.g. from propagationbased (e.g. [10, 11, 12, 13]) or surrogatebased techniques (e.g. [14, 15, 16]). Global methods on the other hand aim at obtaining model understanding by assessing the general importance of features a model relies on. These features often are highlevel concepts and are either chosen/designed manually, or directly and discretely are accessible in input space [17, 18, 19].Both the local and global approaches suffer from a (human) investigator bias during analysis, thus are of limited use when searching for biases, spurious correlations, and errors in the training dataset. Global methods can only measure the impact of predetermined, and expected or known features (c.f. [18, 19]), which limits the applicability when aiming to discover all behavioral facets of a model (including specifically the ones unknown beforehand). Local methods, on the other hand, have the potential to provide much more detailed information per sample, but compiling information about model behavior over thousands or millions of samples and explanations is a tiring and laborious process. Furthermore, the success of such an analysis depends on the examiner’s keen perception and domain knowledge, at times limiting the potential for knowledge discovery.
Our current contribution aims at bridging the vast gap between the discussed two extremes in an objective and automated manner making use of scalable statistical inference on millions of heatmaps, specifically employing the Spectral Relevance Analysis (SpRAy) [7] technique. SpRAy constitutes a semiautomated statistical analysis of large numbers of locally explaining attribution maps, with the intent of automated summarization and clustering of model strategies via local attribution maps. Thus, SpRAy is aligned with the assumption that an ML model might base its predictions on multiple substrategies (instead of only globally effective ones) for recognizing a target class and distinguishing it from others. Early applications of SpRAy [7] demonstrate its utility for knowledge discovery via strategy summarization on gameplay sequences of Atari2600 playing DNN agents [2], and the detection of multiple flaws in dataset composition and model architecture design in the context of the formerly widely used Pascal VOC image recognition benchmark [20, 7].
In this paper, we extend SpRAy to make it better applicable for largescale analyses on datasets with hundreds of classes and millions of samples, such as ImageNet [6]. Our technical contributions are: (a) a new Wassersteinbased similarity metric for scale, translational, and rotational invariant comparisons of attributions, (b) the identification of artifactual and biased samples in the ImageNet corpus, and a quantitative analysis of their impact on the Clever Hans’ness of the classifier, (c) a systematic cleaning procedure for the artifacts and biases to reduce CleverHans behavior, i.e. we unHans the ImageNet dataset. These allow interesting findings that are illuminating beyond our specific technical approach.
2 Methods
We will first briefly summarize SpRAy from [7] (see Figure 1 for a procedural overview), emphasizing and motivating where and how we go beyond [7]. An algorithmic summary of the technique can be found in Algorithm 1.
2.1 Spectral Relevance Analysis brought to scale
The SpRAy technique is a metaanalysis tool for finding patterns in model behavior, given sets of instancebased explanatory attribution maps. The algorithm is based on Spectral Clustering (SC)
[21, 22] and performs the following sequence of computations.Computing attributions SpRAy analyzes the (spatial) structure described by a set of heat/attribution maps, each locally explaining single model decisions w.r.t. to a model prediction the user is interested in. Following [7], we provide attributions computed with Layerwise Relevance Propagation (LRP) [10] according to the recommended composite (or layerdependent) strategy [23, 24]. Specifically, since our analyses are based on a pretrained VGG16 [25] type DNN we follow [24] and apply LRP to the model’s dense layers, LRP in the lowest convolutional layer and LRP (=1) in all other convolutional layers, using the preconfigured LRPanalyzers of the iNNvestigate [26] toolbox. We sum attribution scores along the color channel axis to obtain a single attribution value per pixel coordinate.
Preprocessing of attributions The work of [7] analyzes the behavior of DNN predictors, as well as a former stateoftheart model from the bagofwords family, the improved Fisher Kernel SVM [27]. Other than the DNN, the latter predictor does not expect inputs of a fixed size and therefore LRP computes nonuniformly sized attribution maps over the analyzed data. The authors of [7] resort to sumpooling attribution scores from arbitrarily sized explanation maps onto a 2020 sized grid, resulting in a 400dimensional representation per attribution map for further processing. The authors justify the often extreme size or dimensionality reduction with an increased (regional) stability of the analysis and a decrease in computational cost. They also point out that this step – albeit useful – is no practical necessity.
Since in this paper we only process uniformly sized attribution maps of 224224 pixels as output of LRP and the VGG16 model, we omit the optional preprocessing and rather preserve the complete and unaltered structural information within the LRP heatmap attributions. We show that albeit absent this preprocessing step, we can discover classspecific and consistent cleverhans strategies on ImageNet.
Computing distances and affinities An ingredient of SpRAy as performed in [7] is a comparison of heatmaps based on the euclidean distance, which focuses on the task of finding regional consistencies between attribution maps. In order to further expand the scope of SpRAy, we consider the GromovWasserstein distance (GWD) [28] as an additional distance measure.
As [7], we compute pairwise (binary) affinity scores between all samples using the
nearestneighbor (KNN) algorithm
[29] on the previously computed distance measures and store them in a matrix . We visualize this step of the procedure, along with all following steps, in Figure 2, intuitively demonstrating of the spectral Clustering (SC) process.We find that for ImageNet classes, choosing a fixed =10 instead of == (c.f. [30, 7]. =1,300 is the number of samples per ImageNet object category) yields the most consistent and robust groupings throughout the spectral analysis. Values for
larger than 10 seem to have no, or only little effect, while smaller values lead to an overlyfractured affinity structure. The affinity matrix
is then symmetrized with(1) 
which leads to and if and already have been mutually connected prior to the symmetrization, and otherwise.
Computing the spectral embedding The computation of the spectral embedding is the analytic core of the SpRAy method. Given a matrix describing the affinity structure of the data, following [7] we first compute the symmetrized and normalized p.s.d. graph laplacian [22, 30]
(2) 
where is the diagonal degree matrix of the connectivity graph described by , and the unnormalized graph laplacian [30]. The entries of the diagonal matrix are computed as
(3) 
Then, an eigenvalue decomposition of is performed, which yields eigenvalues
and eigenvectors as
columns of .The set of returned eigenvalues informs about the cluster structure discovered in the data [30], where completely separated clusters are indicated by the number of eigenvalues =0, which is rarely the case with real world data. Figure 2(b) demonstrates that the structure and number of embedded clusters can also be inferred from the eigengap, a sudden increase in difference of neighboring eigenvalues. The first eigengap in the toy example identifies four almost completely disjunct groups of points.
The rows in constitute the spectral embedding of the data in , reflecting the affinity structure encoded in and . Cluster labels can be assigned to the embedding by using e.g. meansclustering on the spectral embedding. In the toy example in Figure 2(c) we choose =4 according to the eigenvalue spectrum and show the assigned labels using color coding on the spectral embedding projected to . Due to the correspondence of the input data to the rows of , we can directly assign the computed labels in input space aswell (Figure 2(d)).
On ImageNet, where =1,300, we restrict =32 for the computation of the eigendecomposition. Specifically, we use the computationally efficient and iterative Lanczos algorithm^{1}^{1}1via the sparse.linalg.eigsh function provided by the SciPy [31] package for Python [32], which has in general a computational complexity of , which reduces to when only the most discriminative eigenvectors are to be computed.
2.2 Alternative distance measures
SpRAy has originally used euclidean distances to compute the neighborhood graph [7]
. In the application of images, this means that image similarity is identified by spatial properties,
i.e. having the same attribution intensities at the same pixel renders high similarity. This is a reasonable approach, especially if one would like to focus on spatial properties such as watermarks or padding. However, when the domain of interest are spatially unrelated shapes or color distributions, other measures of similarity may be needed.
A recently very popular distance metric is the Optimal Transport, or WassersteinDistance. In the context of computer vision, it is also known as the EarthMover’s Distance [33]. Its benefit is that it “feels” like a very natural distance metric [34].
Wasserstein distances use distances between spatially fixed points over the same identical image grid. GromovWasserstein [28] distances matches points by their pairwise distances, instead of using a fixed image grid with a fixed amount of points. This means that however points are spatially distributed, if in both sets there are points whose pairwise relations are similar, then their GromovWasserstein distance will be small. A somewhat intuitive visualization of euclidean distance, Wasserstein distance, and GromovWasserstein distances is shown in Figure 3. We show 4 samples of handwritten digits [35] in 4 corners, translated and rotated. All images that lie on the line between the corners are barycenters [36] of the corner images, weighted by the Chebyshev distance to all samples. The metrics used to compute the barycenters are the 3 previously mentioned metrics. Wasserstein barycenters are computed as in [34]. For the GromovWasserstein distance, we need to compute pairwise distances between points in the image. Points are extracted from the images by choosing each pixel one after another, starting with the largest, until percent of the total sum of all pixel values is reached. We can nicely see that the Wasserstein distance seems translation invariant, but fails with different rotations. GromovWasserstein distance shows to be invariant to rotation, translation, and mirroring, since all the information is contained in only the pairwise relations. Thus, to enable invariant comparison and in this manner maximally match regionally independent shapes in our analysis, we use the GromovWasserstein distance measure.
2.3 Fisher Discriminant Analysis for Clever Hans identification
A critical decision in clustering approaches is the number of desired clusters. Since [7] analysed the comparatively small Pascal VOC dataset [20] with classes and almost samples, going over each class individually and exploring the eigenvalues for a possibly significant eigengap is a straightforward task. For ImageNet [6] with its classes and million samples, looking at each and every class eigengap manually is a timeconsuming and exhausting venture. We would therefore like to establish some score to measure how interesting a class is in terms of classification strategies that separate themselves significantly from all other possible strategies, as they are good candidates for Clever Hans behaviors. Such a score could furthermore be used to rank all classes, so that only those which produce a high cluster separability score , can be selected for thorough investigation. Fisher Discriminant Analysis (FDA) [37, 38] is a widely popular method for classification as well as class (or cluster) structure preserving dimensionality reduction. FDA finds an embedding space by maximizing betweenclass scatter and minimizing withinclass scatter . It can be understood as finding the direction(s) of maximal separability between classes. The criterion as chosen by [38] for multiple classes is given as
(4)  
(5) 
where is the projection matrix that minimizes withinclass scatter
(6) 
while maximizing betweenclass scatter
(7) 
Here, is the set of clusters and its cardinality, the mean of cluster and the mean over all samples. The projection matrix can be found by solving the generalized eigenvalue problem
(8) 
with where are the generalized eigenvectors corresponding to the eigenvalues . By reducing the projection matrix to only use the eigenvectors corresponding to the largest eigenvalues and inserting into (4), we obtain a score of separability. In our application, since we already act on embedded data, we found that using only the eigenvector corresponding to the largest eigenvalue, e.g. the direction of largest separability, gives a reasonable score for ranking the classes. We compute means clusterings for each class individually with ranging from to , and compute the separability score for each. Then, we compute the mean separability over all clusterings of each class as . Classes with a high mean separability score are Clever Hans candidates since then the strategies to classify exhibit highly variable behavior. Table 1 in Section 3 lists the ImageNet classes with the highest and lowest values.
2.4 Cluster structure visualization
Following [7] tdistributed Stochastic Neighborhood Embedding (tSNE) [39] is used to visualize the cluster structure described by the analyzed heatmaps. To preserve a strong connection between the cluster assignments from SC and the 2Dembeddings visualized with tSNE, the authors reuse the affinity structure in as input to tSNE.
We extend this approach by using the spectral embedding as input to compute visualizable point clouds in . We observe significantly less overlap among differently labelled clusters in the visualized point clouds when embedding the from SC instead of computing embeddings from . See Figure 2(c) for projections of the spectral embedding into , computed with tSNE for in context of a toy example.
Alternatively to tSNE, we consider the recent UMAP [40]. The qualitative difference between tSNE and UMAPembeddings are minimal yet result in reduced intracluster scatter and increased intercluster scatter for UMAP. This is however to be expected, as we merely aim to embed for visualization, and representative spectral embeddings have already been computed with SC.
3 Experiments and Evaluations
We apply our extended SpRAy pipeline to the ImageNet dataset as described in Section 2 and Algorithm 1. That is, per class, we compute relevance attribution heatmaps for the training data of ImageNet using LRP w.r.t. to the true label. Note, that the proposed procedure is also readily applicable to all members of the XAI zoo (cf. [8]).
3.1 Identifying clusters and classes with FDA
From the separability scores computed with FDA on the spectral embeddings we can now readily identify those classes with concentrated and isolated clusters of attribution maps. Some example scores for classes with high and low separability scores are shown in Table 1. The highest separability score is achieved by class “laptop”, of which the UMAP of its spectral embedding with a significant cluster is depicted in Figure 4 (top). Clearly the visualized cluster is extremely well separated from all the other samples. For this clustering in particular, we see that a cluster of highly similar examples results in a high separability score: The “laptop” class of the ImageNet corpus contains a set of samples showing rendered instances of laptop computers in (on pixellevel) identical poses. In contrast, class “sliding door” achieves the lowest separability score of all classes. Figure 4 (bottom) shows an exemplary cluster for this class. Its UMAP shows a distribution of attributions that seems to be hard to separate, i.e. there is no decision strategy that can be clearly categorized, e.g. by importance of region.
top classes  bottom classes  

laptop  4.77  0.44  fountain 
stethoscope  1.28  0.44  home_theater 
book_jacket  1.14  0.43  wallet 
bottlecap  1.13  0.43  thresher 
tennis_ball  1.13  0.43  pencil_sharpener 
clumber  1.12  0.42  bannister 
stole  1.06  0.41  sliding_door 
3.2 Model understanding and hypothesis testing
By closely inspecting the clusters identified using FDA, we can formulate hypotheses about the model’s decision strategies. We can recognize groupings of complicated shapes, invariant of scale, location or translation on clusters found with GromovWasserstein distance at the base of SpRAy. Examples for two distilled clusters from classes “ringneck snake” — where the snake’s head and its brightly colored neck appear to be the relevant features — and “great grey owl” — where the patterns highlighting the face (eyes and beak) and shape of the head seem to be the common denomiator — can be seen in Figure 5. However, despite the favorable invariance properties, deducting distinct hypotheses for these strategies turns out to be a nontrivial task, since clusters are semantically much harder to interpret compared to groupings found with a euclidean distance at the root. As expected, euclidean distancebased results exhibit tight groupings of attribution maps with shared regional concentrations of attribution scores, which is rather intuitive to interpret, even without much domain knowledge. We thus continue with a euclideanbased analysis.
We have discovered various interesting CleverHans’ moments in ImageNet. In the following we will go into detail for four examples of such strategies, which are depicted in Figure
6. Two significant clusters can be found in the class “stole”: One which contains samples where all four corners are digitally rounded, apparently created by the same author (photographer). Heatmaps for those samples show very high relevance for this rounded corner property (pointed at by a red marker). It is worth to note that for the leftmost example shown in the figure, which was in contrast to all others incorrectly classified, the bottomleft corner of the image is not deemed relevant by the model (pointed at by a blue marker). The second, densely packed cluster for class “stole” shows a wooden mannequin “head” consistently used by the model for predicting the true class. Another intriguing cluster is found in the class “garbage truck”. In the identified cluster, we can see a significant number of images with a characteristic watermark placed in the bottom left corner, which is consistently picked up by the model as a relevant feature. The class “stethoscope”, which received the second highest score, shows padding added to the top and bottom of the image, which has not been introduced during a modelspecific preprocessing step, but rather are part of the images themselves. The class “jigsaw puzzle” shows a presumably digitally pasted identical patterns on top several the source images. At least three distinct variants of this puzzle pattern have been discovered using SpRAy. The consistency of this pattern across multiple samples allows the model to overfit to this data artefact. Finally, a series of samples from class “mountain bike” shows a nearidentical gray border padding picked up by the model.For each of these observations, we can formulate the hypothesis that the model is biased on the via heatmaps highlighted properties towards their respective class. To investigate whether this is true, we construct an ablation study by isolating the artifact source and adding it to samples of other classes. That is, for class “stole” we create a digital mask of rounded corners from the affected original images and extract one of the shown wooden “mannequin heads” as a freely placable image layer. For class “mountain bike” we replicate the gray border and for “jigsaw puzzle” we extract all three discovered digital puzzle patterns. If the model then shifts its decision from the the ground truth label of the samples from the “other classes“ towards the class of the artifact, we can safely deduce that the model is biased with respect to this artifact. We summarized the results for a quantitative verification of selected hypotheses in Table 3, with mean prediction rank difference and mean prediction difference . A significant increase in prediction rank is clearly visible towards the shown artefact classes, except for class “mountain bike”. A possible explanation of this deviation of class “mountain bike” from the expected trend is revealed upon closer inspection of additional data artefacts. Several other classes, such as (e.g. classes “expresso maker” and “guillotine”), show similar border padding effects in some of the images. Furthermore, we expect to also see increased relevance on the artifact in the attribution for the modified image. Examples demonstrating isolated artifacts added onto different class samples, and the models’ reaction to the artefact when computing heatmaps for the artefact’s class of origin, are visualized in Figure 7.
By adding a discovered data artifact to samples of other classes we are able to quantify its importance to the detection of the labelled concept. By removing
an artefact, we can estimate to what degree a model has learned to (solely) base its decision on the artefactual feature. If the model reacts strongly to the removal of the artefactual image feature, it has (with high probability) resorted to the artefact as a main source of information for the respective target class. If the model does only show a weak reaction or none at all, it may have learned (several) backup strategies for detecting the concept of the target label. We measure the model’s sensitivity to the artefacts discussed in this section, by using digital inpainting techniques on the affected samples in the validation set. Table
4 compiles measurements and for artefact removals on classes “stole”, “jigsaw puzzle” and “mountain bike”. While the prediction for class “mountain bike” is almost completely unaffected again, and the classifier seems to have developed backup plans for predicting class “stole” in the absence of rounded image corners and wooden mannequin heads, the “jigsaw puzzle” classifier catastrophically fails in two out of three cases when a discovered digitally pasted jigsaw puzzle pattern is removed from the affected samples. The model has thus, for the class “jigsaw puzzle”, strongly overfitted to the discovered dataset bias.As an additional interesting observation, we have also found classes with examples in the validation set that show the same type of artifacts as used in some of the discovered Clever Hans prediction strategies (e.g. see Figure 8), putting the model’s performance on the validation set for any of the affected classes in question.
3.3 UnHans’ing the model by fixing the data
A range of the prediction biases of Clever Hans type identified within our study exhibits variations of highly systematic artifactual patterns. The yellow watermark used to predict class “garbage truck”, e.g., always occurs in the bottom left corner of the affected images in Figure 6 and Figures 9 and 10. It should thus now be possible to use this understanding of the biased decision strategy for the purpose of model rectification, with the intent of making the model forget its association of the yellow watermark artefact to class “garbage truck”.
For this purpose, we design an experiment on a reduced set of ImageNet classes, including the affected class “garbage truck”, as well as four other randomly chosen classes. This subset (henceforth called subset A) consists of training samples and validation samples and shall serve as a baseline. We then create a copy B of subset A, and isolate the watermark (see Figure 9 Mid) supposedly causing the prediction bias in class “garbage truck” from the affected training images. Within the training partition of the subset B, the isolated watermark is then added to the bottom left corner of all (yet unaffected) training samples in order to disable the watermark as a source of information which can be associated to one specific class, i.e. “garbage truck”. Starting from the original (pretrained) VGG16 model, we finetune two neural networks, one on the sets A and B each. We compare the prediction performance on the unaltered validation partition of subset A, as well as the model’s behavior via LRP heatmaps. First of all, it is to be expected that both finetuned models perform well on this reduced (and thus simpler) problem set. However, it is interesting to note that the model trained on subset B significantly outperforms model A with vs. top1 accuracy on the validation set of the ImageNet subset A finetuning.
Observing the relevance attribution maps for both finetuned models shown in Figures 9 and Figures 10 reveals that the model trained on subset variant B has consistently ceased to rely on the watermark for decision making after only 10 epochs, (already starting from the first epoch), while gradually shifting the base of its decisionmaking to the storage container of the vehicle. The model trained on A, however, retains the copyright watermark as part of its prediction strategy. The absence of the watermark within the unaltered validation set A explains the lower performance of model A, which obviously relies on the watermark to “generalize” on unseen data. We investigate this finding by exposing both finetuned models to a validation set which has seen the treatment of ImageNet subset B, i.e., where all validation samples show an artificially added copyright tag. In this case, the accuracy of the model trained on A drops to , while the model trained on B retains its original performance, further verifying that model A still associates the watermark to a higher degree to class “garbage truck”, while model B has largely disassociated class and artefactual feature. The reported accuracy ratings are summarized in Table 2. We have thus – albeit in a toy scenario – for the first time successfully unlearned a CleverHans strategy discovered with attributions and SpRAy, resulting in an improved model B above its baseline A. This marks an excellent starting point to unHans data corpora.
validation on A  validation on B  

training on A  98.2%  96.4% 
training on B  99.8%  99.8% 
class  bias  samples  

stole  rounded corners  2000  58.14  0.0004 
stole  mannequin “head”  10  106.10  0.0081 
jigsaw puzzle  jigsaw pattern 1  2000  220.98  0.0160 
jigsaw puzzle  jigsaw pattern 2  2000  355.60  0.8415 
jigsaw puzzle  jigsaw pattern 3  2000  356.42  0.9540 
mountain bike  watermark  2000  101.02  0.0001 
class  bias  samples  

stole  rounded corners  10  0.70  0.1756 
stole  mannequin “head”  13  0.62  0.3713 
jigsaw puzzle  jigsaw pattern 1  44  0.11  0.0146 
jigsaw puzzle  jigsaw pattern 2  44  112.52  0.9160 
jigsaw puzzle  jigsaw pattern 3  44  208.41  0.9305 
mountain bike  watermark  17  0.00  0.0206 
4 Conclusion
Deep Learning models have gained high practical usability by pretraining on large corpora and then reusing the learned representation for transferring to novel related data. A prerequisite for this practice is the availability of large corpora of rather standardized and, most importantly, representative data. If artifacts or biases are present in data corpora, then the representations formed are prone to inherit these flaws. This is clearly to be avoided, however, it requires either clean data or detection and subsequent removal of artifacts, biases etc. of data bases that would cause dysfunctional representation learning.
In this paper we have used explanation methods (LRP attributions [10], for an overview see [8, 41]) and specifically extended SpRAy, a technique that has been successfully used to unmask Clever Hans behavior, for automatically and scalably detecting subtle and less subtle flaws in the ImageNet corpus. One strand of our analysis was devoted to the question of how to properly reflect scaling, translations and rotations when comparing attributions in the clustering step of SpRAy. Here we found the Wasserstein distance to be a versatile candidate for achieving a metric encompassing the mentioned crucial invariances. Furthermore, our comprehensive qualitative and quantitative analysis based on the scalable technique proposed above reveals for different classes in ImageNet rather unexpected Clever Hanstype strategies [7] of the popular VGG16 deep learning model (to which also other architectures are sensitive to; see Appendix). These are caused by a zoo of artifacts and biases isolated by our framework in the corpus; these encompass: copyright tags, unusual image formatting, specific cooccurrences of unrelated objects, cropping artifacts, just to name a few. Detecting this zoo gives not only insight but also the possibility for relieving ImageNet from its Clever Hans moments, i.e. we are now able to unHans the ImageNet corpus and provide a more consistent basis for pretrained models. We demonstrated this in an unlearning experiment for class “garbage truck” (see above and Figures 9 and 10), and further ImageNet classes in the Appendix. Note that without removing such data artifacts, learning models are prone to adopt Clever Hans strategies [7], thus, giving the correct prediction for an artifactual/wrong reason. This makes them especially vulnerable to adversarial attacks that can harvest all such artifactual issues in a data corpus [42]. Future work will therefore focus on the important intersection between security and functional cleaning of data corpora, e.g., to lower the attack risk when building on top of pretrained models.
Acknowledgement
This work was supported by the Brain Korea 21 Plus Program through the National Research Foundation of Korea; the Institute for Information & Communications Technology Promotion (IITP) grant funded by the Korea government [No. 2017000451]; the Deutsche Forschungsgemeinschaft (DFG) [grant Math+, EXC 2046/1, Project ID 390685689]; and the German Ministry for Education and Research (BMBF) as Berlin Big Data Center (BBDC) [01IS14013A], Berlin Center for Machine Learning (BZML) [01IS18037I] and TraMeExCo [01IS18056A].
References
 [1] Y. LeCun, Y. Bengio, and G. E. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.

[2]
V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare,
et al.
, “Humanlevel control through deep reinforcement learning,”
Nature, vol. 518, no. 7540, pp. 529–533, 2015.  [3] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, et al., “Mastering the game of go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, 2016.

[4]
K. T. Schütt, F. Arbabzadah, S. Chmiela, K.R. Müller, and A. Tkatchenko, “Quantumchemical insights from deep tensor neural networks,”
Nature Communications, vol. 8, p. 13890, 2017. 
[5]
A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagennet classification with deep convolutional neural networks,” in
Advances in Neural Information Processing Systems (NIPS), pp. 1097–1105, 2012.  [6] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
 [7] S. Lapuschkin, S. Wäldchen, A. Binder, G. Montavon, W. Samek, and K.R. Müller, “Unmasking clever hans predictors and assessing what machines really learn,” Nature Communications, vol. 10, p. 1096, 2019.
 [8] W. Samek, G. Montavon, A. Vedaldi, L. K. Hansen, and K.R. Müller (Eds.), “Explainable AI: Interpreting, explaining and visualizing deep learning,” Springer LNCS 11700, 2019.
 [9] S. M. Lundberg, G. G. Erion, H. Chen, A. DeGrave, J. M. Prutkin, B. Nair, R. Katz, J. Himmelfarb, N. Bansal, and S. Lee, “Explainable AI for trees: From local explanations to global understanding,” CoRR, vol. abs/1905.04610, 2019.
 [10] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.R. Müller, and W. Samek, “On pixelwise explanations for nonlinear classifier decisions by layerwise relevance propagation,” PLoS ONE, vol. 10, no. 7, p. e0130140, 2015.
 [11] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Gradcam: Visual explanations from deep networks via gradientbased localization,” in Proc. of IEEE International Conference on Computer Vision (ICCV), pp. 618–626, 2017.
 [12] M. Sundararajan, A. Taly, and Q. Yan, “Axiomatic attribution for deep networks,” in Proc. International Conference on Machine Learning (ICML), pp. 3319–3328, JMLR.org, 2017.
 [13] A. Shrikumar, P. Greenside, and A. Kundaje, “Learning important features through propagating activation differences,” in Proc. of International Conference on Machine Learning (ICML), pp. 3145–3153, 2017.
 [14] M. T. Ribeiro, S. Singh, and C. Guestrin, “’why should I trust you?’: Explaining the predictions of any classifier,” in Proc. of ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), pp. 1135–1144, 2016.
 [15] L. M. Zintgraf, T. S. Cohen, T. Adel, and M. Welling, “Visualizing deep neural network decisions: Prediction difference analysis,” in Proc. of International Conference on Learning Representations (ICLR), 2017.
 [16] R. C. Fong and A. Vedaldi, “Interpretable explanations of black boxes by meaningful perturbation,” in Proc. of IEEE International Conference on Computer Vision (ICCV), pp. 3449–3457, 2017.

[17]
I. Guyon and A. Elisseeff, “An introduction to variable and feature selection,”
Journal of machine learning research, vol. 3, no. Mar, pp. 1157–1182, 2003. 
[18]
B. Kim, M. Wattenberg, J. Gilmer, C. J. Cai, J. Wexler, F. B. Viégas, and R. Sayres, “Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV),” in
Proc. of International Conference on Machine Learning (ICML), pp. 2673–2682, 2018.  [19] R. Rajalingham, E. B. Issa, P. Bashivan, K. Kar, K. Schmidt, and J. J. DiCarlo, “Largescale, highresolution comparison of the core visual object recognition behavior of humans, monkeys, and stateoftheart deep artificial neural networks,” Journal of Neuroscience, vol. 38, no. 33, pp. 7255–7269, 2018.
 [20] M. Everingham, L. Gool, C. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes challenge results,” URL: http://host.robots.ox.ac.uk/pascal/VOC/voc2007/ workshop/everingham_cls.pdf, 2007.

[21]
M. Meila and J. Shi, “A random walks view of spectral segmentation,” in
Proceedings of the Eighth International Workshop on Artificial Intelligence and Statistics, AISTATS 2001, Key West, Florida, US, January 47, 2001
, 2001. 
[22]
A. Y. Ng, M. I. Jordan, and Y. Weiss, “On spectral clustering: Analysis and an algorithm,” in
Advances in Neural Information Processing Systems, pp. 849–856, 2002.  [23] G. Montavon, A. Binder, S. Lapuschkin, W. Samek, and K.R. Müller, “Layerwise relevance propagation: an overview,” in Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, pp. 193–209, Springer LNCS 11700, 2019.
 [24] M. Kohlbrenner, A. Bauer, S. Nakajima, A. Binder, W. Samek, and S. Lapuschkin, “Towards best practice in explaining neural network decisions with LRP,” CoRR, vol. abs/1910.09840, 2019.
 [25] K. Simonyan and A. Zisserman, “Very deep convolutional networks for largescale image recognition,” CoRR, vol. abs/1409.1556, 2014.
 [26] M. Alber, S. Lapuschkin, P. Seegerer, M. Hägele, K. T. Schütt, G. Montavon, W. Samek, K.R. Müller, S. Dähne, and P.J. Kindermans, “innvestigate neural networks!,” Journal of Machine Learning Research, vol. 20, pp. 93:1–93:8, 2019.
 [27] F. Perronnin, J. Sánchez, and T. Mensink, “Improving the fisher kernel for largescale image classification,” in Proc. of European Conference on Computer (ECCV), pp. 143–156, 2010.
 [28] G. Peyré, M. Cuturi, and J. Solomon, “Gromovwasserstein averaging of kernel and distance matrices,” in Proc. of International Conference on Machine Learning (ICML), pp. 2664–2672, 2016.
 [29] N. S. Altman, “An introduction to kernel and nearestneighbor nonparametric regression,” The American Statistician, vol. 46, no. 3, pp. 175–185, 1992.
 [30] U. von Luxburg, “A tutorial on spectral clustering,” Statistics and Computing, vol. 17, no. 4, pp. 395–416, 2007.

[31]
E. Jones, T. Oliphant, P. Peterson, et al.
, “SciPy: Open source scientific tools for Python,” 2001–.
[Online; accessed ].  [32] C. Lanczos, An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. United States Governm. Press Office Los Angeles, CA, 1950.
 [33] Y. Rubner, C. Tomasi, and L. J. Guibas, “A metric for distributions with applications to image databases,” in Procedings of the Sixth International Conference on Computer Vision (ICCV98), Bombay, India, January 47, 1998, pp. 59–66, 1998.
 [34] J. Solomon, F. de Goes, G. Peyré, M. Cuturi, A. Butscher, A. Nguyen, T. Du, and L. J. Guibas, “Convolutional wasserstein distances: efficient optimal transportation on geometric domains,” ACM Trans. Graph., vol. 34, no. 4, pp. 66:1–66:11, 2015.

[35]
Y. LeCun, “The mnist database of handwritten digits.”
http://yann.lecun.com/exdb/mnist/, 1998.  [36] M. Cuturi and A. Doucet, “Fast computation of Wasserstein barycenters,” in Proceedings of the International Conference on Machine Learning (ICML), pp. 685–693, 2014.
 [37] R. A. Fisher, “The use of multiple measurements in taxonomic problems,” Annals of eugenics, vol. 7, no. 2, pp. 179–188, 1936.

[38]
K. Fukunaga, “Chapter 1  introduction,” in
Introduction to statistical pattern recognition
, Boston: Academic Press Professional, Inc., 1990.  [39] L. v. d. Maaten and G. Hinton, “Visualizing data using tsne,” Journal of Machine Learning Research, vol. 9, no. Nov, pp. 2579–2605, 2008.
 [40] L. McInnes and J. Healy, “UMAP: uniform manifold approximation and projection for dimension reduction,” CoRR, vol. abs/1802.03426, 2018.
 [41] G. Montavon, W. Samek, and K.R. Müller, “Methods for interpreting and understanding deep neural networks,” Digital Signal Processing, vol. 73, pp. 1–15, 2018.
 [42] N. Carlini and D. Wagner, “Towards evaluating the robustness of neural networks,” in 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57, IEEE, 2017.
 [43] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 2126, 2017, pp. 2261–2269, 2017.
Appendix
We complement our findings from our manuscript with additional data within this appendix, by (1) demonstrating how the prediction artefacts discovered for the VGG16 classifier also reappear on other DNN models and (2) showing the effects of the unHans’ing experiment, which has been described in detail in Section 3.3, when performed on other classes and artifacts.
4.1 ImageNet Artefacts Affecting Additional DNNs
In Section 3.2 we describe a series of systematic prediction biases discovered using the SpRAy technique for several affected classes. In all these cases, the downloaded VGG16 model has overfit on input features which are characteristic for certain object classes in context of the ImageNet dataset. We thus assume that other neural network architectures sharing the same data source for training may also share certain Clever Hans strategies with the investigated VGG16 classifier.
Figure 11 exemplarily shows LRP heatmaps computed for the VGG19 [25] and the DenseNet121 [43] model — which have also been downloaded as pretrained predictors optimized on the ImageNet data corpus — for samples which exhibit data artefacts as discovered for the VGG16 model. We notice that both the architecturally very similar VGG16 and VGG19 architectures heatmaps are very concentrated on shape features such as edges and colorgradient rich image areas. The heatmaps computed for the DenseNet121 model on the other hand are much more focused on class and objectpecific textures and colors. For all investigated samples, we however notice that all three models tend to use the same w.r.t. to the true class semantically unrelated yet corellated features for prediction. That is, for class “carton”, all three models support their predictions with a set of barely visible and centered watermark consisting of asian characters for prediction, as well as a second orange and small watermark appearing in the bottom right corner of “carton” images with high frequency. Similarly for classes “garbage truck”, “jigsaw puzzle” and “stole” shown in Figure 11 all three models support their prediction based on the discovered yellow watermark, the cutouts of the digitally added puzzle pattern, the rounded image corners and the wooden mannequin head.
Considering the systematicity of use of these data artefacts by all three models, we strongly recommend a thorough categorization of Clever Hans behavior of machine learning models and their data sources essential components of future dataset creation efforts.
4.2 Additional Cases of UnHans’ing
In Section 3.3, we described the unHans’ing procedure with the intent to force the neural network model to forget learned yet unintended featuretoclass associations in detail for class “garbage truck” affected a yellow watermark artefact. Here, we repeat the experiment of Section 3.3 for artefacts discovered for classes “stole” and “jigsaw puzzle”.
Figure 12 demonstrates a setting highly similar to the one for class “garbage truck” discussed in Section 3.3: The discovered data artefact — here a digitally rounded image corners with white background — exhibits extremely high regional consistency and only covers very limited parts of the image area. Once the isolated corner feature has been added to all samples during our experiment, the model quickly has disassociated the artefact from the label “stole”. After continued retrainng, LRP begins to attribute negative relevance to rounded image corners, indicating that the process of unHansing went beyond mere forgetting by creating a negative association between corner artefact and class label.
The second data artefact discoverd for class “scole” is a frequently shown wooden “mannequin head” coappearing with the woven stoles themselves. Since here, the expression of the artefact was much more diverse in pose and position and has shown almost no regional consistency, we manually isolated a (very) limited amount of prototypical “mannequin heads” from the data and randomly (within reason) added wooden stump as an image element to each sample of each batch during retraining. Figure 13 shows the progression of unHansing at hand of two different input sample. While for the sample shown at the top of the figure the model has not disassociated between this particular expression of the “mannequin head” feature (at times, the feature’s accumulated positive relevance even increased), the model has ceased to support its prediction for class “stole” with the artefactual feature for the bottom image.
Lastly, we investigate the “digital jigsaw puzzle pattern” artefact discoverd for class “jigsaw puzzle”, which appears in multiple variants. Each variant, however, is expressed with almost complete and pixelidentical consistency. We therefore select one variant of the artefact and add it as a mask to all training samples of the unHansing training subset B extracted from the ImageNet corpus. Here again, we can observe that the model forgets the association between this particular pattern and the class label “jigsaw pattern”: In Figure 14, positive relevance completely disappears from the digital jigsaw pattern during unHansing, such that the feature is not used anymore for predicing “jigsaw”. What prevails, however, is a strongly negative relevance map on the fornicating ladybug pair of ladybugs, indicating the model’s reasoning that the insects’ presence speaks against class “jigsaw” (and rather for a competing network output). The effect of forgetting this consistently expressed yet very large artefactual feature has an understandably catastrophic effect to the model’s capability to predict the original ImageNet label for affected samples (c.f. Table 4).
We conclude that while according to our experiments precisely applying brain damage to a pretrained neural network model is definitely possible, its execution may in some cases — at least while through manipulation of the training data in pixel space — be nontrivial.