Toward Interpretable Machine Learning: Transparent Deep Neural Networks and Beyond

With the broader and highly successful usage of machine learning in industry and the sciences, there has been a growing demand for explainable AI. Interpretability and explanation methods for gaining a better understanding about the problem solving abilities and strategies of nonlinear Machine Learning such as Deep Learning (DL), LSTMs, and kernel methods are therefore receiving increased attention. In this work we aim to (1) provide a timely overview of this active emerging field and explain its theoretical foundations, (2) put interpretability algorithms to a test both from a theory and comparative evaluation perspective using extensive simulations, (3) outline best practice aspects i.e. how to best include interpretation methods into the standard usage of machine learning and (4) demonstrate successful usage of explainable AI in a representative selection of application scenarios. Finally, we discuss challenges and possible future directions of this exciting foundational field of machine learning.


I Introduction

A main goal of machine learning is to learn accurate decision systems, or predictors, that can help automate tasks that would otherwise have to be done by humans. Machine Learning (ML) has supplied a wealth of algorithms that have demonstrated important successes in the sciences and industry; the most popular ML workhorses are considered to be kernel methods (e.g. [150, 126, 104, 125, 154]) and, particularly during the last decade, deep learning methods (e.g. [21, 43, 87, 86, 124, 54]).

As ML is increasingly used in real-world applications, a general consensus has emerged that high prediction accuracy alone may not be sufficient in practice [85, 24, 123]. Instead, in practical engineering of systems, critical features that are typically considered beyond excellent prediction itself are (a) robustness of the system to measurement artefacts or adversarial perturbations [141], (b) resilience to drifting data distributions [40], (c) ability to accurately assess the confidence of its own predictions [111, 107], (d) safety and security aspects  [19, 66, 23, 153], (e) legal requirements or adherence to social norms [45, 49], (f) ability to complement human expertise in decision making [64], or (g) ability to reveal to the user the interesting correlations it has found in the data [70, 127].

Orthogonal to the quest for better and more holistic machine learning models, interpretable ML [123, 56, 90, 100, 162, 48, 16, 12] has developed as a subfield of machine learning that seeks to augment the training process, the learned representations and the decisions with human-interpretable explanations. An example is medical diagnosis, where the input examples (e.g. histopathological images) come with various artifacts (e.g. stemming from image quality or suboptimal annotations) that have in principle nothing to do with the diagnostic task, yet, due to the limited amount of available data, the ML model may harvest specifically these spurious correlations with the prediction target (e.g. [50, 138]). Here interpretability could point at anomalous or awkward decision behavior before harm is caused in a later usage as a diagnostic tool.

Interpretability is similarly essential when using ML in the sciences, since ideally the transparent ML model, having learned from data, may have embodied scientific knowledge that would subsequently provide insight to the scientist; occasionally this can even be novel scientific insight (see e.g. [127]). Note that in numerous scientific applications it has so far been most common to use linear models [118], favoring interpretability often at the expense of predictivity (see e.g. [51, 93]).

To summarize, there is a strong push toward better understanding ML systems that are being used and in consequence blackbox algorithms are more and more abandoned for many applications. This growing consensus has led to a strong growth of a subfield of ML, namely explainable AI (XAI) that strives to produce transparent nonlinear learning methods, and supplies novel theoretical perspectives on machine learning models, along with powerful practical tools for a better understanding and interpretation of AI systems.

In this review paper, we will summarize the recent exciting developments, present different classes of XAI methods, provide theoretical insights, and highlight the current best practices when applying interpretability. Note finally, that we do not attempt an encyclopedic treatment of all available XAI literature, rather, we present a slightly biased point of view illustrating the main ideas (and in doing so we often draw from the work of the authors) and providing — to the best of our knowledge — reference to related work for further reading.

II Towards Explaining Deep Neural Networks

To introduce basic concepts of interpretable machine learning, in particular, what an explanation is and how to produce it, we will consider as a starting point a fairly general class of machine learning models. The model will be assumed to have been fully trained and its prediction behavior to be describable in an abstract manner by a function

$$f: \mathbb{R}^d \to \mathbb{R}.$$

This function receives as input a vector of real-valued features

$$x = (x_1, \dots, x_d),$$

typically corresponding to various sensor measurements. The function produces as an output a real-valued score on which the decision is based. Classification results are then obtained by verifying whether the output is above a certain threshold or larger than the output of other functions representing the remaining classes. The function output can be interpreted as the amount of evidence for or against deciding in favor of a certain class. A sketch of such a function receiving two features $x_1$ and $x_2$ as input is given in Fig. 1.

Fig. 1: Example of a nonlinear function of the input features, which produces some prediction. The function can be approximated locally as a linear model.

In a medical scenario, the function may receive as input an array of clinical variables, and the output of the function may be a prediction of the medical condition [77]. In an engineering setting, the input could be the composition of some compound material, and the output could be a prediction of its strength [156] or stability.

Suppose a given instance is predicted by the machine learning model to be healthy, or a compound material is predicted to have high strength. We may choose to trust the prediction and go ahead with the next step within an application scenario. However, we may benefit from taking a closer look at that prediction, e.g. to verify that the prediction ‘healthy’ is associated with relevant clinical information, and not with some spurious features that accidentally correlate with the predicted quantity in the dataset [82, 85]. Such a problem can often be identified by building an explanation of the ML prediction [85].

Conversely, suppose that another instance is predicted by the machine learning model to be of low health or low strength. Here, an explanation could prove equally useful as it could hint at actions to be taken on the sample to improve its predicted score [149], e.g. possible therapies in a medical setting, or small adjustments of the compound design that lead to higher strength.

II-A How to Explain: Global vs. Local

Numerous approaches have emerged to shed light onto machine learning predictions. Certain approaches such as activation-maximization [135, 108, 106] aim at a global interpretation of the model, by identifying prototypical cases for the output quantity and allowing in principle to verify that the function has a high value only for the valid cases. While these prototypical cases may be interesting per se, both for model validation and knowledge discovery, such prototypes will be of little use for understanding, for a given example (say, near the decision boundary), which features play in favor of or against the model output $f(x)$.

Specifically, we would like to know for that very example what input features contribute positively or negatively to the given prediction. These local analyses of the decision function have received growing attention and many approaches have been proposed [14, 159, 13, 116, 142]. For simple models with limited nonlinearity, the decision function can be approximated locally as the linear function [13]:

$$f(x) \approx \sum_{i=1}^{d} \underbrace{[\nabla f(\tilde{x})]_i \cdot (x_i - \tilde{x}_i)}_{R_i} \qquad (1)$$

where $\tilde{x}$ is some nearby root point (cf. Fig. 1). This expansion takes the form of a weighted sum over the input features, where the summand $R_i$ can be interpreted as the contribution of feature $i$ to the prediction. Specifically, an inspection of the summands reveals that a feature $i$ will be attributed strong relevance if the following two conditions are met: (1) the feature must be expressed in the data, i.e. it differs from the reference value $\tilde{x}_i$, and (2) the model output should be sensitive to the presence of that feature, i.e. $[\nabla f(\tilde{x})]_i \neq 0$. An explanation for the prediction can then be formed by the vector of relevance scores $(R_i)_i$. It can be given to the user as a histogram over the input features or as a heatmap.
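As a minimal sketch of this first-order expansion, the snippet below computes the relevance scores as gradient at the root point times the difference between the data point and the root point. The toy function `f`, its hand-written gradient `grad_f`, and the choice of root point are illustrative assumptions, not taken from the paper.

```python
def taylor_relevance(grad_f, x, x_root):
    """Relevance scores R_i = [grad f(x_root)]_i * (x_i - x_root_i)."""
    g = grad_f(x_root)
    return [g_i * (x_i - r_i) for g_i, x_i, r_i in zip(g, x, x_root)]


# Hypothetical, mildly nonlinear toy model: f(x) = 2*x1 - x2 + 0.5*x1*x2.
def f(x):
    return 2.0 * x[0] - x[1] + 0.5 * x[0] * x[1]


def grad_f(x):
    # Analytic gradient of the toy model above.
    return [2.0 + 0.5 * x[1], -1.0 + 0.5 * x[0]]


x = [1.0, 0.5]
x_root = [0.0, 0.0]  # assumed root point; f(x_root) = 0 for this toy model

R = taylor_relevance(grad_f, x, x_root)
print("relevance scores:", R)
print("sum of scores:", sum(R), "vs. prediction:", f(x))
```

The gap between the sum of the scores and the prediction comes from the interaction term, which the linear expansion ignores; for a purely linear model the two would coincide.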

For illustration, consider the problem of explaining a prediction for a data point from the Concrete Compressive Strength Data Set [156]. For this data point, a simple two-layer neural network model predicts a low compressive strength. Applying the analysis above gives an explanation for this prediction, which we show in Fig. 2.

Fig. 2: Input example predicted to have low compressive strength, and a feature-wise explanation of the prediction. Red and blue color indicate positive and negative contributions.

For this example, low cement concentration and below-average age are factors of low compressive strength, although this is partly compensated by a high quantity of blast furnace slag.

Furthermore, for an explanation to be interpretable by its receiver, the latter must be able to make sense of the input features. Some features such as ‘cement’, ‘water’, and ‘age’ are understandable to everyone; however, more technical terms such as ‘blast furnace slag’ or ‘superplasticizer’ may only be accessible to a domain expert. Therefore, when using these explanation techniques, we make the implicit assumption that those input features are interpretable to the receiver.

II-B Deep Networks and the Difficulty of Explaining Them

In practice, linear models or shallow neural networks may not be sufficiently expressive to predict the task optimally. Deep neural networks (DNNs) have been proposed as a way of producing more predictive models. They can be abstracted as a sequence of layers

$$f(x) = f_L \circ \dots \circ f_1(x),$$

where each layer applies a linear transformation followed by an element-wise nonlinearity. Combining a large number of these layers endows the model with high prediction power. DNNs have proven especially successful on computer vision tasks [80, 136, 52]. However, DNN models are also much more complex and nonlinear, and quantities entering into the simple explanation model of Eq. (1) become considerably harder to compute and to estimate reliably.

A first difficulty comes from the multiscale and distributed nature of neural network representations. Some neurons are activated for only a few data points, whereas others apply more globally. The prediction is thus a sum of local and global effects, which makes it difficult (or impossible) to find a root point $\tilde{x}$ that linearly expands to the prediction for the data point of interest. The transition from the global to local effect indeed introduces a nonlinearity, which Eq. (1) cannot capture.

A second source of instability arises from the high depth of recent neural networks, where a ‘shattered gradient’ effect was observed [15], noting that the gradient locally resembles white noise. In particular, it can be shown that for deep rectifier networks, the number of discontinuities of the gradient can grow in the worst case exponentially with depth [101]. The shattered gradient effect is illustrated in Fig. 3 (left) for the well-established VGG-16 network [136]: The network is fed multiple consecutive video frames of an athlete lifting a barbell, and we observe the prediction for the output neuron ‘barbell’. The gradient of the prediction changes its value much more quickly than the prediction itself. An explanation based on such a gradient would therefore inherit this noise.

Fig. 3: Two difficulties encountered with DNNs. Left: Shattered gradient effect causing gradients to be highly varying and too noisy to be used for explanation. Right: Pathological minima in the function, making it difficult to search for meaningful reference points.

A last difficulty comes from the challenge of searching for a root point $\tilde{x}$ on which to base the explanation, one that is both close to the data and not an adversarial example [44, 107]. The problem is illustrated in Fig. 3 (right), where we showcase a reference point $\tilde{x}$ that does not carry any meaningful visual difference to the original data $x$, but for which the function output $f(\tilde{x})$ has changed dramatically. The problem of adversarial examples can be explained by the gradient noise, which causes the model to ‘overreact’ to certain pixel-wise perturbations, and also by the high dimensionality of the data (the pixels of an ImageNet image), where many small pixel-wise effects accumulate into a large effect on the model output.

III Practical Methods for Explaining DNNs

In view of the multiple challenges posed by analyzing deep neural network functions, building robust and practical methods to explain their decisions has developed into a research area of its own [100, 48, 123], and an abundance of methods have been proposed. In parallel, efficient software (cf. Appendix C for a list) makes these newly developed methods readily usable in practice, and allows researchers to perform systematic comparisons between them on reference models and datasets.

In this section, we focus on four families of explanation techniques: Interpretable Local Surrogates, Occlusion Analysis, Gradient-based techniques, and Layer-Wise Relevance Propagation. In our view, these techniques exemplify the current diversity of possible approaches to explaining predictions in terms of input features, and taken together provide a broad coverage of the types of models to explain and the practical use cases. We give references to further related methods in the corresponding subsections. Table II in Appendix C provides a glossary of all referenced methods.

III-A Interpretable Local Surrogates [116]

This category of methods aims to replace the decision function by a local surrogate model that is structured in a way that it is self-explanatory (an example of a self-explanatory model is the linear model). This approach is embodied in the LIME algorithm [116], which was successfully applied to DNN classifiers for images and text. Explanation can be achieved by first defining some local distribution $p_x(\xi)$ around our data point $x$, learning the parameter $v$ of the linear model that best matches the function locally:

$$\min_v \int \big(f(\xi) - v^\top \xi\big)^2 \, dp_x(\xi),$$

and then extracting local feature contributions, e.g. $R_i = v_i x_i$. Because the method does not rely on the gradient of the original DNN model, it avoids some of the difficulties discussed in Section II-B. The LIME method also covers the incorporation of sparsity or a simple decision tree into the surrogate model to further enhance interpretability. Additionally, the learned surrogate model may be based on its own set of interpretable features, allowing it to produce explanations in terms of features that are maximally interpretable to the human. Interpretable structures are also contained in much more complex models. For example, the CAM architecture [163] is formed by a sequence of convolutional layers followed by a top-level Global Average Pooling [89] that aggregates class features at various spatial locations. Relevant spatial locations are then readily given by the activations that enter into this top-level pooling layer.
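The local linear surrogate fit described in this subsection can be sketched as follows. This is a simplified illustration rather than the LIME library or the full algorithm: it assumes Gaussian sampling around the data point, an intercept-free linear model fitted by ordinary least squares, and a hypothetical toy function `f`. The helper names `solve_linear_system` and `local_linear_surrogate` are likewise made up for this example.

```python
import random


def solve_linear_system(A, b):
    """Solve A v = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    v = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * v[j] for j in range(i + 1, n))
        v[i] = (M[i][n] - s) / M[i][i]
    return v


def local_linear_surrogate(f, x, sigma=0.1, n_samples=500, seed=0):
    """Fit v minimizing sum_n (f(xi_n) - v^T xi_n)^2, with xi_n ~ N(x, sigma^2 I)."""
    rng = random.Random(seed)
    d = len(x)
    gram = [[0.0] * d for _ in range(d)]  # accumulates sum of xi xi^T
    rhs = [0.0] * d                       # accumulates sum of xi * f(xi)
    for _ in range(n_samples):
        xi = [x_i + rng.gauss(0.0, sigma) for x_i in x]
        y = f(xi)
        for i in range(d):
            rhs[i] += xi[i] * y
            for j in range(d):
                gram[i][j] += xi[i] * xi[j]
    v = solve_linear_system(gram, rhs)                 # normal equations
    relevance = [v_i * x_i for v_i, x_i in zip(v, x)]  # R_i = v_i * x_i
    return v, relevance


# Hypothetical black-box function; only its outputs are queried, never its gradient.
f = lambda z: 3.0 * z[0] - 2.0 * z[1]

v, R = local_linear_surrogate(f, [1.0, 2.0])
print("surrogate weights:", v)
print("feature contributions:", R)
```

Because the toy function is itself linear, the surrogate recovers its weights up to numerical precision; for a nonlinear model the fitted weights would depend on the sampling width `sigma`.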

III-B Occlusion Analysis [159]

Occlusion Analysis is a particular type of perturbation analysis where we repeatedly test the effect on the neural network output of occluding patches or individual features in the input image [159, 165]:

$$R_i = f(x) - f\big(x \odot (1 - m_i)\big),$$

where $m_i$ is an indicator vector for the patch or feature to remove, and ‘$\odot$’ denotes the element-wise product. A heatmap can be built from these scores highlighting locations where the occlusion has caused the strongest decrease of the function. Because occlusion may produce visual artefacts, inpainting occluded patterns (e.g. using a generative model [2]) rather than setting them to gray was proposed as an enhancement. A further extension of occlusion analysis is Meaningful Perturbation [38], where an occluding pattern is synthesized, subject to a sparsity constraint, in order to engender the maximum drop of the function value $f$. The explanation is then readily given by the synthesized pattern. The perturbation-based approach was later embedded in a rate-distortion theoretical framework [94].
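The occlusion score above can be sketched in a few lines. The example assumes that occluded features are set to zero (the text also mentions gray values or inpainting), and the toy function `f` and the one-feature-per-mask layout are illustrative assumptions.

```python
def occlusion_scores(f, x, masks):
    """R_i = f(x) - f(x * (1 - m_i)) for each binary mask m_i."""
    fx = f(x)
    scores = []
    for m in masks:
        # Element-wise product with (1 - m): masked features become zero.
        occluded = [x_j * (1 - m_j) for x_j, m_j in zip(x, m)]
        scores.append(fx - f(occluded))
    return scores


# Hypothetical model with an interaction between features 1 and 2.
f = lambda x: x[0] + 2.0 * x[1] * x[2]

x = [1.0, 1.0, 3.0]
masks = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]  # occlude one feature at a time

scores = occlusion_scores(f, x, masks)
print("occlusion scores:", scores)
```

Only forward evaluations of `f` are needed, which is why the approach also applies to models without an accessible gradient. In this toy case the scores do not sum to the prediction, since each of the two interacting features is credited with the full interaction.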

III-C Integrated Gradients / SmoothGrad [142, 137]

The first method we consider is Integrated Gradients [142]. It explains the prediction by integrating the gradient $\nabla f(x)$ along some trajectory in input space connecting some root point $\tilde{x}$ to the data point $x$. The integration process addresses the problem of locality of the gradient information (cf. II-B), making it well-suited for explaining functions that have multiple scales. In the simplest form, the trajectory is chosen to be the segment connecting some root point to the data. Integrated gradients defines feature-wise scores as:

$$R_i(x) = (x_i - \tilde{x}_i) \cdot \int_0^1 \big[\nabla f\big(\tilde{x} + t \cdot (x - \tilde{x})\big)\big]_i \, dt$$

It can be shown that these scores satisfy $\sum_i R_i(x) = f(x)$ (for a root point with $f(\tilde{x}) = 0$) and thus constitute a complete explanation. If necessary, the method can be easily extended to any trajectories in input space. For implementation purposes, integrated gradients must be discretized. Specifically, the continuous trajectory is approximated by a sequence of data points $x^{(1)}, \dots, x^{(N)}$. Integrated gradients is then implemented as shown in Algorithm 1.

  $R \leftarrow 0$
  for $n = 1$ to $N - 1$ do
      $R \leftarrow R + \nabla f(x^{(n)}) \odot (x^{(n+1)} - x^{(n)})$
  end for
  return $R$
Algorithm 1 Integrated Gradients

The gradient can easily be computed using automatic differentiation. The operation ‘$\odot$’ denotes the element-wise product. The larger the number of discretization steps, the closer the output gets to the integral form, but the more computationally expensive the procedure gets.
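A minimal sketch of the discretized procedure is given below, assuming a straight-line path from the root point to the data point. The toy function and its hand-written gradient stand in for a network with automatic differentiation and are illustrative assumptions.

```python
def integrated_gradients(grad_f, x, x_root, n_steps=100):
    """Accumulate gradient * path increment along the segment x_root -> x."""
    d = len(x)
    # Equally spaced points on the segment, from the root point to the data point.
    path = [
        [r + (n / n_steps) * (x_i - r) for x_i, r in zip(x, x_root)]
        for n in range(n_steps + 1)
    ]
    R = [0.0] * d
    for n in range(n_steps):
        g = grad_f(path[n])
        for i in range(d):
            R[i] += g[i] * (path[n + 1][i] - path[n][i])
    return R


# Hypothetical toy model f(x) = x1^2 + 3*x2, with f(0, 0) = 0.
f = lambda x: x[0] ** 2 + 3.0 * x[1]
grad_f = lambda x: [2.0 * x[0], 3.0]

x = [2.0, 1.0]
x_root = [0.0, 0.0]

R_coarse = integrated_gradients(grad_f, x, x_root, n_steps=100)
R_fine = integrated_gradients(grad_f, x, x_root, n_steps=10000)
print("100 steps:  ", R_coarse, "sum =", sum(R_coarse))
print("10000 steps:", R_fine, "sum =", sum(R_fine))
print("prediction: ", f(x))
```

With more steps the sum of the scores approaches the prediction, which illustrates the trade-off between accuracy and computational cost noted above.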

Another popular gradient-based explanation method is SmoothGrad [137]. The function’s gradient is averaged over a large number of locations corresponding to small random perturbations of the original data point $x$:

$$\mathrm{SmoothGrad}(x) = \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, \sigma^2 I)} \big[\nabla f(x + \varepsilon)\big]$$

As the method’s name suggests, the averaging process ‘smooths’ the explanation, and in turn also addresses the shattered gradient problem described in Section II-B. (See also [102, 14, 135] for earlier gradient-based explanation techniques).
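The averaging step can be sketched as a simple Monte Carlo estimate. The one-dimensional toy function below, with a small high-frequency component that makes its gradient oscillate rapidly, is an illustrative assumption meant to mimic a noisy gradient; the noise level `sigma` and sample count are arbitrary choices.

```python
import math
import random


def smoothgrad(grad_f, x, sigma=0.1, n_samples=2000, seed=0):
    """Average the gradient over Gaussian perturbations of the data point."""
    rng = random.Random(seed)
    d = len(x)
    avg = [0.0] * d
    for _ in range(n_samples):
        x_noisy = [x_i + rng.gauss(0.0, sigma) for x_i in x]
        g = grad_f(x_noisy)
        for i in range(d):
            avg[i] += g[i] / n_samples
    return avg


# Hypothetical function f(x) = x + 0.01*sin(100*x): a slow trend plus a
# tiny oscillation whose derivative is as large as the trend itself.
grad_f = lambda x: [1.0 + math.cos(100.0 * x[0])]

x = [0.0]
raw = grad_f(x)
smooth = smoothgrad(grad_f, x)
print("raw gradient:     ", raw)
print("smoothed gradient:", smooth)
```

The raw gradient at this point is dominated by the oscillation, whereas the averaged gradient stays close to the slope of the underlying trend.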

In Section IV, we experiment with a combination of Integrated Gradients and SmoothGrad [137], where relevance scores obtained from Integrated Gradients are averaged over several integration paths that are drawn from some random distribution. The resulting method preserves the advantages of Integrated Gradients and further reduces the gradient noise.

III-D Layer-Wise Relevance Propagation [13]

The Layer-wise Relevance Propagation (LRP) method [13] makes explicit use of the layered structure of the neural network and operates in an iterative manner to produce the explanation. Consider the neural network

$$f(x) = f_L \circ \dots \circ f_1(x).$$

First, activations at each layer of the neural network are computed until we reach the output layer. The activation score in the output layer forms the prediction. Then, a reverse propagation pass is applied, where the output score is progressively redistributed, layer after layer, until the input variables are reached. The redistribution process follows a conservation principle analogous to Kirchhoff’s laws in electrical circuits. Specifically, all ‘relevance’ that flows into a neuron at a given layer flows out towards the neurons of the layer below. At a high level, the LRP procedure can be implemented as a forward-backward loop, as shown in Algorithm 2.

  $a^{(0)} \leftarrow x$
  for $l = 1, \dots, L$ do
      $a^{(l)} \leftarrow f_l(a^{(l-1)})$
  end for
  $R^{(L)} \leftarrow a^{(L)}$
  for $l = L, \dots, 1$ do
      $R^{(l-1)} \leftarrow \mathrm{relprop}(a^{(l-1)}, R^{(l)}, f_l)$
  end for
  return $R^{(0)}$
Algorithm 2 Layer-wise Relevance Propagation

The function relprop performs redistribution from one layer to the layer below and is based on ‘propagation rules’ defining the exact redistribution policy. Examples of propagation rules are given later in this section, and their implementation is provided in Appendix B. The LRP procedure is shown graphically in Fig. 4.

Fig. 4: Illustration of the LRP propagation procedure applied to a neural network. The prediction at the output is propagated backward in the network, using various propagation rules, until the input features are reached. The propagation flow is shown in red.

While LRP can in principle be performed in any forward computational graph, a class of neural networks which is often encountered in practice, and for which LRP comes with efficient propagation rules that can be theoretically justified (cf. Section V), is deep rectifier networks [42]. The latter can be in large part abstracted as an interconnection of neurons of the type:

$$a_k = \max\Big(0, \sum_{0,j} a_j w_{jk}\Big),$$

where $a_j$ denotes some input activation, and $w_{jk}$ is the weight connecting neuron $j$ to neuron $k$ in the layer above. The notation $\sum_{0,j}$ indicates that we sum over all neurons $j$ in the lower layer plus a bias term $w_{0k}$ with $a_0 = 1$. For this class of networks, various propagation rules have been proposed (cf. Fig. 4). For example, the LRP-$\gamma$ rule [98] defined as

$$R_j = \sum_k \frac{a_j (w_{jk} + \gamma w_{jk}^+)}{\sum_{0,j} a_j (w_{jk} + \gamma w_{jk}^+)} R_k \qquad (2)$$

redistributes based on the contribution of lower-layer neurons to the given neuron activation, with a preference for positive contributions over negative contributions. This makes it particularly robust and suitable for the lower-layer convolutions. Other propagation rules such as LRP-$\epsilon$ or LRP-0 are suitable for other layers [98]. Additional propagation rules have been proposed for special layers such as min/max pooling [13, 99, 68] and LSTM blocks [11, 9]. Furthermore, a number of other propagation techniques have been proposed [133, 132, 81, 160], with some of the rules overlapping with LRP for certain choices of parameters. For a technical overview of LRP including a discussion of the various propagation rules and further recent heuristics, see


An inspection of Eq. (2) shows an important property of LRP, that of conserving relevance from layer to layer. In particular, we can show that in the absence of bias terms, $\sum_j R_j = \sum_k R_k$. A further interesting property of this propagation rule is ‘smoothing’. Consider that relevance can be written as $R_j = a_j c_j$ and $R_k = a_k c_k$, i.e. as a product of activations and factors. Those factors can be directly related by the equation

$$c_j = \sum_k (w_{jk} + \gamma w_{jk}^+) \cdot \frac{\max\big(0, \sum_{0,j} a_j w_{jk}\big)}{\sum_{0,j} a_j (w_{jk} + \gamma w_{jk}^+)} \cdot c_k \qquad (3)$$

This equation can be interpreted as a smooth variant of the chain rule for derivatives used for computing the neural network gradient [97]. Thus, analogous to SmoothGrad [137], LRP also performs some gradient smoothing; however, it embeds it tightly into the deep architecture, so that only a single backward pass is required. In addition to smoothing, Eq. (3) can also be interpreted as a gradient that has been biased to positive values, an idea also found in methods such as DeconvNet [159] or Guided Backprop [139]. This modified gradient view on LRP can also be leveraged to achieve a simpler and more general implementation of LRP based on ‘forward hooks’, which we describe in the second part of Appendix B, and which we use to apply LRP on VGG-16 [136] and ResNet-50 [52] in Section IV.
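To make the forward-backward loop and the conservation property concrete, the following sketch applies the LRP-γ rule to a tiny bias-free ReLU network. It is not the implementation from Appendix B: the network weights, the input, the value of `gamma`, and the use of the same rule in every layer are assumptions made for illustration, and the input is assumed non-negative.

```python
def forward(weights, x):
    """Activations of all layers of a bias-free ReLU network.

    weights[l][j][k] is the weight from lower neuron j to upper neuron k.
    """
    activations = [x]
    for W in weights:
        a = activations[-1]
        z = [sum(a[j] * W[j][k] for j in range(len(a))) for k in range(len(W[0]))]
        activations.append([max(0.0, z_k) for z_k in z])
    return activations


def relprop_gamma(a, W, R_upper, gamma=0.25):
    """LRP-gamma: R_j = sum_k a_j (w_jk + gamma w_jk^+) / z_k * R_k."""
    Wg = [[w + gamma * max(0.0, w) for w in row] for row in W]
    n_low, n_up = len(a), len(R_upper)
    z = [sum(a[j] * Wg[j][k] for j in range(n_low)) for k in range(n_up)]
    # Neurons carrying zero relevance are skipped (this also avoids 0/0).
    return [
        sum(a[j] * Wg[j][k] / z[k] * R_upper[k]
            for k in range(n_up) if R_upper[k] != 0.0)
        for j in range(n_low)
    ]


def lrp(weights, x, gamma=0.25):
    """Forward pass, then layer-by-layer redistribution down to the input."""
    acts = forward(weights, x)
    R = acts[-1]  # relevance at the top is the output activation
    for l in reversed(range(len(weights))):
        R = relprop_gamma(acts[l], weights[l], R, gamma)
    return R, acts[-1]


# Hypothetical network: 2 inputs -> 2 hidden neurons -> 1 output.
weights = [
    [[1.0, -1.0], [0.5, 2.0]],
    [[1.0], [-0.5]],
]
R, out = lrp(weights, [1.0, 2.0])
print("input relevance:", R)
print("sum:", sum(R), "output:", out[0])
```

Since the toy network has no bias terms, the input relevances sum to the output score, as stated by the conservation property.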

IV Comparing Explanation Methods

The methods presented in Section III highlight the variety of approaches available for attributing the prediction of a deep neural network to its input features. This variety of techniques also translates into a variety of qualities of explanations. Illustrative examples of images and the explanation of predicted evidence for the ground truth class as produced by the different explanation methods are shown in Fig. 5. Occlusion Analysis is performed by occluding patches of pixels at a fixed stride. Integrated Gradients performs several integration steps starting from random points near the origin in order to add smoothing (cf. Appendix A), which requires multiple function evaluations. LRP explanations are obtained by applying the same LRP rules as in [98]. We observe the following qualitative properties of the explanations: Occlusion-based explanations are coarse and indicate relevant regions rather than the relevant pixel features. Integrated Gradients produces very fine pixel-wise explanations containing substantial amounts of evidence both in favor of and against the prediction (red and blue pixels). LRP preserves the fine explanation structure but tends to produce fewer negative scores and attributes relevance to whole features rather than individual pixels.

Fig. 5: Examples of images from ImageNet [121] with classes ‘space bar’, ‘beacon/lighthouse’, ‘snow mobile’, ‘viaduct’, ‘greater swiss mountain dog’. Images are correctly predicted by the VGG-16 [136] neural network, and shown along with an explanation of the predictions. Different explanation methods lead to different qualities of explanation.

In practice, it is important to reach an objective assessment of how good an explanation is. Unfortunately, evaluating explanations is made difficult by the fact that it is generally impossible to collect ‘ground truth’ explanations. Building such ground truth explanations would indeed require the expert to understand how the deep neural network decides.

Standard machine learning models are usually evaluated by the utility (expected risk) of their decision behavior (e.g. [150]). Transposing this concept of maximizing utility to the domain of explanation, quantifying the utility of the explanation would first require defining what the ultimate target task is (the explanation being only an intermediate step), and then assessing by how much the use of the explanation by the human increases their performance on the target task, compared to not using it (see e.g. [14, 34, 123]). Because such end-to-end evaluation schemes are hard to set up in practice, general desiderata for ML explanations have been proposed [143, 95]. Common ones include (1) faithfulness, (2) human-interpretability, and (3) the possibility to practically apply it to an ML model or an ML task (e.g. algorithmic efficiency of the explanation algorithm).

IV-A Faithfulness

Faithfulness is a property of the explanation to reliably and comprehensively represent the decision structure of the analyzed ML model. A practical technique to quantify faithfulness is ‘pixel-flipping’ [122]. The pixel-flipping procedure tests the faithfulness of an explanation by verifying whether removing the features highlighted by the explanation (as most relevant) leads to a strong decay of the network prediction abilities. The procedure is summarized in Algorithm 3.

  pfcurve = [ ]
  for $p$ in $\mathrm{argsort}(-R)$ do
      $x \leftarrow x - \{x_p\}$ (remove pixel $p$ from the image).
      pfcurve.append($f(x)$)
  end for
  return  pfcurve
Algorithm 3 Pixel-Flipping

Pixel-flipping runs from the most to the least relevant input features, iteratively removing them and monitoring the evolution of the neural network output. The series of recorded decaying prediction scores can be plotted as a curve. The faster the curve decreases, the more faithful the explanation method is w.r.t. the decision of the neural network. The pixel-flipping curve can be computed for a single example, or averaged over a whole dataset in order to get a global estimate of the faithfulness of an explanation algorithm under study.
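The procedure can be sketched on a toy model as follows. Removed features are simply set to a fill value of zero here, which is an assumption (the experiments in this section use inpainting instead), and the linear model and the two relevance vectors are hypothetical.

```python
def pixel_flipping(f, x, R, fill_value=0.0):
    """Remove features from most to least relevant and record the model output."""
    x = list(x)  # work on a copy so the caller's input is left intact
    pfcurve = []
    order = sorted(range(len(R)), key=lambda i: -R[i])  # most relevant first
    for p in order:
        x[p] = fill_value  # 'remove' feature p
        pfcurve.append(f(x))
    return pfcurve


# Hypothetical linear model with known feature contributions.
w = [3.0, 1.0, 2.0]
f = lambda x: sum(w_i * x_i for w_i, x_i in zip(w, x))
x = [1.0, 1.0, 1.0]

R_faithful = [3.0, 1.0, 2.0]  # matches the true contributions
R_poor = [1.0, 3.0, 2.0]      # ranks the least important feature first

curve_faithful = pixel_flipping(f, x, R_faithful)
curve_poor = pixel_flipping(f, x, R_poor)
print("faithful explanation:", curve_faithful)
print("poor explanation:    ", curve_poor)
```

The curve obtained from the faithful ranking drops faster, which is the behavior the pixel-flipping criterion rewards.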

Fig. 6 applies pixel-flipping to the three considered explanation methods and on two models: VGG-16 [136] and ResNet-50 [52]. At each step of pixel-flipping, removed pixels are imputed using a simple inpainting algorithm, which avoids introducing visual artefacts in the image.

Fig. 6: Pixel-flipping experiment for testing faithfulness of the explanation. We remove pixels found to be the most relevant by each explanation method and verify how quickly the output of the network decreases.

We observe that for all explanation methods, removing relevant features quickly destroys class evidence. In particular, they perform much better than a random explanation baseline. Fine differences can however be observed between the methods: For example, LRP performs better on VGG-16 than on ResNet-50. This can be explained by VGG-16 having a more explicit structure (standard pooling operations for VGG-16 vs. strided convolution for ResNet-50), which better supports the process of relevance propagation (see also [117] for a discussion of the effect of structure on the performance of explanation methods).

A second observation in Fig. 6 is that Integrated Gradients has by far the highest decay rate initially but stagnates in the later phase of the pixel-flipping procedure. The reason for this effect is that IG focuses on pixels to which the network is the most sensitive, without however being able to identify fully comprehensively the relevant pattern in the image. This effect is illustrated in Fig. 6 (middle) on a zoomed-in exemplary image of class ‘greater swiss mountain dog’, where the image after 10% flipping has lost most of its prediction score, but visually appears almost intact. Effectively, IG has built an adversarial example [144, 107], i.e. an example whose visual content clearly disagrees with the prediction at the output of the network. We note that Occlusion and LRP do not run into such adversarial examples. For these methods, pixel-flipping steadily and comprehensively removes features until class evidence has totally disappeared.

Overall, the pixel-flipping algorithm characterizes various aspects of the faithfulness of an explanation method. We note however that faithfulness of an explanation does not tell us how easy it will be for a human to make sense of that explanation. We address this other key requirement of an explanation in the following section.

IV-B Human Interpretability

Here, we discuss whether the presented explanation techniques deliver results that are meaningful to the human, i.e. whether the human can gain understanding into the classifier’s decision strategy from the explanation. Human interpretability is hard to define in general [95]. Different users may have different capabilities at reading explanations and at making sense of the features that support them [116, 105]. For example, the layman may wish for a visual interpretation, even approximate, whereas the expert may prefer an explanation supported by a larger vocabulary, including precise scientific or technical terms [14].

For the image classification setting, interpretability can be quantified in terms of the amount of information contained in the heatmap (e.g. as measured by the file size). An explanation with a small associated file size is more likely to be interpretable by a human. The table below shows average file sizes (in bytes111JPEG compression using the Pillow image processing library for python with a quality setting of 75/100 (standard settings).) associated to the various explanation techniques and for two neural networks.

Model       Occlusion   Integrated Gradients   LRP
VGG-16      698.4       5795.0                 1828.3
ResNet-50   693.6       5978.0                 2928.2
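The idea of using compressed size as a proxy for the amount of information in a heatmap can be sketched as follows. Note that this sketch swaps in lossless `zlib` compression from the Python standard library instead of the JPEG compression used for the table, and that both heatmaps are synthetic; the resulting numbers are therefore not comparable to the table.

```python
import random
import zlib


def compressed_size(heatmap):
    """Size in bytes of a 2-D heatmap (integer values 0..255) after zlib compression."""
    raw = bytes(v for row in heatmap for v in row)
    return len(zlib.compress(raw, 9))


random.seed(0)
n = 64
# Coarse heatmap: a single uniform high-relevance block on a zero background.
coarse = [[255 if 16 <= r < 48 and 16 <= c < 48 else 0 for c in range(n)]
          for r in range(n)]
# Fine-grained heatmap: every pixel carries its own independent value.
noisy = [[random.randrange(256) for _ in range(n)] for _ in range(n)]

size_coarse = compressed_size(coarse)
size_noisy = compressed_size(noisy)
```

The coarse, localization-like heatmap compresses to far fewer bytes than the pixel-wise noisy one, which mirrors the ordering between occlusion-type and gradient-type heatmaps discussed here.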

We observe that occlusion produces the lowest file size and is therefore the most ‘interpretable’. It indeed only presents to the user rough localization information without going into the details of which exact feature has supported the decision as done e.g. by LRP. On the other side of the interpretability spectrum we find Integrated Gradients. In the explanations this last method produces, every single pixel contains information, and this makes it clearly overwhelming to the human.

In practice, neural networks do not need to be explained in terms of input features. For example, the TCAV method [71] considers directional derivatives in the space of activations (where the directions correspond to higher-level human-interpretable concepts) in place of the input gradient. Similar higher-level interpretations are also possible using the Occlusion and LRP methods, respectively by perturbing groups of activations corresponding at a given layer to a certain concept, or by stopping the LRP procedure at the same layer and pooling scores on some group of neurons representing the desired concept.

IV-C Applicability and Runtime

Faithfulness and interpretability do not fully characterize the overall usefulness of an explanation method. To characterize usefulness, we also need to determine whether the explanation method is applicable to a range of models that is sufficiently large to include the neural network model of interest, and whether explanations can be obtained quickly with finite compute resources.

Occlusion-based explanations are the easiest to implement. These explanations can be obtained for any neural network, even those that are not differentiable. This also includes networks for which we do not have the source code and where we can only access their predictions through some online server. Technically, occlusion can therefore be used to understand the predictions of third-party models.

Integrated Gradients instead requires access to the neural network gradient for each prediction. Given that most machine learning models are differentiable, this method is widely applicable, also for neural networks with complex structures such as ResNets [52] or SqueezeNets [62]. Integrated Gradients is also easily implemented in state-of-the-art ML frameworks such as PyTorch or TensorFlow, where we can make use of automatic differentiation.
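The black-box character of occlusion can be illustrated with a short sketch. It only calls the model and never inspects it. The function `black_box`, the one-dimensional "patches", and the replacement value are hypothetical choices made to keep the example small; an image model would slide a two-dimensional patch instead.

```python
def occlusion_scores(predict, x, patch=1, replacement=0.0):
    """Score each patch by the drop in prediction observed when it is occluded."""
    reference = predict(x)
    scores = []
    for start in range(0, len(x), patch):
        occluded = list(x)
        for i in range(start, min(start + patch, len(x))):
            occluded[i] = replacement
        scores.append(reference - predict(occluded))
    return scores


# Hypothetical black box: we can query it but assume no access to its internals.
def black_box(x):
    return max(0.0, 1.5 * x[0] - 0.5 * x[1] + 2.0 * x[2] * x[3])


x = [1.0, 2.0, 0.5, 1.0]
scores = occlusion_scores(black_box, x, patch=2)
```

Each entry of `scores` requires one additional model evaluation, which is why the method needs neither gradients nor source code.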

Layer-wise Relevance Propagation assumes that the model is structured as (or can be converted to [67, 68]) a neural network with a canonical sequence of layers, for example, an alternation of linear/convolution layers, ReLU layers, and pooling layers. This stronger requirement and the implementation overhead caused by explicitly accessing the different layers (cf. Appendix B) will however be offset by a last characteristic we consider in this section, which is the computational cost associated with producing the explanation. A runtime comparison of the three explanation methods studied here is given in the table below (measured in explanations per second). Explanations are computed in batches of (up to) 16 samples on a GPU, with explanation techniques implemented in PyTorch; results are averaged over 10 repetitions.

Model       Occlusion   Integrated Gradients   LRP
VGG-16      2.4         5.8                    204.1
ResNet-50   4.0         8.7                    188.7

Occlusion is the slowest method as it requires reevaluating the function for each occluded patch. For image data, the number of patches, and hence the runtime of Occlusion, grows quadratically as the step size decreases, making high-resolution explanations computationally prohibitive with this method. Integrated Gradients inherits pixel-wise resolution from the gradient computation, but requires multiple iterations for the integration. The runtime is further increased if performing an additional loop of smoothing. LRP is the fastest method in our benchmark by an order of magnitude. The LRP runtime is only approximately three times higher than that of computing a single forward pass. This makes LRP particularly convenient for the large-scale analyses we introduce in Section VI-B where an explanation needs to be produced for every single example in the dataset.
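The dependence of the occlusion cost on the step size can be checked with a few lines of arithmetic. The image size of 224 x 224 and the patch size of 32 are assumed values for illustration, not the settings of our benchmark.

```python
def num_occlusion_evaluations(height, width, patch, stride):
    """Number of occluded inputs (one forward pass each) for a sliding square patch."""
    rows = (height - patch) // stride + 1
    cols = (width - patch) // stride + 1
    return rows * cols


# Assumed setting: 224 x 224 input, 32 x 32 patch, decreasing step sizes.
counts = {s: num_occlusion_evaluations(224, 224, patch=32, stride=s)
          for s in (32, 16, 8, 4)}
```

Each halving of the step size roughly quadruples the number of forward passes, which is the quadratic growth that limits the resolution of occlusion heatmaps in practice.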

V Unifying Views on Explanation Methods

In parallel to developing explanation methods that address application requirements such as faithfulness, interpretability, usability and runtime, some works have focused on building theoretical foundations for the problem of explanation [99, 92] and establishing theoretical connections between the different methods [133, 5, 100].

Here, we consider frameworks based on Taylor expansions. This includes the basic Taylor decomposition procedure [13, 17] as well as an extension of it, the Deep Taylor Decomposition [99]. We then show how Occlusion, Integrated Gradients, and LRP intersect with these mathematical approaches for certain choices of parameters.

V-A Taylor Decomposition

Taylor expansions are a well-known mathematical framework to decompose a function into a series of terms associated with different degrees and combinations of input variables. The Taylor expansion of some smooth and differentiable function $f: \mathbb{R}^d \to \mathbb{R}$ at some reference point $\tilde{x}$ is given by:

$$f(x) = f(\tilde{x}) + \sum_{i=1}^{d} [\nabla f(\tilde{x})]_i \, (x_i - \tilde{x}_i) + \frac{1}{2} \sum_{i=1}^{d} \sum_{i'=1}^{d} [\nabla^2 f(\tilde{x})]_{ii'} \, (x_i - \tilde{x}_i)(x_{i'} - \tilde{x}_{i'}) + \dots$$

where $\nabla f$ and $\nabla^2 f$ denote the gradient and the Hessian respectively. The zero-order term is the function value at the reference point and is zero if choosing a root point. There are as many first-order terms as there are dimensions, and each of them is bound to a particular input variable. Thus, they offer a natural way of attributing a function value onto individual linear components. There are as many second-order terms as there are pairs of ordered variables, and even more third-order and higher-order terms. When the function is approximately locally linear, second and higher-order terms can be ignored, and we get the following simple attribution scheme:

$$R_i = [\nabla f(\tilde{x})]_i \, (x_i - \tilde{x}_i),$$

a product of the gradient and the input relative to our root point. In the general case, there is no closed-form approach to find the root point, and it is instead obtained using an optimization technique.
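The first-order attribution scheme can be sketched numerically as follows. The finite-difference gradient, the bias-free linear model, and the choice of the origin as root point are assumptions made to keep the example self-contained; for a general nonlinear model the root point would have to be searched for.

```python
def gradient(f, x, h=1e-6):
    """Central finite-difference approximation of the gradient of f at x."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


def taylor_attribution(f, x, root):
    """First-order terms: gradient at the root point times (x - root)."""
    g = gradient(f, root)
    return [gi * (xi - ri) for gi, xi, ri in zip(g, x, root)]


# Hypothetical model: bias-free linear function, for which the origin is a root point.
w = [1.0, -2.0, 0.5]
f = lambda x: sum(wi * xi for wi, xi in zip(w, x))
x = [3.0, 1.0, 4.0]
R = taylor_attribution(f, x, root=[0.0, 0.0, 0.0])
```

Because the model is linear and the reference point is a root, the scores in `R` sum to `f(x)`; for nonlinear models this only holds approximately, to the extent that higher-order terms are small.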

V-B Deep Taylor Decomposition

An alternate way of formalizing the problem of attributing a function onto input features is offered by the recent framework of Deep Taylor Decomposition (DTD) [99]. DTD assumes the function is structured as a deep neural network and seeks to attribute the prediction onto input features by performing a Taylor decomposition at every neuron of each layer instead of directly on the whole neural network function. It assumes that the output score has already been attributed onto some layer of activations $(a_k)_k$, with attribution scores denoted by $R_k$. DTD then considers the function $R_k(a)$, where $a = (a_j)_j$ is the collection of neuron activations in the layer below. These quantities are illustrated in Fig. 7.

Fig. 7: Graphical illustration of the function $R_k(a)$ that DTD seeks to decompose on the input dimensions. Because $R_k(a)$ is complex, it is often replaced by an analytically more tractable model $\widehat{R}_k(a)$ that only depends on local activations.

The function $R_k(a)$ is typically very complex as it corresponds to a composition of multiple forward and backward computations. This function can however be approximated locally by some ‘relevance model’ $\widehat{R}_k(a)$, the choice of which will depend on the method we have used for computing $R_k$. We then compute a Taylor expansion of this function:

$$\widehat{R}_k(a) = \widehat{R}_k(\tilde{a}) + \sum_j [\nabla \widehat{R}_k(\tilde{a})]_j \, (a_j - \tilde{a}_j) + \dots$$

The linear terms define ‘messages’ that can be redistributed to neurons in the lower layer, and messages received by a given neuron at a certain layer are summed to form a total relevance score:

$$R_j = \sum_k [\nabla \widehat{R}_k(\tilde{a}^{(k)})]_j \, (a_j - \tilde{a}_j^{(k)}) \qquad (4)$$

Here, we have added an index $(k)$ to the root point to make explicit that different root points can be used for expanding different neurons. The redistribution procedure is iterated from the top layer towards the lower layers, until the input features are reached.

V-C Embedding Explanation Methods into the (Deep) Taylor Decomposition Framework

Having described the simple and Deep Taylor Decomposition frameworks, we now present some results from the literature showing how some explanation methods reduce for certain choices of parameters to these frameworks. The different connections we outline here are summarized in Fig. 8.

Fig. 8: Relation between explanation methods and Taylor decomposition / Deep Taylor Decomposition (DTD), for certain choices of hyperparameters and models.

We start with a connection between occlusion-based explanation and Taylor decomposition.

Proposition 1.

When applied to homogeneous linear models (of the type $f(x) = w^\top x$), occlusion with patch size $1$ and replacement value $0$ is equivalent to a Taylor decomposition with root point $\tilde{x} = 0$.

This is shown by the chain of equations $R_i = f(x) - f(x - x_i e_i) = w_i x_i = [\nabla f(\tilde{x})]_i \, (x_i - \tilde{x}_i)$, where $e_i$ denotes the $i$-th canonical basis vector. Integrated Gradients can also be reduced to a Taylor decomposition, and this connection also holds in particular for deep rectifier networks (without biases):

Proposition 2.

When applied to deep rectifier networks of the type $f(x) = \rho(W_L \, \rho(\dots \rho(W_1 x)))$ with $\rho = \max(0, \cdot)$, Integrated Gradients with integration path $\{t x : 0 < t \le 1\}$ is equivalent to Taylor decomposition at $\tilde{x} = \varepsilon x$ with $\varepsilon$ almost zero.

This can be shown by making the preliminary observation that a deep rectifier network is linear with constant gradient on the segment $\{t x : 0 < t \le 1\}$, and then applying the chain of equations $R_i = x_i \int_0^1 [\nabla f(t x)]_i \, dt = [\nabla f(\varepsilon x)]_i \, x_i \approx [\nabla f(\varepsilon x)]_i \, (x_i - \varepsilon x_i)$. This connection, along with the observation that a single gradient evaluation of a deep network can be noisy (cf. Section II-B), speaks against integrating on the segment $\{t x : 0 < t \le 1\}$. For this reason, we have opted in the experiments of Section IV to use a smoothed version of IG. A further result shows an equivalence between a ‘naive’ version of LRP-0 and Taylor decomposition.

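Proposition 3.

For deep rectifier nets of the type $f(x) = \rho(W_L \, \rho(\dots \rho(W_1 x)))$, applying LRP-0 at each layer is equivalent to a Taylor decomposition at $\tilde{x} = \varepsilon x$ with $\varepsilon$ almost zero.

This result can be derived by taking the LRP formulation of Eq. (3), specializing it to LRP-0, i.e. $R_j = \sum_k \frac{a_j w_{jk}}{\sum_{j'} a_{j'} w_{j'k}} R_k$, and writing relevance scores as a product $R_k = a_k c_k$. This equation then reduces to:

$$c_j = \sum_k w_{jk} \, 1_{z_k > 0} \, c_k$$

where $z_k = \sum_j a_j w_{jk}$ and $R_j = a_j c_j$. This equation is exactly the one that propagates gradients in a deep rectifier network. Hence, the input relevance computed by LRP becomes $R_i = x_i \, [\nabla f(x)]_i$, for which we have already shown the equivalence to simple Taylor decomposition in the proposition above.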
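The equivalences stated in Propositions 2 and 3 can be verified numerically on a small example. The two-layer bias-free ReLU network below, with a linear top layer and hand-picked weights, is a hypothetical toy model; the Riemann sum is one simple way to approximate the integral and is not the smoothed variant used in our experiments.

```python
# Hypothetical bias-free rectifier network: f(x) = v . max(0, W1 x)
W1 = [[1.0, -1.0, 0.5],
      [-0.5, 2.0, 1.0]]
v = [1.0, 0.5]


def forward(x):
    z = [sum(w * xi for w, xi in zip(row, x)) for row in W1]
    a = [max(0.0, zk) for zk in z]
    return z, a, sum(vk * ak for vk, ak in zip(v, a))


def grad(x):
    z, _, _ = forward(x)
    gate = [1.0 if zk > 0 else 0.0 for zk in z]
    return [sum(v[k] * gate[k] * W1[k][i] for k in range(len(v)))
            for i in range(len(x))]


def gradient_times_input(x):
    return [g * xi for g, xi in zip(grad(x), x)]


def integrated_gradients(x, steps=50):
    """Riemann sum of the gradient along the segment t*x, 0 < t <= 1."""
    avg = [0.0] * len(x)
    for s in range(1, steps + 1):
        g = grad([s / steps * xi for xi in x])
        avg = [m + gi / steps for m, gi in zip(avg, g)]
    return [xi * m for xi, m in zip(x, avg)]


def lrp0(x):
    """LRP-0 applied at the top layer and then at the hidden layer."""
    z, a, out = forward(x)
    top = sum(ak * vk for ak, vk in zip(a, v))
    R_hidden = [ak * vk / top * out for ak, vk in zip(a, v)]
    R_input = [0.0] * len(x)
    for k, Rk in enumerate(R_hidden):
        if z[k] > 0:  # inactive ReLU neurons receive no relevance
            for i, xi in enumerate(x):
                R_input[i] += xi * W1[k][i] / z[k] * Rk
    return R_input


x = [1.0, 2.0, 1.0]
gi, ig, lrp = gradient_times_input(x), integrated_gradients(x), lrp0(x)
```

On this input the three attributions coincide up to rounding, and they sum to the network output, as expected for a positively homogeneous function.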

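Proposition 4.

For deep rectifier networks of the type $f(x) = \rho(W_L \, \rho(\dots \rho(W_1 x)))$, applying LRP-$\gamma$ is equivalent to performing one step of deep Taylor decomposition and choosing the nearest root point on the line $\{a - t \, a \odot (1 + \gamma \, 1_{w_k \succ 0}) : t \in \mathbb{R}\}$.

We choose the relevance model $\widehat{R}_k(a) = \max(0, \sum_j a_j w_{jk}) \cdot c_k$ with $c_k$ constant (cf. [98] for a justification). Injecting the root point in the first-order terms of DTD (summands of Eq. (4)) gives:

$$R_{j \leftarrow k} = w_{jk} c_k \cdot t \, a_j \, (1 + \gamma \, 1_{w_{jk} > 0}) = \frac{a_j \, (w_{jk} + \gamma w_{jk}^+)}{\sum_{j'} a_{j'} \, (w_{j'k} + \gamma w_{j'k}^+)} \, R_k$$

where $w_{jk}^+ = \max(0, w_{jk})$ and $t$ is resolved using the conservation equation $\sum_j R_{j \leftarrow k} = R_k$. LRP-0 is a special case of LRP-$\gamma$ with $\gamma = 0$. A similar procedure with another choice of reference point gives LRP-$\epsilon$ (cf. [98]).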
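A single-layer sketch of the LRP-γ redistribution may help to see the conservation property at work. The layer has no bias, the weights and relevance values are made up for illustration, and the function name `lrp_gamma_layer` is hypothetical.

```python
def lrp_gamma_layer(a, W, R_out, gamma):
    """Redistribute R_out (one value per output neuron k) onto the inputs j.

    W[j][k] is the weight from input j to output k; positive weights are
    amplified by the factor (1 + gamma) before computing the shares.
    """
    R_in = [0.0] * len(a)
    for k, Rk in enumerate(R_out):
        contrib = [a[j] * (W[j][k] + gamma * max(0.0, W[j][k]))
                   for j in range(len(a))]
        denom = sum(contrib)
        if denom != 0.0:
            for j in range(len(a)):
                R_in[j] += contrib[j] / denom * Rk
    return R_in


# Hypothetical layer: 2 inputs, 2 outputs, both outputs active.
a = [1.0, 2.0]
W = [[1.0, -1.0],
     [0.5, 2.0]]
R_out = [2.0, 3.0]

R_gamma0 = lrp_gamma_layer(a, W, R_out, gamma=0.0)
R_gamma1 = lrp_gamma_layer(a, W, R_out, gamma=1.0)
```

With `gamma=0.0` the function reduces to LRP-0. Increasing `gamma` shrinks the share of negative contributions, while the total relevance stays equal to `sum(R_out)` in both cases.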

VI Explanations for Unsupervised Learning and Beyond

Deep neural networks have been shown to perform extremely well on classification or regression tasks. For other problems such as clustering or anomaly detection, however, k-means and kernel-based models such as one-class SVMs, respectively, have remained highly popular workhorses. As these models are not given in the form of a neural network, and furthermore are composed of strongly nonlinear functions such as the exponential, a direct application of methods designed in the context of linear models and DNNs is not feasible.

VI-A Neuralization

Neuralization [67, 68] was recently introduced as a framework for explainable machine learning, where non-neural architectures are translated into neural networks in order to enhance their explanation properties. In other words, we identify a neural network structure that is functionally equivalent to the model to explain and ensure that this functional ‘copy’ is furthermore only composed of ‘canonical’ neural network functions, e.g. linear or pooling. This general concept of neuralization was first introduced in the context of explanation methods for unsupervised learning, namely, one-class SVMs [68] and k-means clustering models [67], where combinations of kernel RBF functions can be rewritten as pooling operations over linear or distance functions.

VI-A1 Neuralizing Clustering

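Consider a kernel k-means model of the type studied in [31]. For this type of model, and assuming a Gaussian kernel $k(x, x') = \exp(-\gamma \|x - x'\|^2)$, the probability ratio in favor of a given cluster $\omega_c$ can be expressed as:

$$\frac{P(\omega_c \mid x)}{1 - P(\omega_c \mid x)} = \frac{\big(Z_c^{-1} \sum_{i \in \mathcal{C}_c} \exp(-\gamma \|x - x_i\|^2)\big)^{\beta}}{\sum_{k \neq c} \big(Z_k^{-1} \sum_{j \in \mathcal{C}_k} \exp(-\gamma \|x - x_j\|^2)\big)^{\beta}} \qquad (5)$$

This is a power-assignment model applied to the kernel density functions of each cluster. The sets $\mathcal{C}_c$ and $\mathcal{C}_k$ are the representatives for clusters $c$ and $k$, and $Z_c$, $Z_k$ are the respective normalization factors. An example of a decision function produced by this model for a three-cluster problem is shown in Fig. 9 (left). Clearly, Eq. (5) is a priori not composed of neurons. However, its logarithm $f_c(x)$ can be reorganized into the following sequence of detection and pooling functions [67]:

$$h_{ijk} = w_{ij}^\top x + b_{ijk}$$
$$h_{jk} = \log \sum_{i \in \mathcal{C}_c} \exp(h_{ijk})$$
$$h_{k} = -\log \sum_{j \in \mathcal{C}_k} \exp(-h_{jk})$$
$$f_c(x) = -\log \sum_{k \neq c} \exp(-\beta \, h_{k})$$

where $w_{ij} = 2\gamma \, (x_i - x_j)$ and $b_{ijk} = \gamma \, (\|x_j\|^2 - \|x_i\|^2) + \log Z_k - \log Z_c$ are the parameters of the first linear layer. This layer is followed by a hierarchy of log-sum-exp computations interpretable as canonical max- and min-pooling operations. The neuralized version of kernel k-means is depicted in Fig. 9 (right).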
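The claim that a kernel-based cluster logit can be rewritten as a linear layer followed by log-sum-exp pooling can be checked numerically. The sketch below assumes a power-assignment model over Gaussian kernel densities, with soft-assignment exponent `beta`, kernel width `gamma`, and normalization factors `Z`; the toy clusters and all parameter values are made up, and the exact parameterization in [67] may differ.

```python
import math


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def dot(p, q):
    return sum(a * b for a, b in zip(p, q))


def logit_direct(x, clusters, Z, c, gamma, beta):
    """log of i_c^beta / sum_{k != c} i_k^beta, with i_k a Gaussian kernel density."""
    dens = [sum(math.exp(-gamma * sq_dist(x, p)) for p in pts) / Z[k]
            for k, pts in enumerate(clusters)]
    rest = sum(d ** beta for k, d in enumerate(dens) if k != c)
    return math.log(dens[c] ** beta / rest)


def logit_neuralized(x, clusters, Z, c, gamma, beta):
    total = 0.0
    for k, pts_k in enumerate(clusters):
        if k == c:
            continue
        pooled_j = 0.0
        for xj in pts_k:
            # Layer 1: linear detections h_ijk = w_ij . x + b_ijk
            h = [dot([2 * gamma * (a - b) for a, b in zip(xi, xj)], x)
                 + gamma * (dot(xj, xj) - dot(xi, xi)) + math.log(Z[k] / Z[c])
                 for xi in clusters[c]]
            h_jk = math.log(sum(math.exp(val) for val in h))  # Layer 2: soft max over i
            pooled_j += math.exp(-h_jk)
        h_k = -math.log(pooled_j)                             # Layer 3: soft min over j
        total += math.exp(-beta * h_k)
    return -math.log(total)                                   # Layer 4: soft min over k


# Hypothetical toy problem: three clusters in two dimensions.
clusters = [[(0.0, 0.0), (0.0, 1.0)], [(3.0, 0.0), (3.0, 1.0)], [(0.0, 4.0)]]
Z = [2.0, 2.0, 1.0]
x = (0.5, 0.5)
direct = logit_direct(x, clusters, Z, c=0, gamma=0.5, beta=2.0)
neural = logit_neuralized(x, clusters, Z, c=0, gamma=0.5, beta=2.0)
```

Both functions return the same value up to rounding, and the logit is positive because `x` lies close to the representatives of cluster 0. Only the first layer depends on the input directly, which is what makes the structure amenable to layer-wise propagation.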

Fig. 9: Left: Kernel k-means applied to a toy two-dimensional problem with three clusters. Red and blue colors in the background represent the positive and negative values of the logit function for a given cluster. Right: 4-layer neural network equivalent of the kernel k-means logit score [67].

VI-A2 Neuralizing SoftMax Layers

The concept of neuralization can also be extended for the purpose of improving the explanation of deep neural networks. So far, we have explained quantities at the output of the last linear layer. Because these output quantities are unnormalized, they may respond positively to several classes, thereby lacking selectivity. The problem of class selectivity was highlighted e.g. in [47, 63, 98] and practical solutions were proposed. Here, we present the ‘neuralization’ approach in [98], which first makes the observation that ratios of probabilities as given by the top-layer soft-assignment model can be expressed as:

$$\frac{P(\omega_c \mid x)}{1 - P(\omega_c \mid x)} = \frac{\exp(z_c)}{\sum_{k \neq c} \exp(z_k)}$$

where $z_k$ denotes the output of the last linear layer for class $k$. The logarithm of this ratio can then be reorganized in the two-layer neural network

$$h_k = z_c - z_k, \qquad \eta_c = \operatorname{smin}_{k \neq c} \{h_k\}$$

where $\operatorname{smin}_{k} \{h_k\} = -\log \sum_{k} \exp(-h_k)$ is a soft minimum implemented by a log-sum-exp computation. The DNN processing up to the output neuron or up to the output of the neuralized logit model is illustrated in Fig. 10 along with LRP explanations for these two quantities associated with the class ‘passenger_car’.
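The identity behind the neuralized softmax layer can be verified with a few lines of code. The score vector `z` is an arbitrary example, and the function names are chosen for illustration only.

```python
import math


def softmax(z):
    m = max(z)  # subtract the maximum for numerical stability
    e = [math.exp(val - m) for val in z]
    s = sum(e)
    return [val / s for val in e]


def logit_direct(z, c):
    """Log-odds of class c computed from the softmax probability."""
    p = softmax(z)[c]
    return math.log(p / (1.0 - p))


def logit_neuralized(z, c):
    """Two-layer form: score differences followed by a soft minimum."""
    h = [z[c] - zk for k, zk in enumerate(z) if k != c]  # layer 1: linear
    return -math.log(sum(math.exp(-hk) for hk in h))     # layer 2: log-sum-exp pooling


z = [2.0, 1.0, -0.5]  # hypothetical unnormalized class scores
direct = logit_direct(z, c=0)
neural = logit_neuralized(z, c=0)
```

The first layer makes explicit that evidence for a competing class enters the logit with a negative sign, which is the mechanism behind the sign change discussed for the two explanations in Fig. 10.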

Fig. 10: Deep neural network to which we append a neuralized version of the log-likelihood ratio [98]. Considering the latter quantity instead of the DNN output leads to a different explanation.

In the first explanation, both the passenger car and the locomotive can be seen to contribute. In the second explanation, the locomotive turns blue. The latter is indeed speaking for the class locomotive, which mechanistically lowers the probability for the class ‘passenger_car’ [98]. This example shows that it is important in presence of correlated features to precisely define what quantity (unnormalized score or logit) we would like to explain.

We note that while neuralization has served here to support LRP-type explanations, the concept could potentially be extended to other explanation frameworks. The identified neural network structure may help to gain further understanding of the model or provide intermediate representations that are potentially useful to solve related tasks.

VI-B Dataset-Wide Statistics on Explanations

In practice, we may not only be interested in explaining how the DNN predicts a single data point, but also in the statistics of these explanations for a whole dataset. This may be useful to validate the model in a more complete manner. Let $f$ be a function that takes a data point $x \in \mathbb{R}^d$ as input and predicts evidence for a certain class. Consider a dataset $X = (x_1, \dots, x_N)$ of $N$ such data points. The total class evidence can be represented as a function $g: \mathbb{R}^{N \times d} \to \mathbb{R}$ where:

$$g(X) = \sum_{n=1}^{N} f(x_n)$$

This composition of the neural network output and a sum-pooling remains explainable by all methods surveyed here; however, the explanation is now high-dimensional ($N \times d$).

VI-B1 Relevance Pooling

Practically, we may not be interested in explaining every single data point in terms of every single input feature. More relevant information for the user would be the overall contribution of a subgroup of features $\mathcal{I}$ on a group of data points $\mathcal{G}$ (cf. [82, 100]). In particular, the Integrated Gradients and LRP methods surveyed here produce explanations that satisfy the conservation property:

$$\sum_{n=1}^{N} \sum_{i=1}^{d} R_{ni} = \sum_{n=1}^{N} f(x_n)$$

where $R_{ni}$ denotes the relevance of feature $i$ for the prediction $f(x_n)$ on data point $x_n$, and that can be converted to a coarse-grained explanation

$$R_{\mathcal{G}\mathcal{I}} = \sum_{n \in \mathcal{G}} \sum_{i \in \mathcal{I}} R_{ni}$$

that still satisfies the desired conservation property when the groups of data points and the groups of features each form a partition. As an illustration of the concept, we consider the ‘Concrete Compressive Strength’ example of Section II. Data points are grouped in three k-means clusters, and features are grouped in two sets: the singleton $\{\text{age}\}$, and the set of all remaining features describing concrete composition. The pooled analysis is illustrated in Fig. 11.

Fig. 11: Pooled analysis. Top: Feature-wise contributions for the prediction on three clusters of the Concrete Compressive Strength Dataset [156]. Bottom: Coarse-grained explanations obtained by pooling contributions on data clusters and groups of features.

This analysis gives further insight into our predictive model. We observe that most distinguishing factors, especially age, contribute negatively to strength. In other words, a ‘typical’ age and composition is a recipe for strength whereas high/low values tend to be explanatory for weakness. Notably, one data cluster stands out by having composition features that are explanatory for strength.
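The pooling operation used in this analysis amounts to summing relevance scores over groups of data points and groups of features. The following sketch uses a made-up relevance matrix and hypothetical group names; it is not the concrete data of Fig. 11.

```python
def pool_relevance(R, data_groups, feature_groups):
    """Sum R[n][i] over each (group of data points, group of features) pair."""
    return {
        g: {name: sum(R[n][i] for n in rows for i in cols)
            for name, cols in feature_groups.items()}
        for g, rows in data_groups.items()
    }


# Hypothetical relevance scores: 4 data points (rows), 3 features (columns).
R = [[0.5, -0.25, 0.25],
     [1.0, 0.5, -0.5],
     [-0.75, 0.25, 0.0],
     [0.25, 0.25, 0.5]]
data_groups = {"cluster_a": [0, 1], "cluster_b": [2, 3]}
feature_groups = {"age": [0], "composition": [1, 2]}

pooled = pool_relevance(R, data_groups, feature_groups)
total_fine = sum(sum(row) for row in R)
total_coarse = sum(val for group in pooled.values() for val in group.values())
```

Since both groupings are partitions, `total_coarse` equals `total_fine`: the coarse-grained explanation conserves the same total evidence as the fine-grained one.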

VI-B2 Spectral Relevance Analysis (SpRAy) [85]

While in Section VI-B1 we have reduced the dimensionality through pooling, other analyses are possible. For example, the SpRAy method [85] does not assume a fixed pooling structure (e.g. a partition of data points and a partition of features), and instead applies a clustering of explanations in order to identify prototypical decision behaviors. Algorithm 4 outlines the three-step procedure used by SpRAy:

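Algorithm 4 Spectral Relevance Analysis
  for $n = 1$ to $N$ do
    $R_n \leftarrow \text{explain}(x_n, f)$
    $\bar{R}_n \leftarrow \text{normalize}(R_n)$
  end for
  $\text{cluster}(\bar{R}_1, \dots, \bar{R}_N)$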

The method first produces an explanation for each data point. In principle, any explanation method can be used, e.g. occlusion, integrated gradients, or LRP. Explanations are then normalized (e.g. blurred and standardized) to become invariant to small pixel-wise or saliency variations. Finally, a clustering algorithm is applied to the normalized explanation, and examples with the same cluster index can be understood as being associated with some prototypical decision strategy, e.g. looking at the object, looking at the background, etc. Alternately, the clustering step can be replaced by a low-dimensional embedding step to produce a visual map of the overall decision structure of the ML model.
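The three steps described above can be sketched in a few lines. Several simplifications are assumed here: the `explain` argument is a placeholder for any explanation method, normalization is reduced to standardization, and a plain k-means with deterministic initialization is swapped in for the clustering step, whereas the name of the method suggests a spectral approach. The four toy heatmaps are made up.

```python
def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def standardize(r):
    """Step 2 (simplified): zero mean and unit variance per explanation."""
    mean = sum(r) / len(r)
    std = (sum((val - mean) ** 2 for val in r) / len(r)) ** 0.5 or 1.0
    return [(val - mean) / std for val in r]


def kmeans(points, k, iters=20):
    """Step 3 (stand-in): k-means with farthest-point initialization."""
    centers = [list(points[0])]
    while len(centers) < k:
        centers.append(list(max(points,
                                key=lambda p: min(dist2(p, c) for c in centers))))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist2(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def spray(inputs, explain, k):
    explanations = [explain(x) for x in inputs]          # step 1: explain
    normalized = [standardize(r) for r in explanations]  # step 2: normalize
    return kmeans(normalized, k)                         # step 3: cluster


# Hypothetical heatmaps: two focus on the first half ("object"),
# two on the second half ("background"). The identity stands in for `explain`.
heatmaps = [[1.0, 0.9, 0.0, 0.1], [0.8, 1.0, 0.1, 0.0],
            [0.0, 0.1, 1.0, 0.9], [0.1, 0.0, 0.9, 1.0]]
labels = spray(heatmaps, explain=lambda x: list(x), k=2)
```

Examples that share a cluster index are candidates for sharing a decision strategy, and inspecting a few members per cluster is usually enough to decide whether that strategy is meaningful.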

Altogether, relevance pooling and SpRAy support a variety of dataset-wide analyses that are useful to explore and characterize the decision behavior of complex models trained on large datasets. Some successful applications are reviewed in Section VIII-A.

VII Worked-Through Examples

In this paper, we have motivated the use of explanation in the context of deep learning models and showcased some methods for obtaining explanations. Here, we aim to take a practical look for the user to assess when explanation is required, what are common issues with applying explanation techniques / setting their hyperparameters, and finally, how to make sure that the produced explanations deliver meaningful insights for the human.

VII-A Example 1: Validating a Face Classifier

In the first worked-through example we wish to train an accurate classifier for predicting a person’s age from images of faces. We will show how to use explanation for this task, in particular, to verify that the model is not using “wrong” features for its decisions.

Let us use for this the Adience benchmark dataset [35] providing 26,580 images captured ‘in the wild’ and labelled into eight ordinal groups of age ranges {(0-2), (4-6), (8-13), (15-20), (25-32), (38-43), (48-53), (60+)}.

Because the number of examples in this dataset is limited and likely not sufficient to extract good visual features, we adopt the common approach of starting with a generic pretrained classifier and fine-tuning it on our task. We use a VGG-16 [136] neural network architecture pretrained on ImageNet [29]. First test results after training using Stochastic Gradient Descent (SGD) [87] report reasonable performance, with exact and 1-off [130, 35] prediction accuracy of 56.5% and 90.0%, respectively (results have been averaged over the official pre-selected five-fold dataset split [35]). Here, the 1-off accuracy considers predictions of (up to) one age group away from the true label as correct.
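The two accuracy measures can be computed as in the sketch below, where age groups are encoded as ordinal indices 0 to 7. The label vectors are made up for illustration and are unrelated to the reported results.

```python
def exact_and_one_off_accuracy(y_true, y_pred):
    """Exact accuracy and 1-off accuracy for ordinal class indices."""
    n = len(y_true)
    exact = sum(t == p for t, p in zip(y_true, y_pred)) / n
    one_off = sum(abs(t - p) <= 1 for t, p in zip(y_true, y_pred)) / n
    return exact, one_off


# Hypothetical labels: index 0 = (0-2), ..., index 7 = (60+).
y_true = [0, 3, 7, 4, 2]
y_pred = [0, 4, 5, 4, 1]
exact, one_off = exact_and_one_off_accuracy(y_true, y_pred)
```

The 1-off measure is by construction at least as high as the exact accuracy, since every exact hit also counts as a 1-off hit.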

In order to understand the learned prediction strategies of our model and to verify that it uses meaningful features in the training data, we take an off-the-shelf explanation software, the LRP Toolbox [83] for Caffe [65], and, in a first attempt, configure LRP to apply a single uniform propagation rule on all layers. Explanations are shown for a given image in Fig. 12 (first row).

Fig. 12: Different configurations of LRP applied to VGG-16. Top: A single uniform LRP rule applied to the whole network, explaining the decision wrt. the equidistantly chosen age group labels (0-2), (25-32) and (60+), respectively. Bottom: Application of the layer-dependent LRP-CMP decomposition strategy.

Some insight can be readily obtained from these explanations, e.g. the classifier has learned to ignore the background and makes its assessment mainly based on the actual person in the image. However, we also observe that explanations are overly complex with frequent local sign changes, making it hard to extract further insights, especially about which features contribute to different age groups. This leads to our first recommendation:

Choose the explanation technique and its parameters

Specifically, we will now try an alternate LRP preset called ‘LRP-CMP’ that applies a composite strategy [84, 98, 75] where different rules are applied at different layers. Explanations obtained with this new rule are given in Figure 12 (bottom). The new explanations highlight features in a much more interpretable way and we also start to better understand what speaks — according to the model — in favor of or against certain age groups. For example, explanations amusingly reveal baldness as a feature corresponding to both age groups (0-2) and (60+). In the shown sample, baldness contributes evidence for the classes (0-2) and (60+), while it speaks against the age group (25-32). Relatedly, the expression of the man’s chin and mouth area contradicts class (0-2) more than class (60+), but ‘looks like’ it would belong to a person aged (25-32).

Let’s now move back to the initial question, namely how to verify that the model is using the right features for predicting. While the decision structure of the model was meaningful in Fig. 12, we would like to verify it is also the case for other test cases. Figure 13 (top) shows further samples from the Adience dataset; a woman labelled (60+) and three images of the same male labelled (48-53) with smiles of varying intensities.

Fig. 13: LRP heatmaps demonstrating the effects of ImageNet [29] pretraining (middle) compared to additional IMDB-WIKI [120] pretraining (bottom). All heatmaps show the model decision wrt. age group (60+).

We apply LRP with the same preset ‘LRP-CMP’ on these images. LRP evidence for each image for the class (60+) is shown in Fig. 13 (middle). Surprisingly, according to the model, broad smiling contradicts the prediction of belonging to the age group (60+). Smiling is however clearly a confounding factor, which reliably predicts age group only as long as the training data contains no cases of older people smiling broadly. This prediction strategy is related to the ‘Clever Hans’ effect [85]. (‘Clever Hans’ was a famous horse at the beginning of the 20th century, which was believed by his trainer to be capable of performing arithmetic calculations. Subsequent analyses revealed that the horse was not performing arithmetic calculations but detecting cues on the face of his trainer to produce the right answers. In machine learning, the term ‘Clever Hans’ can be used to designate strategies that mimic the expected behavior but are based on unexpected correlations or artefacts in the data [85].) We can therefore formulate our second recommendation:

Unmask ‘Clever Hans’ examples

Alternately, instead of screening through multiple images manually, we can also use techniques such as SpRAy [85], which perform such analysis systematically for large datasets such as ImageNet (see also Section VIII-A for successful applications).

For the examples showcased in Fig. 13, other features may compensate for this effect (here, almost all other features of the centered faces affect the decision towards this age group positively). For less clear-cut cases, however, the effect will cause errors. This may explain why the accuracy of the ImageNet-based model is not very high, and it suggests that the test set accuracy may drop dramatically on new datasets, e.g. ones comprising more old people smiling.

Naturally, we would like our model to be robust to a subject’s mood when predicting his or her age. We thus need to find a way to prevent Clever Hans behaviors, e.g., prevent the model from associating smiling with age. One reason the model has learned that connection in the first place is the extreme population imbalance among the age groups of the Adience dataset, a problem which is shared with many other datasets of face images, e.g. [61, 120]. We therefore add a second pretraining phase in between the ImageNet initialization and the actual training based on the Adience data, using the considerably larger IMDB-WIKI [120] dataset. The IMDB-WIKI dataset consists of 523,051 images from 20,284 celebrities (460,723 images from the Internet Movie Database (IMDb) and 62,328 images from Wikipedia) at different ages, labelled with 101 labels (0-100 years, one label per year). The IMDB-WIKI dataset also suffers from highly imbalanced label populations. However, we follow [120] and re-normalize the age distribution by under-sampling the more frequent classes until approximately 260,000 samples are selected overall. Furthermore, we assume that since the IMDB-WIKI dataset is composed of photos of public figures (taken at publicized events), smiles in higher age groups will be more frequent than in the Adience dataset, which has been captured ‘in the wild’. A comparison of performance on the Adience benchmark of the original model (pretrained on ImageNet only) and the improved model is given in the table below.

Model                  Exact accuracy (%)   1-off accuracy (%)
ImageNet pretrained    56.5                 90.0
IMDB-WIKI pretrained   63.0                 96.0
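The under-sampling step used to rebalance the pretraining data can be sketched as follows. The per-class cap is an assumed mechanism for illustration (the procedure of [120] may select samples differently), and the label list is made up.

```python
import random
from collections import defaultdict


def undersample(labels, cap, seed=0):
    """Return sorted sample indices, keeping at most `cap` samples per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    kept = []
    for idxs in by_class.values():
        kept.extend(idxs if len(idxs) <= cap else rng.sample(idxs, cap))
    return sorted(kept)


# Hypothetical age labels with a strongly over-represented class.
labels = [30] * 50 + [60] * 10 + [85] * 3
kept = undersample(labels, cap=10)
kept_counts = {y: sum(labels[i] == y for i in kept) for y in set(labels)}
```

Frequent classes are cut down to the cap while rare classes are kept in full, which flattens the label distribution without duplicating any sample.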

Not only did the additional (and more domain-specific) pretraining step improve the generalization performance of the VGG-16 model, it also prevented the model from associating smiling exclusively with younger age groups. Figure 13 (bottom) shows LRP heatmaps for all four samples and age label (60+). For the woman, the model has shifted its attention from the hair and clothes to the face region and neck, and no longer considers the smile as contradictory to the class. A similar effect can be observed for the samples showing the male person. The model’s age prediction capabilities can no longer be attacked by just smiling into the camera. However, by introducing the IMDB-WIKI pretraining step, we have apparently replaced the smile-related Clever Hans strategy with another one, related to the presence of glasses in images of males in higher age groups. This leads to our third recommendation:

Iteratively validate and improve the model

We can do so until the model solely relies for its predictions on meaningful face features. For that, choosing a better pretraining may not be sufficient, and other more advanced interventions may be required.

VII-B Example 2: Identifying Gender-Specific Speech Features

After demonstrating how explanations can be used to unmask Clever Hans strategies, or more generally validate a classifier, we will now discuss another use case, where explanations are this time applied not to get a better model, but to gain new (scientific) insights. In this worked-through example, we will show that explanations can be used to identify gender-specific features in speech.

Before going into the analysis, let us first introduce the data and the model used for the speaker’s gender classification task. As training data we use the recently recorded AudioMNIST [18] dataset, comprising 30,000 audio recordings of spoken digits from 60 different speakers, with 50 repetitions per digit and speaker, at a 48 kHz sampling frequency. Next to annotations for spoken digit (0-9) and gender of speaker (48 male, 12 female), the dataset provides labels for speaker age, accent and origin. We begin by training a deep neural network model on the raw waveform data, which is first downsampled to 8 kHz and randomly padded with zeroes before and after the recorded signal to obtain an 8000-dimensional input vector per sample. A CNN architecture comprised of six 1d-convolution layers interleaved with max-pooling layers and topped off with three fully connected layers [18], with ReLU activation units after each weighted layer, is prepared for optimization. In order to prevent the model from overfitting on the more frequent population of samples labelled as ‘male’, we (randomly) select 12 speakers from both classes. The model is then trained and evaluated in a 4-fold cross-validation setting, in which the 24 speakers are grouped into four sets of 3 male speakers and 3 female speakers. Each of the four splits thus contains 3000 waveform recordings (6 speakers with 10 digits and 50 repetitions each). Two splits are used for training, while the two remaining splits are reserved for validation and testing, respectively. Test set accuracy is reported as the average (± standard deviation) across all splits.
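The speaker-wise construction of the cross-validation splits can be sketched as follows. Speaker identifiers and the random seed are hypothetical; only the group sizes follow the setup described in this section.

```python
import random


def speaker_folds(male, female, n_folds=4, seed=0):
    """Assign speakers to folds so that each fold holds equally many of each gender."""
    rng = random.Random(seed)
    male, female = list(male), list(female)
    rng.shuffle(male)
    rng.shuffle(female)
    return [male[f::n_folds] + female[f::n_folds] for f in range(n_folds)]


male = [f"m{i:02d}" for i in range(12)]    # hypothetical speaker ids
female = [f"f{i:02d}" for i in range(12)]
folds = speaker_folds(male, female)

# Each speaker contributes 10 digits x 50 repetitions.
recordings_per_fold = [len(fold) * 10 * 50 for fold in folds]
```

Splitting by speaker rather than by recording keeps every voice within a single fold, so that test accuracy reflects generalization to unseen speakers.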

With the goal of understanding the data better by explaining the model, we consider two examples predicted by the network to be male and female and apply LRP to visualize those predictions. Here, the wave form is represented as a scatter plot where each time step is color-coded by its relevance. Results are shown in Fig. 14.

Fig. 14: Explanations based on waveform representation of speech data. Correct prediction of a female (top) and male (bottom) subject. The waveform data is visualized as a scatter plot of 8000 discrete measurements, color coded according to relevance attribution for the true class label.

The explanations reveal that the model predominantly uses the outer hull of the waveform signal for decision making. For a human observer, however, these explanations are difficult to interpret due to the limited accessibility of the data representation in the first place (see Fig. 14). Although the model performs reasonably well on waveform data, it is hard to obtain a deeper understanding beyond the network’s modus operandi based on relevance maps, due to the limitations imposed via the data representation itself.

Make your input features interpretable

We therefore opt to change the data representation for improved interpretability. More precisely, we exchange the raw waveform representation of the data with a corresponding 228 × 230 (frequency × time) spectrogram representation by applying a short-time Fourier transform (time segment length of 455 samples, with 420 samples overlap), cropped to a 227 × 227 matrix by discarding the highest frequency bin and the last three time segments. Consequently, we also exchange the neural network architecture and use an AlexNet [80] model, which is able to process the transformed input data using 2d-convolution operators.
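The spectrogram dimensions follow from the transform parameters by simple arithmetic. The sketch below assumes a one-sided spectrum, zero-padding of half a segment on each side of the signal, and padding of the last incomplete segment; these boundary conventions are assumptions (they match the defaults of common STFT implementations) and are not stated in the text.

```python
def stft_shape(n_samples, segment, overlap):
    """Return (number of frequency bins, number of time segments)."""
    hop = segment - overlap
    n_freq = segment // 2 + 1                    # one-sided spectrum
    padded = n_samples + 2 * (segment // 2)      # assumed: half a segment of zeros per side
    n_time = -(-(padded - segment) // hop) + 1   # assumed: last partial segment is padded
    return n_freq, n_time


shape = stft_shape(n_samples=8000, segment=455, overlap=420)
cropped = (shape[0] - 1, shape[1] - 3)  # drop highest frequency bin, last three segments
```

Under these assumptions, 8000 samples yield 228 frequency bins and 230 time segments, and the stated cropping gives the square 227 × 227 input.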

Figure 15 visualizes four input spectrograms, with corresponding relevance maps (only relevance values with more than 10% relative amplitude) drawn on top.

Fig. 15: Left: Spectrogram representations of digits ‘zero’ spoken by female speakers ‘vp12’ and ‘vp56’. Right: Spectrogram representations of digits ‘zero’ spoken by male speakers ‘vp2’ and ‘vp25’. Relevance maps are shown wrt. the samples’ true classes.

Heatmap visualizations based on spectrogram input data are more informative than those for waveform data and reveal that the model has learned to distinguish between male and female speakers based on the lowest fundamental frequencies (male speakers, Fig. 15 (right)), and immediate harmonics (female speakers, Fig. 15 (left)) shown in the spectrogram. Many incorrectly classified samples with ground truth label ‘male’ show large gaps between frequency bands often occurring in samples from female speakers. Note that these insights are consistent with the literature [147].

Gain insights by explaining predictions

Notably, the gain in interpretability from switching from a waveform representation to a spectrogram representation does not come at the price of model performance. On the contrary, model performance even increases slightly.

VIII Successful Use of Interpretable ML

Interpretation methods can be applied for a variety of purposes. Some works have aimed to understand the model’s prediction strategies, e.g., in order to validate the model [85]. Others visualize the learned representations and try to make the model itself more interpretable [58]. Finally, other works have sought to use explanations to learn about the data, e.g., by visualizing interesting input-prediction patterns extracted by a deep neural network model in scientific applications [146]. Technically, explanation methods have been applied to a broad range of models ranging from simple bag-of-words-type classifiers or logistic regression [13, 24] to feed-forward or recurrent deep neural networks [13, 133, 11, 9], and more recently also to unsupervised learning models [67, 68]. At the same time these methods were able to handle different types of data, including images [13], speech [18], text [10], and structured data such as molecules [127] or genetic sequences [151].

Some of the first successes in interpreting deep neural networks have occurred in the context of image classification, where deep convolutional networks have also demonstrated very high predictive performance [80, 52]. Explanation methods have made it possible for the first time to open these “black boxes” and obtain insights into what the models have actually learned and how they arrive at their predictions. For instance, the works [135, 106] (also known in this context as “deep dreams”) highlighted surprising effects when analyzing the inner behavior of deep image classification models by synthesizing meaningful preferred stimuli. They report that the preferred stimuli for the class ‘dumbbell’ would indeed contain a visual rendering of a dumbbell, but the latter would systematically come with an arm attached to it [103], demonstrating that the output neurons fire not only for the object of interest but also for correlated features.

Another surprising finding was reported in [82]. Here, interpretability (more precisely, the ability to determine which pixels are being used for prediction) helped to reveal that the best performing ML model in a prestigious international competition, namely the PASCAL visual object classification (VOC) challenge, was actually relying partly on artefacts. The high performance of the model on the class “horse” could indeed be attributed to detecting a copyright tag present in the bottom left corner of many horse images of the dataset, rather than detecting the actual horse in the image (the presence of these artefacts in the benchmark dataset had gone unnoticed for almost a decade). Other effects of a similar type have been reported for other classes and datasets in many other works, e.g., in [116] models were shown to distinguish between the classes “Husky” and “Wolf” solely based on the presence or absence of snow in the background.

These discoveries have been made rather accidentally by researchers carefully analysing suspicious explanations. It is clear that such laborious manual inspection of heatmaps does not scale to big datasets with millions of examples. Therefore, systematic approaches to the interpretation of ML models have recently gained increased attention.

VIII-A Systematic Interpretation of ML Models on Big Data

This section describes two examples of a systematic analysis of a large number of heatmaps. In the first case, the goal of the analysis is to systematically find data artefacts picked up by the model (e.g., copyright tags in horse images), whereas the second analysis aims to carefully investigate the learning process of a deep model, in particular the emergence of novel prediction strategies during training.

The process of systematically extracting data artefacts was automated by a method called Spectral Relevance Analysis (SpRAy) [85], where after computing LRP-type explanations on a whole dataset (cf. Section VI-B), a cluster-based analysis was applied to the collection of produced explanations to extract prototypical decision behaviors. The SpRAy analysis would for example reveal, for a shallow Fisher Vector model trained on the Pascal VOC 2007 dataset, that images of the class ‘horse’ would be predicted as such using a finite number of prototypical decision behaviors ranging from detecting the horse itself to detecting weakly related features such as horse racing poles, or clear artefacts such as copyright tags [82]. The analysis was later applied to the decisions of a state-of-the-art VGG-16 deep neural network classifier trained on ImageNet, and here again, interesting insights about the decision structure could be identified [6]. For example, certain predictions, e.g. for the class ‘garbage truck’, could be found by SpRAy to rely on a watermark in the bottom-left corner of the image (see Fig. 16). This watermark, which is only present in specific images, would thus be used by the model as a confounding factor (or artefact) to artificially improve prediction accuracy on this benchmark (or, in the case of [6], deteriorate model performance, as the identified confounding feature is exclusive to the training data).

Fig. 16: SpRAy analysis of the decision behavior of a pretrained VGG-16 model on images of the class ‘garbage truck’. Top: Low-dimensional embedding of the explained decisions for the class ‘garbage truck’. Points highlighted in red are outliers. Bottom: Images and corresponding decisions for some of the points highlighted in red.

Such behavior of the ML classifier can be referred to as ‘Clever Hans’ behavior [85]. For machine learning models that have implemented a Clever Hans behavior, relying solely on the accuracy metric without inspecting the model’s decision structure would produce an overconfident assessment of the true model accuracy. The model would likely perform erratically once applied in a real-world setting, where, e.g., the copyright tag is decoupled from the concept of a horse or garbage truck respectively. Here, the ability to explain the decision-making of the model and to automatically analyse these explanations on a very large dataset was therefore a key ingredient for more robustly assessing the model’s strengths and weaknesses and potentially improving it.
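The idea of clustering explanations to surface distinct decision behaviors can be illustrated with a heavily simplified sketch. It is not the SpRAy implementation of [85]: the Gaussian affinity, the eigengap heuristic, the two-way split by the sign of the second eigenvector, and the synthetic heatmaps (relevance either on a central "object" or in a bottom-left "artefact" corner) are all assumptions chosen for illustration.

```python
import numpy as np

def laplacian_spectrum(heatmaps, sigma=1.0):
    """Eigenvalues/eigenvectors of the normalised graph Laplacian built from heatmaps."""
    X = heatmaps.reshape(len(heatmaps), -1)
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    affinity = np.exp(-sq_dists / (2.0 * sigma ** 2))     # Gaussian similarity graph
    d_inv_sqrt = 1.0 / np.sqrt(affinity.sum(axis=1))
    laplacian = np.eye(len(X)) - d_inv_sqrt[:, None] * affinity * d_inv_sqrt[None, :]
    return np.linalg.eigh(laplacian)                      # eigenvalues in ascending order

def make_heatmap(rows, cols, rng, size=8, noise=0.01):
    """Synthetic heatmap: small noise plus unit relevance in one rectangular region."""
    h = noise * rng.standard_normal((size, size))
    h[rows, cols] += 1.0
    return h

rng = np.random.default_rng(0)
on_object = [make_heatmap(slice(3, 5), slice(3, 5), rng) for _ in range(20)]
on_corner = [make_heatmap(slice(6, 8), slice(0, 2), rng) for _ in range(5)]
heatmaps = np.stack(on_object + on_corner)

eigvals, eigvecs = laplacian_spectrum(heatmaps)
# Eigengap heuristic: the largest jump among the smallest eigenvalues suggests
# how many distinct decision behaviours are present.
n_strategies = int(np.argmax(np.diff(eigvals[:6]))) + 1
# For two groups, the sign of the second eigenvector separates them.
labels = (eigvecs[:, 1] > 0).astype(int)
minority = np.flatnonzero(labels == np.argmin(np.bincount(labels)))
print(n_strategies, minority)
```

The small cluster is the natural candidate for manual inspection, since a rare but consistent relevance pattern (here, the corner region) is what an artefact-driven strategy would look like.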

Another example of a systematic interpretation of ML models can be found in the context of reinforcement learning, in particular board and video games. Here large amounts of data can be easily generated (simulated games) and used to carefully analyse the strategies of an ML model and how these strategies emerge during training. On games such as the arcade game Atari Breakout, the computer player would progressively learn strategies commonly employed by human players such as ‘tunnel-digging’ [96, 158]. The work of [85] analyzes the emergence of this advanced ‘tunnel-digging’ technique using interpretable ML. First, LRP-type pixel-wise explanations of the player’s decisions were produced at various time steps and training stages. The produced collection of explanations was then pooled (cf. Section VI-B1) on bounding boxes representing some key visual elements of the game, specifically, the ball, the paddle, and the tunnel. Pooled quantities could then be easily and quantitatively monitored over the different stages of training. The analysis is shown in Fig. 17.

Fig. 17: Analysis of the learning process of a deep model playing Atari Breakout. The curves show the development of the relative relevance of different game objects (ball, paddle, tunnel) averaged over six runs.

We observe that the neural network model would first learn to play conventionally by keeping track of the ball and the paddle, and only at a later stage of the training process would learn to focus on the tunnel area, allowing the ball to go past the wall and bounce repeatedly in the top area of the screen. This analysis highlights, in a way that is easily interpretable to a human, the multi-stage nature of learning, in particular how the learning machine progressively develops increasingly sophisticated game playing strategies. Overall, this summarized information on the decision structure of the model and on the evolution of the learning process could prove to be crucial in learning improved models on purposely consolidated datasets. It could also prove useful for characterizing the different stages of learning and developing more efficient training procedures.
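Pooling pixel-wise relevance over bounding boxes, as used for the Breakout analysis, can be sketched in a few lines. The box coordinates, the toy heatmap, and the normalisation by total absolute relevance are assumptions for illustration; the original analysis may define relative relevance differently.

```python
import numpy as np

def pooled_relevance_shares(heatmap, boxes):
    """Relevance summed inside each (top, left, bottom, right) box, relative to the
    total absolute relevance of the frame (assumed normalisation)."""
    total = np.abs(heatmap).sum()
    return {name: float(heatmap[top:bottom, left:right].sum() / total)
            for name, (top, left, bottom, right) in boxes.items()}

# Toy 6x6 'frame' with relevance placed on three hypothetical game objects.
heatmap = np.zeros((6, 6))
heatmap[0, 0:2] = 1.0      # tunnel region
heatmap[3, 3] = 4.0        # ball
heatmap[5, 2:4] = 1.0      # paddle
boxes = {"tunnel": (0, 0, 1, 2), "ball": (3, 3, 4, 4), "paddle": (5, 2, 6, 4)}
shares = pooled_relevance_shares(heatmap, boxes)
print(shares)
```

Computing these shares for heatmaps taken at successive training checkpoints gives one curve per object, which is the kind of quantity plotted in Fig. 17.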

VIII-B Interpretable Deep Models in the Sciences

In the last subsection we demonstrated the use of explanation techniques for systematically analysing models and verifying that they have learned valid and meaningful prediction strategies. Once verified to not be Clever Hans predictors, non-linear models offer a lot of potential for the sciences to detect new interesting patterns in the data, which may lead to an improved understanding of the underlying natural structures and processes — the primary goal of scientists. So far this was not possible, because non-linear models were actually considered to be “black boxes”, i.e., scientists had to resort to the use of linear models (see e.g. [51, 93]), even if this came at the expense of predictivity. In the following we will show that interpretation methods remove this restriction and bring the full potential of non-linear methods to scientific disciplines.

Let us start with the discussion of scientific problems that concern images and can thus directly benefit from the advances made in image-based machine learning in recent years. Figure 18 (a) shows such an example: the task of predicting tissue type from histopathology imagery. The work of [20] addresses this problem using interpretable machine learning; more precisely, it proposes an interpretable bag-of-words prediction pipeline with invariances to rotation, shift and scale of the input data (note that recently interpretable deep neural networks have also been used for this task, e.g., [50]). For the verification of the prediction results, relevance maps are computed, offering per-pixel scores which indicate the presence of tumorous structures. Figure 18 (a) demonstrates how LRP heatmaps computed for different target cell types can be combined to obtain computationally predicted fluorescence images. The explanations are histopathologically meaningful and may potentially give interesting information about which tissue components are most indicative of cancer. Further analyses such as the identification, localization and counting of cells, e.g., lymphocytes, can be performed on these explanations (see [74]). In addition to visual explanations, [161] also generate a free-text pathology report to clarify the decision of the classifier.

Fig. 18: Different applications of interpretable machine learning in the sciences. (a) LRP heatmaps merged into a computationally predicted fluorescence image. Here, red identifies cancer, green shows lymphocytes and blue is stroma. Adapted from [20]. (b) Example of LRP relevance maps for a single EEG trial of an imagined movement (each class). The matrices indicate the relevance of each time point (abscissa) and EEG channel (ordinate). Below the matrix the relevance information for two single time points is plotted as a scalp topography. Adapted from [140]. (c) A whole-brain fMRI volume is decoded using a deep neural network and the decoding decision is explained in terms of voxel-wise relevance using LRP, localizing brain areas corresponding to the predicted cognitive state. Adapted from [146]. (d) Attributions are assigned to the atoms by a graph convolutional neural network and the classification decisions with respect to the molecule’s mutagenicity are explained. Interpretability feedback reveals that the model has correctly identified molecular substructures known to interact with (human) DNA. Adapted from [114]. (e) The predicted atom score describing protein-ligand interaction is explained with CLRP (green corresponding to a more favorable score). Adapted from [55]. (f) Visualization of the filter weights learned on the first convolutional layer of a deep neural network trained for galaxy morphology classification. Adapted from [164].

Various other works apply interpretable machine learning to image-based analyses in the sciences, especially in medical applications [57]. For instance, [79] use deep multiple instance learning to classify and segment microscopy images using only whole image level annotations. The work of [39] introduces a model-agnostic interpretation method for the analysis of x-ray images, which not only visualizes the elements that have contributed to each decision, but also produces descriptive sentences to clarify the decision of the classifier. The combined explanations are well adopted by doctors and are shown to be more informative than the visualisations or generated text alone. Interpretable and non-linear models have been successfully applied to many other tasks, including the detection of lesions in diabetic retinopathy data [115], the validation of predictions in dermatology [157], plant stress classification [41], or the analysis of galaxy morphologies [164]. The latter work aims to classify galaxy morphologies into five classes (completely round smooth, in-between smooth, cigar-shaped smooth, edge-on and spiral) using a convolutional neural network, and the convolution filters as well as activation patterns are analysed to gain insights into the features learned by the model to solve this task (see Fig. 18 (f)).

Interpretable ML methods have also demonstrated their potential beyond the image domain, e.g., on scientific problems concerning time series data. For instance, the work of [140] presents one of the first uses of interpretable deep neural networks in cognitive neurosciences, specifically in brain computer interfacing [33] where linear methods are still the most widely used filtering methods [22, 51]. The results in [140] show that deep models achieve similar decoding performances (deep models usually require larger amounts of training data to have an advantage over linear techniques) and learn neurophysiologically plausible patterns (see Fig. 18 (b)), namely a focus on the contralateral sensorimotor cortex, an area where the event-related desynchronization occurs during motor imagery. However, in contrast to the patterns computed with conventional approaches [22, 51], which only allow visualizing the aggregated information (average activity) per class, the explanations computed with LRP are available for every single input of the deep learning classifier, i.e., for every time point of individual trials (see Fig. 18 (b)). This increased resolution (knowing which sources are relevant at each time point, instead of just having the average patterns) may contribute to a better understanding of cognitive processes in the brain.

Another application of interpretable machine learning in cognitive neuroscience is presented in [146], which applies deep learning to whole-brain fMRI data. The method, termed DeepLight, outperforms well-established local or linear decoding methods such as the generalized linear model and searchlight (see [146]). An adaptation of LRP maintains interpretability and verifies that the model’s predictions are based on physiologically appropriate brain areas for the classified cognitive states. Figure 18 (c) visualizes exemplar voxels, which are used by the deep model to accurately decode the state from the fMRI signal. These voxels of high relevance have been shown to correspond very well to the active areas described in the fMRI literature (see [146]). Note that here too the deep model not only gives an advantage in terms of performance (i.e., better decoding accuracy) compared to the local or linear baseline methods, but its explanations are also provided for every single input, i.e., for every fMRI volume over time. This increased resolution makes it possible to study the spatio-temporal dynamics of the fMRI signal and its impact on decoding, something which is not possible with classical decoding methods. (In classical fMRI analyses, p-values indicate the relevance of brain voxels. However, these p-values are usually obtained on a subject or group level, not for single trials or single time points.)

Many other studies use explanation methods to analyse time series signals in the sciences. For instance, [60] introduce interpretable machine learning methods to the domain of human gait recognition and show that non-linear learning models are not only the better predictors but that they can at the same time learn physiologically meaningful features for subject prediction which align with expected features used by linear models. Another work [78] applies Long Short-Term Memory (LSTM) networks to the field of hydrology to predict the river discharge from meteorological observations. The authors apply the integrated gradients technique to analyse the internals of the network and obtain insights which are consistent with our understanding of the hydrological system.

Structured data such as molecules or gene sequences are another very important domain for scientific investigations. Therefore, interpretable and non-linear ML methods have also attracted attention in scientific communities working with this type of data. One successful example of the use of interpretable ML methods in this domain has been reported in [114]. The authors train a deep model to predict molecular properties and bioactivities and report interesting insights when analysing what the model has learned (see Fig. 18 (d)). For instance, they show that single neurons play the role of pharmacophore detectors and demonstrate that the model uses pharmacophore-like features to reach its conclusions, which are consistent with the pharmacological literature. Another work [55] (see Figure 18 (e)) applies an extended version of LRP called CLRP to visualize how CNNs interpret individual protein-ligand complexes in molecular modeling. Here too the trained model learns meaningful features and has the ability to provide new insights into the mechanisms underlying protein-ligand interactions. Yet another work [155] applies LSTM predictors together with LRP for transparent therapy prediction on patients suffering from metastatic breast cancer. Clinical experts verify that the features used for prediction as revealed via LRP largely agree with established clinical guidelines and knowledge. The work by [69] uses interpretable ML to understand the activity prediction across chromosomes, whereas [27] uses these methods for understanding automated decisions on behavioral biometrics. Recently, the physics community has also started to use interpretable machine learning for the task of energy prediction. The work of [127, 128] showed that accurate predictions are possible and also obtained physically meaningful insights from the model. Other works [91] showed that explanations in gene analysis lead to interpretable patterns consistent with literature knowledge.

IX Challenges and Outlook

While recent years have seen astonishing conceptual and technical progress in XAI, it is important to carefully discuss the current limits and the challenges that will need to be addressed by researchers to further establish the field and increase the usefulness of XAI systems.

Foundational theoretical work in XAI has so far been limited. As discussed above in Section V, early works have established Taylor expansions and Deep Taylor Decomposition [99] as principled frameworks for describing the process of explanation. Other frameworks such as Shapley values [131, 92] or rate distortion theory [94] have also emerged as ways of formalizing the task of explanation. Numerous theoretical questions however remain. For example, it remains unclear how to weigh the model and the data distribution into the explanation, in particular, whether an explanation should be based on any features the model locally reacts to, or only those that are expressed locally. Related to this question is that of causality: assuming a causal link between two input variables, it has not yet been answered whether both variables, or only the source variable, must constitute the explanation. A deeper formalization and theoretical understanding of XAI will be instrumental for shedding light on these important questions.

Another central question in XAI is that of optimality of an explanation. So far, there is no agreed-upon understanding of what an optimal explanation should be. Also, ground-truth explanations cannot be collected by humans as this would presuppose they are able to make sense of the complex ML model they would like to explain in the first place. Methods such as ‘pixel-flipping’ [122] assess explanation quality indirectly by testing how flipping relevant pixels affects the output score. The ‘axiomatic approach’ [142, 97] does not have this indirect step; however, axioms are usually too generic to evaluate an explanation comprehensively. The question of evaluating and comparing explanations becomes even more complex when integrating human factors such as interpretability, manageability, and overall utility of the XAI system [116, 105]. Application-driven evaluations account for those factors; however, they are also hard to implement in practice [34].
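The pixel-flipping idea can be sketched as follows. This is an illustrative toy version: "flipping" is taken to mean replacing a feature by a fixed baseline value, the model is a linear function standing in for a network, and the function name `pixel_flipping_curve` is hypothetical. A faster drop of the output score indicates a ranking that better identifies the features the model relies on.

```python
import numpy as np

def pixel_flipping_curve(f, x, relevance, baseline=0.0):
    """Model output after flipping 0, 1, ..., d features in order of decreasing relevance."""
    order = np.argsort(-relevance.ravel())
    flipped = x.astype(float).ravel().copy()
    scores = [f(flipped.reshape(x.shape))]
    for idx in order:
        flipped[idx] = baseline                 # assumed flipping operation
        scores.append(f(flipped.reshape(x.shape)))
    return np.array(scores)

w = np.array([3.0, -1.0, 2.0, 0.5])
f = lambda x: float(w @ x)                      # toy linear model standing in for a network
x = np.ones(4)
good = pixel_flipping_curve(f, x, relevance=w * x)       # exact attribution for a linear f
poor = pixel_flipping_curve(f, x, relevance=-(w * x))    # deliberately reversed ranking
print(good.mean(), poor.mean())                 # lower mean score = faster decay
```

Comparing the mean (or area under) the two curves gives a single number per explanation method, which is how such curves are typically summarised.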

Further challenges arise when applying XAI to problems where different actors (e.g. the explainer and the explainee) have conflicting interests. Recent work has shown that an ‘adversary’ can modify the ML model in an imperceptible fashion so that the prediction behavior remains intact but the explanation of those predictions changes drastically [53]. Relatedly, even when the model remains unchanged, inputs could be perturbed imperceptibly to produce arbitrary explanations [32]. Interpretability may also find itself at odds with the constant quest for higher predictive accuracy. Because highly predictive models are typically complex and strongly engineered, XAI software must keep up with this ever increasing complexity [3], and at the same time, the human must also deal with explanations of increasingly subtle predictions. When designing new XAI-driven applications, adopting a holistic view that sets the right tradeoffs and delivers the optimal amount of information and range of action to the multiple and potentially conflicting actors will constitute an important practical challenge.

Another question of utmost importance, especially in safety-critical domains, is whether we can fully trust the model after having explained some predictions. Here, we need to distinguish between model interpretation and model certification: While it is helpful to explain models for available input data, e.g. interpretable ML can detect erroneous decision strategies, certification would require verifying the model for all possible inputs, not only those included in the data. Furthermore, it must be remembered that explanations returned to the user are summaries of a potentially complex decision process, i.e. there may be different decision strategies, the wrong ones and the correct ones, mapping to the same explanation. Lastly, explanations are subject to their own biases and approximations, and they can be manipulated by an adversary to lose their informative content. Therefore, in order to ultimately establish a truly safe and trustworthy model, further steps are needed, potentially including the use of formal verification methods [19, 66].

Finally, it may be worthwhile to explore new forms of explanations that are optimally suited to their user. Such explanations could for example leverage the user’s prior knowledge or personal preferences. Novel approaches from knowledge engineering, cognitive sciences, and human-computer interfaces will need to contribute. Also, while heatmaps provide a first intuition to users, they may not take advantage of the complex abstract reasoning capabilities of humans. An example would be to replace heatmaps by ‘mathematical formulas’ explaining the ML decision behavior. For example, the local extraction of polynomials or other interaction models would enable higher order explanations, specifically the automatic grouping of variables that jointly, and in nonlinear combination, constitute an explanation [148, 28]. In the neurosciences, von der Malsburg has coined the concept of ‘binding’, neural strategies that allow sets of variables (neurons) to synchronize collectively by learning [152]. In physics, collective variables have so far been conceptualized manually, giving rise to groundbreaking advances in solid state physics by defining quasiparticles such as phonons, plasmons, polarons, magnons, excitons [73], etc. Ideally, collective variables in this sense would in the future be inferred from a learning model by, e.g., automatically binding explanation variables in meaningful abstract ways.

X Conclusion

Complex nonlinear ML models such as neural networks or kernel machines have become game changers in the sciences and industry. Fast progress in the field of explainable AI has made virtually any of these complex models, supervised or unsupervised, interpretable to the user. Consequently, we no longer need to give up predictivity in favor of interpretability, and we can take full advantage of strong nonlinear machine learning in practical applications.

In this review we have made the attempt to provide a systematic path to bring XAI to the attention of an interested readership. This included an introduction to the technical foundations of XAI, a presentation of practical algorithms such as Occlusion, Integrated Gradients and LRP, concrete examples illustrating how to use explanation techniques in practice, and a discussion of successful applications. We would like to stress that the techniques introduced in this paper can be readily and broadly applied to the workhorses of supervised and unsupervised learning, e.g. clustering, anomaly detection, kernel machines, deep networks, as well as state-of-the-art pretrained convolutional networks and LSTMs.

XAI techniques not only shed light on the inner workings of non-linear learning machines, explaining why they arrive at their successful predictions; they also help to discover biases and quality issues in large data corpora with millions of examples [6]. This is an increasingly relevant direction since modern machine learning relies more and more on reference datasets and reference pretrained models. Furthermore, initial steps have been taken to use XAI beyond validation to arrive at better and more predictive models, e.g. [119, 7, 8, 6].

We would like to stress the importance of XAI, notably in safety critical operations such as medical assistance or diagnosis, where the highest level of transparency is required in order to avoid fatal outcomes.

Finally, as a versatile tool in the sciences, XAI has made it possible to gain novel insights (e.g. [127, 20, 55, 145, 36, 123, 112]), ultimately contributing to furthering our scientific knowledge.

While XAI has seen an almost exponential rise in interest (and progress) with communities forming and many workshops emerging, there is a wealth of open problems and challenges with ample opportunities to contribute (see Section IX). Concluding, we firmly believe that XAI will in the future become an indispensable practical ingredient to obtain improved, transparent, safe, fair and unbiased learning models.


This research was supported in part by the Institute for Information & Communications Technology Promotion and funded by the Korea government (MSIT) (No. 2017-0-01779), and was partly supported by the German Ministry for Education and Research (BMBF) under Grants 01IS14013A-E, 01GQ1115, 01GQ0850, 01IS18025A and 01IS18037A; the German Research Foundation (DFG) under Grant Math+, EXC 2046/1, Project ID 390685689. Correspondence to WS, GM, KRM.

Appendix A Implementing Smooth Integrated Gradients

In this appendix, we give the algorithm combining SmoothGrad [137] and Integrated Gradients [142], which we use in Section IV in our comparison of explanation methods. Its implementation is shown in Algorithm 5.

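  for each smoothing sample do
     draw a random starting point near the origin
     for each integration step do
        accumulate the gradient at the current point on the path to the input, multiplied by (input − starting point)
     end for
  end for
  return the accumulated attributions averaged over all samples and steps
Algorithm 5 Integrated Gradients with Smoothing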

The procedure consists of a simple nested loop of smoothing and integration steps, where each integration starts at some random location near the origin. Here, we note that these locations are not strict root points. However, in the context of image data, random noise does not significantly change the evidence in favor of or against a particular class. Thus, the explanation remains approximately complete.
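The nested loop described above can be sketched in NumPy. The noise scale `sigma`, the number of samples and steps, the midpoint integration rule, and the toy function with its analytic gradient are assumptions for illustration; they are not taken from Algorithm 5.

```python
import numpy as np

def smooth_integrated_gradients(grad_f, x, n_samples=20, n_steps=50, sigma=0.01, seed=0):
    """Integrated Gradients averaged over random starting points near the origin."""
    rng = np.random.default_rng(seed)
    attribution = np.zeros_like(x, dtype=float)
    for _ in range(n_samples):                               # smoothing loop
        start = sigma * rng.standard_normal(x.shape)         # random point near the origin
        for t in range(n_steps):                             # integration loop
            point = start + (t + 0.5) / n_steps * (x - start)    # midpoint rule on the path
            attribution += grad_f(point) * (x - start)
    return attribution / (n_samples * n_steps)

f = lambda x: float(np.sum(x ** 2))        # toy function with f(0) = 0
grad_f = lambda x: 2.0 * x                 # its analytic gradient
x = np.array([1.0, -2.0, 0.5])
attribution = smooth_integrated_gradients(grad_f, x)
print(attribution.sum(), f(x))             # nearly equal: approximate completeness
```

Because the starting points are only close to a root point, the attributions sum to f(x) up to a small residual of the order of the noise variance, which illustrates the "approximately complete" property mentioned above.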

Appendix B Implementing Layer-wise Relevance Propagation

In this appendix, we outline two possible implementations of LRP [13, 98]. The first one is intuitive and based on looping forward and backward over the multiple layers of the neural network. This procedure can be applied to simple sequential structures such as VGG-16 [136]. The second approach we present is based on ‘forward hooks’ and serves to extend the LRP method to more complex architectures such as ResNet [52].

B-A Standard Sequential Implementation

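The standard implementation is based on the forward-backward procedure outlined in Algorithm 2. We focus here on the relprop function of this procedure, which is called at each layer to propagate relevance to the layer below. We give an implementation for the LRP-$0/\epsilon/\gamma$ rules [13, 98] and one for the $z^\mathcal{B}$-rule [99]. The first three rules can be seen as special cases of the more general rule

$$R_j = \sum_k \frac{a_j \, \rho(w_{jk})}{\epsilon + \sum_{0,j} a_j \, \rho(w_{jk})} R_k$$

where $\rho(w_{jk}) = w_{jk} + \gamma w_{jk}^+$. This propagation rule can be computed in four steps.

   $z \leftarrow \epsilon + \text{forward}_\rho(a)$   (Step 1)
   $s \leftarrow R / z$   (Step 2)
   $c \leftarrow \nabla_a \big( z^\top [s]_{\text{cst}} \big)$   (Step 3)
   $R \leftarrow a \odot c$   (Step 4)
Algorithm 6 LRP-$0/\epsilon/\gamma$

The first step applies $\text{forward}_\rho$, a forward evaluation of a copy of the layer whose parameters have gone through some function $\rho$, and also adds a small positive term $\epsilon$. The third step is conveniently expressed as a gradient of some dot product w.r.t. the input activations. The notation $[\cdot]_{\text{cst}}$ indicates that the term has been detached from the gradient computation and is therefore treated as a constant. In PyTorch, for example, this can be achieved by calling (·).data. The relprop function implemented by Algorithm 6 is applicable for most linear and convolution layers of a deep rectifier network. For the pixel-layer, we use instead the $z^\mathcal{B}$-rule [99, 98]:

$$R_i = \sum_j \frac{x_i w_{ij} - l_i w_{ij}^+ - h_i w_{ij}^-}{\sum_i x_i w_{ij} - l_i w_{ij}^+ - h_i w_{ij}^-} R_j$$

where $l_i$ and $h_i$ are the lowest/highest possible pixel values of $x_i$. The corresponding implementation is shown in Algorithm 7 and again consists of four steps:

   $z \leftarrow \text{forward}(x) - \text{forward}^+(l) - \text{forward}^-(h)$   (Step 1)
   $s \leftarrow R / z$   (Step 2)
   $(c, c_l, c_h) \leftarrow \nabla_{(x, l, h)} \big( z^\top [s]_{\text{cst}} \big)$   (Step 3)
   $R \leftarrow x \odot c + l \odot c_l + h \odot c_h$   (Step 4)
Algorithm 7 $z^\mathcal{B}$-rule

The functions $\text{forward}^+$ and $\text{forward}^-$ are forward passes on copies of the first layer whose parameters have been processed by the functions $\max(0, \cdot)$ and $\min(0, \cdot)$ respectively.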
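The four steps of the two relprop variants can be illustrated for fully connected layers with a short NumPy sketch. Several simplifying assumptions are made: biases are omitted, the gradient in Step 3 is written in closed form (for a dense layer it is a matrix-vector product, so no automatic differentiation or detaching is needed), and the weights, inputs, pixel bounds and the value of $\gamma$ are arbitrary toy values. The function names are hypothetical.

```python
import numpy as np

def relprop(a, W, R, rho=lambda w: w, eps=1e-9):
    """LRP-0/eps/gamma for a bias-free dense layer; a: (d_in,), W: (d_in, d_out), R: (d_out,)."""
    z = eps + a @ rho(W)          # Step 1: forward pass through the modified layer
    s = R / z                     # Step 2: element-wise division
    c = rho(W) @ s                # Step 3: gradient of (z . s) w.r.t. a, with s held constant
    return a * c                  # Step 4: element-wise product

def relprop_zb(x, W, R, low, high):
    """z^B-rule for a bias-free first layer with pixel bounds low <= x <= high."""
    W_pos, W_neg = np.maximum(0, W), np.minimum(0, W)
    z = x @ W - low @ W_pos - high @ W_neg + 1e-9             # Step 1
    s = R / z                                                 # Step 2
    c, c_low, c_high = W @ s, -(W_pos @ s), -(W_neg @ s)      # Step 3
    return x * c + low * c_low + high * c_high                # Step 4

# Toy two-layer ReLU network (weights chosen by hand for illustration).
W1 = np.array([[1.0, -0.5, 0.3], [0.5, 1.0, -0.2], [-1.0, 0.4, 0.8], [0.2, -0.3, 0.6]])
W2 = np.array([[1.0, -1.0], [0.5, 0.2], [-0.3, 0.7]])
x = np.array([0.2, 0.8, 0.5, 0.1])
low, high = np.zeros(4), np.ones(4)

a1 = np.maximum(0, x @ W1)        # hidden activations
out = a1 @ W2                     # output scores
R_out = np.array([out[0], 0.0])   # explain the first output neuron

gamma_rule = lambda w: w + 0.25 * np.maximum(0, w)   # LRP-gamma with gamma = 0.25
R_hidden = relprop(a1, W2, R_out, rho=gamma_rule)
R_input = relprop_zb(x, W1, R_hidden, low, high)
print(R_input, R_input.sum(), out[0])
```

In this bias-free setting the relevance is conserved from layer to layer, so the input relevances sum to the explained output score up to the small stabilising term.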

B-B Forward-Hook Implementation

When the architecture has non-sequential components (e.g. ResNet [52]), it is more convenient to reuse the graph traversing procedures readily implemented by the model’s existing forward pass and the automatically generated gradient propagation pass. To achieve this, we can implement ‘forward hooks’ at each linear and convolution layer. In this case, we leverage the ‘smooth gradient’ view of LRP (cf. Eq. (3)) and modify the implementation of the forward pass in a way that keeps the forward pass functionally equivalent but modifies the local gradient computation. This is achieved by strategically detaching terms from the gradient in a way that calling the gradient becomes equivalent to computing Eq. (3) at each layer. Once the forward functions have been redefined at each layer, the explanation can be computed globally by calling the gradient of the whole function as shown in Algorithm 8. (Note that unlike the original function, the new function that includes the hooks receives three arguments as input: the data point $x$, and the bounds $l$ and $h$ used by the first layer.)

Algorithm 8: LRP implementation based on forward hooks, in three parts: a forward hook for intermediate layers, a forward hook for the first layer ($z^{\mathcal{B}}$-rule), and the global LRP computation.

The forward-hook implementation produces exactly the same output as the original function $f$, but its ‘gradient’ is no longer the same due to the detached terms. As a result, calling the gradient of this function and recombining it with the input yields the same LRP explanation as the standard LRP implementation, while being applicable to a broader set of neural network architectures.
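The detaching trick can be sketched in PyTorch with `register_forward_hook`, whose hooks may return a replacement output. The sketch below is only an illustration under simplifying assumptions: bias-free `nn.Linear` layers, `.detach()` in place of `.data`, hypothetical helper names (`lrp_hook`, `zb_hook`), toy weights, an assumed $\rho(w) = w + 0.25\,\max(0, w)$, and first-layer bounds `l` and `h` passed through a closure rather than as extra function arguments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def lrp_hook(gamma=0.25, eps=1e-9):
    """Hook for intermediate layers: same output, modified local gradient."""
    def hook(layer, inputs, z):
        w = layer.weight + gamma * layer.weight.clamp(min=0)   # assumed choice of rho
        zp = eps + F.linear(inputs[0], w)
        return zp * (z / zp).detach()   # equals z numerically; gradient flows through zp only
    return hook

def zb_hook(l, h, eps=1e-9):
    """Hook for the first layer (z^B-rule) with pixel bounds l and h."""
    def hook(layer, inputs, z):
        wp, wm = layer.weight.clamp(min=0), layer.weight.clamp(max=0)
        zb = eps + F.linear(inputs[0], layer.weight) - F.linear(l, wp) - F.linear(h, wm)
        return zb * (z / zb).detach()
    return hook

# Toy network (hypothetical weights): 4 pixels in [0, 1] -> 3 ReLU units -> 2 scores.
model = nn.Sequential(nn.Linear(4, 3, bias=False), nn.ReLU(), nn.Linear(3, 2, bias=False))
with torch.no_grad():
    model[0].weight.copy_(torch.tensor(
        [[1.0, -0.5, 1.0, 0.5], [-1.0, 1.0, 0.5, -0.5], [0.5, -1.0, -1.0, 1.0]]))
    model[2].weight.copy_(torch.tensor([[1.0, 0.5, -1.0], [-1.0, 1.0, 0.5]]))

x = torch.tensor([0.2, 0.8, 0.5, 0.1], requires_grad=True)
l = torch.zeros(4, requires_grad=True)
h = torch.ones(4, requires_grad=True)
model[0].register_forward_hook(zb_hook(l, h))
model[2].register_forward_hook(lrp_hook())

out = model(x)                              # forward pass is functionally unchanged
out[out.argmax()].backward()                # one global gradient call
R = x * x.grad + l * l.grad + h * h.grad    # recombine gradients with the inputs
print(out.detach(), R.detach(), R.sum().item())
```

In this bias-free toy setting the scores in `R` sum to the explained output (approximately 0.75), and the same hooks could in principle be registered on the linear and convolution layers of a non-sequential model without writing a custom backward traversal.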

Appendix C Explanation Software

The attention to interpretability in machine learning has grown rapidly over the past decade, alongside research on and development of computationally efficient deep learning frameworks. This attention in turn created a strong demand for accessible and efficient software solutions that make XAI applicable out of the box. In this section we briefly highlight a collection of software toolboxes released in recent years, which provide convenient access to a wide range of XAI methods and support various computational backends. An overview of the presented software solutions is given in Table I, alongside a glossary of methods with the respective abbreviations used throughout our review in Table II.

| Software Package | Release | Available from | Compute Backend | GPU Support | Methods |
|---|---|---|---|---|---|
| LRP Toolbox [83] | 2016 | sebastian-lapuschkin/lrp_toolbox | Caffe | no (CPU only) | DCN, DTD, GB, LRP, SA |
| | | | numpy/cupy | yes (cupy) | LRP, SA |
| | | | Matlab | | LRP, SA |
| DeepExplain [4] | 2017 | marcoancona/DeepExplain | Keras+TensorFlow | yes | DLR, IG, LRP- |
| iNNvestigate [3] | 2019 | albermax/innvestigate | Keras+TensorFlow | yes | DCN, DL, DTD, GB, …, Perturbation Analysis |
| TorchRay [37] | 2019 | facebookresearch/TorchRay | PyTorch | yes | DCN, EB, EP, various benchmarks |
| Captum [76] | 2019 (beta) | pytorch/captum | PyTorch | yes | DCN, DLR, DLSHAP, … |

TABLE I: Interpretability software packages by time of release
| Method | Abbrv. |
|---|---|
| ApproShapley (Shapley Value Sampling) [25] | AS |
| Class Activation Mapping [163] | CAM |
| Contextual Prediction Difference Analysis [46] | CPDA |
| DeconvNet [159] | DCN |
| DeepLIFT [132] | DL |
| DeepLIFT (Rescale) [132] | DLR |
| DeepLIFT SHAP [92] | DLSHAP |
| Deep Taylor Decomposition [99] | DTD |
| ExcitationBackprop [160] | EB |
| ExtremalPerturbation [37] | EP |
| GradCAM [129] | GC |
| Gradient SHAP [92] | GSHAP |
| Gradient × Input [132] | GI |
| GuidedBackprop [139] | GB |
| Guided GradCam [129] | GGC |
| Integrated Gradients [142] | IG |
| Internal Influence [88] | IG |
| LayerConductance [134] | LC |
| Layer-wise Relevance Propagation (full) [13] | LRP |
| LRP (composite strategy) [84, 98] | LRP-CMP |
| LRP (specific variants) [13, 98] | LRP- |
| Local Interpretable Model-agnostic Explanations [116] | LIME |
| Meaningful Perturbation [38] | MP |
| NeuronConductance [30] | NC |
| NeuronGuidedBackprop [139] | NGB |
| NeuronIntegratedGradients [134] | NIG |
| Occlusion Analysis [159] | OCC |
| PatternAttribution [72] | PA |
| PatternNet [72] | PN |
| Prediction Difference Analysis [165] | PDA |
| Randomized Input Sampling for Explanation [113] | RISE |
| Saliency Analysis / Gradient [14, 135] | SA |
| SHapley Additive exPlanations [92] | SHAP |
| SmoothGrad [137] | SG |
| SmoothGrad-Squared [59] | SG-SQ |
| Spectral Relevance Analysis [85] | SpRAy |
| VarGrad [1] | VG |
| Testing with Concept Activation Vectors [71] | TCAV |
| TotalConductance [30] | TC |
TABLE II: Glossary of interpretability methods with abbreviations referenced throughout our review

One of the earliest comprehensive XAI software packages is the LRP Toolbox [83], which provides up-to-date implementations of LRP for the Caffe deep learning framework [65], popular until very recently, as well as for Matlab and Python via custom neural network interfaces. While support for Caffe is restricted to the C++ programming language and thus to CPU hardware, it provides functionality implementing DCN, GB, DTD, and SA and can be built and used as a stand-alone executable binary for predictors based on the Caffe neural network format. The sub-packages available for Matlab and Python provide out-of-the-box support for LRP and SA, while being easily extensible via custom neural network modules written with clarity and the methods’ intelligibility in mind. The cupy [109] backend constitutes an alternative to the CPU-bound numpy [110] package, providing optional support for modern GPU hardware from NVIDIA.

Both the DeepExplain [4] and iNNvestigate [3] toolboxes are built on top of the popular Keras [26] package for Python with the TensorFlow backend for explaining Deep Neural Network models, and thus provide support for both CPU and GPU hardware and convenient access for users of Keras models. While the more recent iNNvestigate toolbox implements a superset of the modified backpropagation methods available in DeepExplain, the latter also offers functionality for perturbation-based attribution methods, i.e. the Occlusion method [159] and Shapley Value Sampling [25]. For explaining a model’s prediction, DeepExplain allows for an ad-hoc selection of the explanation method via Pythonic context managers. The iNNvestigate package, on the other hand, operates by attaching and automatically configuring (several) modified backward graphs called “analyzers” to a model of interest, one per XAI method to compute attributions with.

A present trend in the machine learning community is a migration to the PyTorch framework with its eager execution paradigm, away from other backends. Both the TorchRay [37] and Captum [76] packages for Python and PyTorch enable the use of interpretability methods for neural network models defined in the context of PyTorch’s high-level neural network description modules. Captum can be understood as a rich selection of XAI methods based on modified backprop and is part of the PyTorch project itself. While not as extensive as Captum, the TorchRay package offers a series of benchmarks for XAI alongside its selection of (benchmarked) interpretability methods.

References


  • [1] J. Adebayo, J. Gilmer, I. J. Goodfellow, and B. Kim. Local explanation methods for deep neural networks lack sensitivity to parameter values. In International Conference on Learning Representations (ICLR), 2018.
  • [2] C. Agarwal, D. Schonfeld, and A. Nguyen. Removing input features via a generative model to explain their attributions to classifier’s decisions. CoRR, abs/1910.04256, 2019.
  • [3] M. Alber, S. Lapuschkin, P. Seegerer, M. Hägele, K. T. Schütt, G. Montavon, W. Samek, K.-R. Müller, S. Dähne, and P.-J. Kindermans. iNNvestigate neural networks! Journal of Machine Learning Research, 20:93:1–93:8, 2019.
  • [4] M. Ancona, E. Ceolini, C. Öztireli, and M. Gross. Towards better understanding of gradient-based attribution methods for deep neural networks. CoRR, abs/1711.06104, 2017.
  • [5] M. Ancona, E. Ceolini, C. Öztireli, and M. Gross. Towards better understanding of gradient-based attribution methods for deep neural networks. In International Conference of Learning Representations (ICLR), 2018.
  • [6] C. J. Anders, T. Marinč, D. Neumann, W. Samek, K.-R. Müller, and S. Lapuschkin. Analyzing imagenet with spectral relevance analysis: Towards imagenet un-hans’ed. CoRR, abs/1912.11425, 2019.
  • [7] C. J. Anders, G. Montavon, W. Samek, and K.-R. Müller. Understanding patch-based learning of video data by explaining predictions. In Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, pages 297–309. 2019.
  • [8] J. A. Arjona-Medina, M. Gillhofer, M. Widrich, T. Unterthiner, J. Brandstetter, and S. Hochreiter. Rudder: Return decomposition for delayed rewards. In Advances in Neural Information Processing Systems, pages 13544–13555, 2019.
  • [9] L. Arras, J. A. Arjona-Medina, M. Widrich, G. Montavon, M. Gillhofer, K.-R. Müller, S. Hochreiter, and W. Samek. Explaining and interpreting lstms. In Explainable AI, volume 11700 of Lecture Notes in Computer Science, pages 211–238. Springer, 2019.
  • [10] L. Arras, F. Horn, G. Montavon, K.-R. Müller, and W. Samek. “What is relevant in a text document?”: An interpretable machine learning approach. PLoS ONE, 12(8):e0181142, 2017.
  • [11] L. Arras, G. Montavon, K.-R. Müller, and W. Samek. Explaining recurrent neural network predictions in sentiment analysis. In Proceedings of the EMNLP’17 Workshop on Computational Approaches to Subjectivity, Sentiment & Social Media Analysis (WASSA), pages 159–168, 2017.
  • [12] A. B. Arrieta, N. Díaz-Rodríguez, J. D. Ser, A. Bennetot, S. Tabik, A. Barbado, S. Garcia, S. Gil-Lopez, D. Molina, R. Benjamins, R. Chatila, and F. Herrera. Explainable artificial intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion, 58:82–115, June 2020.
  • [13] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE, 10(7):e0130140, 2015.
  • [14] D. Baehrens, T. Schroeter, S. Harmeling, M. Kawanabe, K. Hansen, and K.-R. Müller. How to explain individual classification decisions. Journal of Machine Learning Research, 11:1803–1831, 2010.
  • [15] D. Balduzzi, M. Frean, L. Leary, J. P. Lewis, K. W. Ma, and B. McWilliams. The shattered gradients problem: If resnets are the answer, then what is the question? In ICML, volume 70 of Proceedings of Machine Learning Research, pages 342–350. PMLR, 2017.
  • [16] D. Bau, J.-Y. Zhu, H. Strobelt, B. Zhou, J. B. Tenenbaum, W. T. Freeman, and A. Torralba. Visualizing and understanding generative adversarial networks. arXiv preprint arXiv:1901.09887, 2019.
  • [17] S. Bazen and X. Joutard. The Taylor decomposition: A unified generalization of the Oaxaca method to nonlinear models. Working papers, HAL, 2013.
  • [18] S. Becker, M. Ackermann, S. Lapuschkin, K.-R. Müller, and W. Samek. Interpreting and explaining deep neural networks for classification of audio signals. CoRR, abs/1807.03418, 2018.
  • [19] F. Berkenkamp, M. Turchetta, A. Schoellig, and A. Krause. Safe model-based reinforcement learning with stability guarantees. In Advances in neural information processing systems, pages 908–918, 2017.
  • [20] A. Binder, M. Bockmayr, M. Hägele, S. Wienert, D. Heim, K. Hellweg, A. Stenzinger, L. Parlow, J. Budczies, B. Goeppert, D. Treue, M. Kotani, M. Ishii, M. Dietel, A. Hocke, C. Denkert, K.-R. Müller, and F. Klauschen. Towards computational fluorescence microscopy: Machine learning-based integrated prediction of morphological and molecular tumor profiles. CoRR, abs/1805.11178, 2018.
  • [21] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, Inc., USA, 1996.
  • [22] B. Blankertz, R. Tomioka, S. Lemm, M. Kawanabe, and K.-R. Müller. Optimizing spatial filters for robust eeg single-trial analysis. IEEE Signal processing magazine, 25(1):41–56, 2008.
  • [23] N. Carlini and D. Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pages 39–57. IEEE, 2017.
  • [24] R. Caruana, Y. Lou, J. Gehrke, P. Koch, M. Sturm, and N. Elhadad. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1721–1730, 2015.
  • [25] J. Castro, D. Gómez, and J. Tejada. Polynomial calculation of the shapley value based on sampling. Computers & Operations Research, 36(5):1726–1730, 2009.
  • [26] F. Chollet et al. Keras, 2015.
  • [27] P. Chong, Y. X. M. Tan, J. Guarnizo, Y. Elovici, and A. Binder. Mouse authentication without the temporal aspect – what does a 2d-cnn learn? In 2018 IEEE Security and Privacy Workshops (SPW), pages 15–21. IEEE, 2018.
  • [28] T. Cui, P. Marttinen, and S. Kaski. Recovering pairwise interactions using neural networks. CoRR, abs/1901.08361, 2019.
  • [29] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2009.
  • [30] K. Dhamdhere, M. Sundararajan, and Q. Yan. How important is a neuron? CoRR, abs/1805.12233, 2018.
  • [31] I. S. Dhillon, Y. Guan, and B. Kulis. Kernel k-means: spectral clustering and normalized cuts. In KDD, pages 551–556. ACM, 2004.
  • [32] A. Dombrowski, M. Alber, C. J. Anders, M. Ackermann, K.-R. Müller, and P. Kessel. Explanations can be manipulated and geometry is to blame. In NeurIPS, pages 13567–13578, 2019.
  • [33] G. Dornhege, J. d. R. Millán, T. Hinterberger, D. McFarland, K.-R. Müller, et al. Toward brain-computer interfacing, volume 63. MIT press Cambridge, MA, 2007.
  • [34] F. Doshi-Velez and B. Kim. A roadmap for a rigorous science of interpretability. CoRR, abs/1702.08608, 2017.
  • [35] E. Eidinger, R. Enbar, and T. Hassner. Age and gender estimation of unfiltered faces. IEEE Transactions on Information Forensics and Security, 9(12):2170–2179, 2014.
  • [36] H. J. Escalante, S. Escalera, I. Guyon, X. Baró, Y. Güçlütürk, U. Güçlü, and M. van Gerven. Explainable and interpretable models in computer vision and machine learning. Springer, 2018.
  • [37] R. Fong, M. Patrick, and A. Vedaldi. Understanding deep networks via extremal perturbations and smooth masks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2950–2958, 2019.
  • [38] R. C. Fong and A. Vedaldi. Interpretable explanations of black boxes by meaningful perturbation. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 3449–3457, 2017.
  • [39] W. Gale, L. Oakden-Rayner, G. Carneiro, L. J. Palmer, and A. P. Bradley. Producing radiologist-quality reports for interpretable deep learning. In 16th IEEE International Symposium on Biomedical Imaging, ISBI 2019, Venice, Italy, April 8-11, 2019, pages 1275–1279, 2019.
  • [40] J. Gama, I. Zliobaite, A. Bifet, M. Pechenizkiy, and A. Bouchachia. A survey on concept drift adaptation. ACM Comput. Surv., 46(4):44:1–44:37, 2014.
  • [41] S. Ghosal, D. Blystone, A. K. Singh, B. Ganapathysubramanian, A. Singh, and S. Sarkar. An explainable deep machine vision framework for plant stress phenotyping. Proceedings of the National Academy of Sciences, 115(18):4613–4618, 2018.
  • [42] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In AISTATS, volume 15 of JMLR Proceedings, pages 315–323, 2011.
  • [43] I. Goodfellow, Y. Bengio, and A. Courville. Deep learning. MIT press, 2016.
  • [44] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In ICLR (Poster), 2015.
  • [45] B. Goodman and S. R. Flaxman. European union regulations on algorithmic decision-making and a “right to explanation”. AI Magazine, 38(3):50–57, 2017.
  • [46] J. Gu and V. Tresp. Contextual prediction difference analysis. CoRR, abs/1910.09086, 2019.
  • [47] J. Gu, Y. Yang, and V. Tresp. Understanding individual decisions of cnns via contrastive backpropagation. In ACCV (3), volume 11363 of Lecture Notes in Computer Science, pages 119–134. Springer, 2018.
  • [48] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, and D. Pedreschi. A survey of methods for explaining black box models. ACM Comput. Surv., 51(5):93:1–93:42, 2019.
  • [49] A. Gupta, J. Johnson, L. Fei-Fei, S. Savarese, and A. Alahi. Social GAN: socially acceptable trajectories with generative adversarial networks. In CVPR, pages 2255–2264. IEEE Computer Society, 2018.
  • [50] M. Hägele, P. Seegerer, S. Lapuschkin, M. Bockmayr, W. Samek, F. Klauschen, K.-R. Müller, and A. Binder. Resolving challenges in deep learning-based analyses of histopathological images using explanation methods. CoRR, abs/1908.06943, 2019.
  • [51] S. Haufe, F. C. Meinecke, K. Görgen, S. Dähne, J. Haynes, B. Blankertz, and F. Bießmann. On the interpretation of weight vectors of linear models in multivariate neuroimaging. NeuroImage, 87:96–110, 2014.
  • [52] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778. IEEE Computer Society, 2016.
  • [53] J. Heo, S. Joo, and T. Moon. Fooling neural network interpretations via adversarial model manipulation. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pages 2921–2932, 2019.
  • [54] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • [55] J. Hochuli, A. Helbling, T. Skaist, M. Ragoza, and D. R. Koes. Visualizing convolutional neural network protein-ligand scoring. Journal of Molecular Graphics and Modelling, 2018.
  • [56] A. Holzinger. From machine learning to explainable ai. In 2018 World Symposium on Digital Intelligence for Systems and Machines (DISA), pages 55–66, 2018.
  • [57] A. Holzinger, G. Langs, H. Denk, K. Zatloukal, and H. Müller. Causability and explainability of artificial intelligence in medicine. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 9(4):e1312, 2019.
  • [58] S. Hong, D. Yang, J. Choi, and H. Lee. Interpretable Text-to-Image Synthesis with Hierarchical Semantic Layout Generation, pages 77–95. Springer International Publishing, Cham, 2019.
  • [59] S. Hooker, D. Erhan, P.-J. Kindermans, and B. Kim. A benchmark for interpretability methods in deep neural networks. In Advances in Neural Information Processing Systems, pages 9734–9745, 2019.
  • [60] F. Horst, S. Lapuschkin, W. Samek, K.-R. Müller, and W. I. Schöllhorn. Explaining the unique nature of individual gait patterns with deep learning. Scientific Reports, (9):2391, 2019.
  • [61] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007.
  • [62] F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <1mb model size. CoRR, abs/1602.07360, 2016.
  • [63] B. K. Iwana, R. Kuroki, and S. Uchida. Explaining convolutional neural networks using softmax gradient layer-wise relevance propagation. CoRR, abs/1908.04351, 2019.
  • [64] M. H. Jarrahi. Artificial intelligence and the future of work: Human-AI symbiosis in organizational decision making. Business Horizons, 61(4):577–586, July 2018.
  • [65] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia, pages 675–678, 2014.
  • [66] G. Katz, C. W. Barrett, D. L. Dill, K. Julian, and M. J. Kochenderfer. Reluplex: An efficient SMT solver for verifying deep neural networks. In CAV (1), volume 10426 of Lecture Notes in Computer Science, pages 97–117. Springer, 2017.
  • [67] J. Kauffmann, M. Esders, G. Montavon, W. Samek, and K.-R. Müller. From clustering to cluster explanations via neural networks. CoRR, abs/1906.07633, 2019.
  • [68] J. Kauffmann, K.-R. Müller, and G. Montavon. Towards explaining anomalies: A deep Taylor decomposition of one-class models. Pattern Recognition, 101:107198, May 2020.
  • [69] D. R. Kelley, Y. Reshef, M. Bileschi, D. Belanger, C. Y. McLean, and J. Snoek. Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome research, pages gr–227819, 2018.
  • [70] J. Khan, J. S. Wei, M. Ringnér, L. H. Saal, M. Ladanyi, F. Westermann, F. Berthold, M. Schwab, C. R. Antonescu, C. Peterson, and P. S. Meltzer. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine, 7(6):673–679, June 2001.
  • [71] B. Kim, M. Wattenberg, J. Gilmer, C. J. Cai, J. Wexler, F. B. Viégas, and R. Sayres. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). In ICML, volume 80 of Proceedings of Machine Learning Research, pages 2673–2682. PMLR, 2018.
  • [72] P.-J. Kindermans, K. T. Schütt, M. Alber, K.-R. Müller, D. Erhan, B. Kim, and S. Dähne. Learning how to explain neural networks: Patternnet and patternattribution. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018.
  • [73] C. Kittel. Introduction to solid state physics, volume 8. Wiley New York, 2004.
  • [74] F. Klauschen, K.-R. Müller, A. Binder, M. Bockmayr, M. Hägele, P. Seegerer, S. Wienert, G. Pruneri, S. de Maria, S. Badve, et al. Scoring of tumor-infiltrating lymphocytes: From visual estimation to machine learning. Seminars in cancer biology, 52:151–157, 2018.
  • [75] M. Kohlbrenner, A. Bauer, S. Nakajima, A. Binder, W. Samek, and S. Lapuschkin. Towards best practice in explaining neural network decisions with LRP. CoRR, abs/1910.09840, 2019.
  • [76] N. Kokhlikyan, V. Miglani, M. Martin, E. Wang, J. Reynolds, A. Melnikov, N. Lunova, and O. Reblitz-Richardson. PyTorch Captum, 2019.
  • [77] K. Kourou, T. P. Exarchos, K. P. Exarchos, M. V. Karamouzis, and D. I. Fotiadis. Machine learning applications in cancer prognosis and prediction. Computational and Structural Biotechnology Journal, 13:8–17, 2015.
  • [78] F. Kratzert, M. Herrnegger, D. Klotz, S. Hochreiter, and G. Klambauer. NeuralHydrology – Interpreting LSTMs in Hydrology, pages 347–362. Springer International Publishing, Cham, 2019.
  • [79] O. Z. Kraus, L. J. Ba, and B. J. Frey. Classifying and segmenting microscopy images with deep multiple instance learning. Bioinformatics, 32(12):52–59, 2016.
  • [80] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1106–1114, 2012.
  • [81] W. Landecker, M. D. Thomure, L. M. A. Bettencourt, M. Mitchell, G. T. Kenyon, and S. P. Brumby. Interpreting individual classifications of hierarchical networks. In IEEE Symposium on Computational Intelligence and Data Mining, CIDM 2013, Singapore, 16-19 April, 2013, pages 32–38, 2013.
  • [82] S. Lapuschkin, A. Binder, G. Montavon, K.-R. Müller, and W. Samek. Analyzing classifiers: Fisher vectors and deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2912–2920, 2016.
  • [83] S. Lapuschkin, A. Binder, G. Montavon, K.-R. Müller, and W. Samek. The layer-wise relevance propagation toolbox for artificial neural networks. Journal of Machine Learning Research, 17(114):1–5, 2016.
  • [84] S. Lapuschkin, A. Binder, K.-R. Müller, and W. Samek. Understanding and comparing deep neural networks for age and gender classification. In Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCVW), pages 1629–38, 2017.
  • [85] S. Lapuschkin, S. Wäldchen, A. Binder, G. Montavon, W. Samek, and K.-R. Müller. Unmasking Clever Hans predictors and assessing what machines really learn. Nature Communications, 10(1):1096, 2019.
  • [86] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436, 2015.
  • [87] Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Efficient backprop. In Neural networks: Tricks of the trade, pages 9–48. Springer, 2012.
  • [88] K. Leino, S. Sen, A. Datta, M. Fredrikson, and L. Li. Influence-directed explanations for deep convolutional networks. In 2018 IEEE International Test Conference (ITC), pages 1–8. IEEE, 2018.
  • [89] M. Lin, Q. Chen, and S. Yan. Network in network. In International Conference of Learning Representations (ICLR), 2014.
  • [90] Z. C. Lipton. The mythos of model interpretability. ACM Queue, 16(3):30, 2018.
  • [91] Y. Liu, K. Barr, and J. Reinitz. Fully interpretable deep learning model of transcriptional control. bioRxiv, 2019.
  • [92] S. M. Lundberg and S. Lee. A unified approach to interpreting model predictions. In NIPS, pages 4768–4777, 2017.
  • [93] S. Ma, X. Song, and J. Huang. Supervised group lasso with applications to microarray data analysis. BMC Bioinformatics, 8, 2007.
  • [94] J. MacDonald, S. Wäldchen, S. Hauch, and G. Kutyniok. A rate-distortion framework for explaining neural network decisions. CoRR, abs/1905.11092, 2019.
  • [95] T. Miller. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence, 267:1–38, 2019.
  • [96] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. A. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • [97] G. Montavon. Gradient-based vs. propagation-based explanations: An axiomatic comparison. In Explainable AI, volume 11700 of Lecture Notes in Computer Science, pages 253–265. Springer, 2019.
  • [98] G. Montavon, A. Binder, S. Lapuschkin, W. Samek, and K.-R. Müller. Layer-wise relevance propagation: An overview. In Explainable AI, volume 11700 of Lecture Notes in Computer Science, pages 193–209. Springer, 2019.
  • [99] G. Montavon, S. Lapuschkin, A. Binder, W. Samek, and K.-R. Müller. Explaining nonlinear classification decisions with deep Taylor decomposition. Pattern Recognition, 65:211–222, 2017.
  • [100] G. Montavon, W. Samek, and K.-R. Müller. Methods for interpreting and understanding deep neural networks. Digital Signal Processing, 73:1–15, 2018.
  • [101] G. F. Montúfar, R. Pascanu, K. Cho, and Y. Bengio. On the number of linear regions of deep neural networks. In NIPS, pages 2924–2932, 2014.
  • [102] N. Morch, U. Kjems, L. K. Hansen, C. Svarer, I. Law, B. Lautrup, S. Strother, and K. Rehm. Visualization of neural networks using saliency maps. In Proceedings of ICNN’95-International Conference on Neural Networks, volume 4, pages 2085–2090, 1995.
  • [103] A. Mordvintsev, C. Olah, and M. Tyka. Inceptionism: Going deeper into neural networks, 2015.
  • [104] K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf. An introduction to kernel-based learning algorithms. IEEE transactions on neural networks, 12(2):181–201, 2001.
  • [105] M. Narayanan, E. Chen, J. He, B. Kim, S. Gershman, and F. Doshi-Velez. How do humans understand explanations from machine learning systems? an evaluation of the human-interpretability of explanation. CoRR, abs/1802.00682, 2018.
  • [106] A. M. Nguyen, A. Dosovitskiy, J. Yosinski, T. Brox, and J. Clune. Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. In NIPS, pages 3387–3395, 2016.
  • [107] A. M. Nguyen, J. Yosinski, and J. Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In CVPR, pages 427–436. IEEE Computer Society, 2015.
  • [108] A. M. Nguyen, J. Yosinski, and J. Clune. Multifaceted feature visualization: Uncovering the different types of features learned by each neuron in deep neural networks. CoRR, abs/1602.03616, 2016.
  • [109] R. Okuta, Y. Unno, D. Nishino, S. Hido, and C. Loomis. Cupy: A numpy-compatible library for nvidia gpu calculations. In Proceedings of Workshop on Machine Learning Systems (LearningSys) in The Thirty-first Annual Conference on Neural Information Processing Systems (NIPS), 2017.
  • [110] T. E. Oliphant. A guide to NumPy, volume 1. Trelgol Publishing USA, 2006.
  • [111] G. Papadopoulos, P. J. Edwards, and A. F. Murray. Confidence estimation methods for neural networks: a practical comparison. IEEE Trans. Neural Networks, 12(6):1278–1287, 2001.
  • [112] Y. Park, B. Kwon, J. Heo, X. Hu, Y. Liu, and T. Moon. Estimating PM2.5 concentration of the conterminous United States via interpretable convolutional neural networks. Environmental Pollution, 256:113395, 2020.
  • [113] V. Petsiuk, A. Das, and K. Saenko. RISE: randomized input sampling for explanation of black-box models. In British Machine Vision Conference 2018, BMVC 2018, Northumbria University, Newcastle, UK, September 3-6, 2018, page 151, 2018.
  • [114] K. Preuer, G. Klambauer, F. Rippmann, S. Hochreiter, and T. Unterthiner. Interpretable Deep Learning in Drug Discovery, pages 331–345. Springer International Publishing, Cham, 2019.
  • [115] G. Quellec, K. Charrière, Y. Boudi, B. Cochener, and M. Lamard. Deep image mining for diabetic retinopathy screening. Medical Image Analysis, 39:178–193, 2017.
  • [116] M. T. Ribeiro, S. Singh, and C. Guestrin. “Why should I trust you?”: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, pages 1135–1144, 2016.
  • [117] L. Rieger, P. Chormai, G. Montavon, L. Hansen, and K.-R. Müller. Structuring Neural Networks for More Explainable Predictions, pages 115–131. Springer, 2019.
  • [118] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386–408, 1958.
  • [119] A. S. Ross, M. C. Hughes, and F. Doshi-Velez. Right for the right reasons: Training differentiable models by constraining their explanations. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017, pages 2662–2670, 2017.
  • [120] R. Rothe, R. Timofte, and L. Van Gool. Dex: Deep expectation of apparent age from a single image. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 10–15, 2015.
  • [121] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, and F. Li. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
  • [122] W. Samek, A. Binder, G. Montavon, S. Lapuschkin, and K.-R. Müller. Evaluating the visualization of what a deep neural network has learned. IEEE transactions on neural networks and learning systems, 28(11):2660–2673, 2016.
  • [123] W. Samek, G. Montavon, A. Vedaldi, L. K. Hansen, and K.-R. Müller, editors. Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, volume 11700 of Lecture Notes in Computer Science. Springer, 2019.
  • [124] J. Schmidhuber. Deep learning in neural networks: An overview. Neural networks, 61:85–117, 2015.
  • [125] B. Schölkopf and A. J. Smola. Learning with Kernels: support vector machines, regularization, optimization, and beyond. Adaptive computation and machine learning series. MIT Press, 2002.
  • [126] B. Schölkopf, A. J. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319, 1998.
  • [127] K. T. Schütt, F. Arbabzadah, S. Chmiela, K. R. Müller, and A. Tkatchenko. Quantum-chemical insights from deep tensor neural networks. Nature Communications, 8:13890, 2017.
  • [128] K. T. Schütt, M. Gastegger, A. Tkatchenko, and K.-R. Müller. Quantum-chemical insights from interpretable atomistic neural networks. In Explainable AI, volume 11700 of Lecture Notes in Computer Science, pages 311–330. Springer, 2019.
  • [129] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 618–626, 2017.
  • [130] C. Shan. Learning local features for age estimation on real-life faces. In Proceedings of the 1st ACM international workshop on Multimodal pervasive video analysis, pages 23–28. ACM, 2010.
  • [131] L. S. Shapley. A value for n-person games. In Contributions to the Theory of Games (AM-28), Volume II. Princeton University Press, 1953.
  • [132] A. Shrikumar, P. Greenside, and A. Kundaje. Learning important features through propagating activation differences. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 3145–3153, 2017.
  • [133] A. Shrikumar, P. Greenside, A. Shcherbina, and A. Kundaje. Not just a black box: Learning important features through propagating activation differences. CoRR, abs/1605.01713, 2016.
  • [134] A. Shrikumar, J. Su, and A. Kundaje. Computationally efficient measures of internal neuron importance. CoRR, abs/1807.09946, 2018.
  • [135] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. In ICLR (Workshop Poster), 2014.
  • [136] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • [137] D. Smilkov, N. Thorat, B. Kim, F. B. Viégas, and M. Wattenberg. SmoothGrad: removing noise by adding noise. CoRR, abs/1706.03825, 2017.
  • [138] C. Soneson, S. Gerster, and M. Delorenzi. Batch effect confounding leads to strong bias in performance estimates obtained by cross-validation. PLoS ONE, 9(6):e100335, June 2014.
  • [139] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. A. Riedmiller. Striving for simplicity: The all convolutional net. In International Conference on Learning Representations (ICLR), 2015.
  • [140] I. Sturm, S. Lapuschkin, W. Samek, and K.-R. Müller. Interpretable deep neural networks for single-trial EEG classification. Journal of Neuroscience Methods, 274:141–145, 2016.
  • [141] D. Su, H. Zhang, H. Chen, J. Yi, P. Chen, and Y. Gao. Is robustness the cost of accuracy? - A comprehensive study on the robustness of 18 deep image classification models. In ECCV (12), volume 11216 of Lecture Notes in Computer Science, pages 644–661. Springer, 2018.
  • [142] M. Sundararajan, A. Taly, and Q. Yan. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 3319–3328, 2017.
  • [143] W. R. Swartout and J. D. Moore. Explanation in second generation expert systems. In Second Generation Expert Systems, pages 543–585. Springer Berlin Heidelberg, 1993.
  • [144] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, and R. Fergus. Intriguing properties of neural networks. In ICLR (Poster), 2014.
  • [145] A. W. Thomas, H. R. Heekeren, K.-R. Müller, and W. Samek. Interpretable LSTMs for whole-brain neuroimaging analyses. CoRR, abs/1810.09945, 2018.
  • [146] A. W. Thomas, H. R. Heekeren, K.-R. Müller, and W. Samek. Analyzing neuroimaging data through recurrent deep learning models. Frontiers in Neuroscience, 13:1321, 2019.
  • [147] H. Traunmüller and A. Eriksson. The frequency range of the voice fundamental in the speech of male and female adults. Unpublished manuscript, 1995.
  • [148] M. Tsang, D. Cheng, and Y. Liu. Detecting statistical interactions from neural network weights. In ICLR (Poster), 2018.
  • [149] B. Ustun, A. Spangher, and Y. Liu. Actionable recourse in linear classification. In FAT, pages 10–19. ACM, 2019.
  • [150] V. Vapnik. The nature of statistical learning theory. Springer, 1995.
  • [151] M. M.-C. Vidovic, N. Görnitz, K.-R. Müller, G. Rätsch, and M. Kloft. Opening the black box: Revealing interpretable sequence motifs in kernel-based learning algorithms. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 137–153. Springer, 2015.
  • [152] C. Von der Malsburg. Binding in models of perception and brain function. Current opinion in neurobiology, 5(4):520–526, 1995.
  • [153] A. Warnecke, D. Arp, C. Wressnegger, and K. Rieck. Don’t paint it black: White-box explanations for deep learning in computer security. CoRR, abs/1906.02108, 2019.
  • [154] C. K. Williams and C. E. Rasmussen. Gaussian processes for machine learning. MIT Press, Cambridge, MA, 2006.
  • [155] Y. Yang, V. Tresp, M. Wunderle, and P. A. Fasching. Explaining therapy predictions with layer-wise relevance propagation in neural networks. In IEEE International Conference on Healthcare Informatics, ICHI 2018, New York City, NY, USA, June 4-7, 2018, pages 152–162, 2018.
  • [156] I.-C. Yeh. Modeling of strength of high-performance concrete using artificial neural networks. Cement and Concrete Research, 28(12):1797–1808, Dec. 1998.
  • [157] K. Young, G. Booth, B. Simpson, R. Dutton, and S. Shrapnel. Deep neural network or dermatologist? In Interpretability of Machine Intelligence in Medical Image Computing and Multimodal Learning for Clinical Decision Support, pages 48–55. Springer, 2019.
  • [158] T. Zahavy, N. Ben-Zrihem, and S. Mannor. Graying the black box: Understanding DQNs. In ICML, volume 48 of JMLR Workshop and Conference Proceedings, pages 1899–1908, 2016.
  • [159] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European Conference Computer Vision - ECCV 2014, pages 818–833, 2014.
  • [160] J. Zhang, S. A. Bargal, Z. Lin, J. Brandt, X. Shen, and S. Sclaroff. Top-down neural attention by excitation backprop. International Journal of Computer Vision, 126(10):1084–1102, 2018.
  • [161] Z. Zhang, P. Chen, M. McGough, F. Xing, C. Wang, M. Bui, Y. Xie, M. Sapkota, L. Cui, J. Dhillon, et al. Pathologist-level interpretable whole-slide cancer diagnosis with deep learning. Nature Machine Intelligence, 1(5):236–245, 2019.
  • [162] B. Zhou, D. Bau, A. Oliva, and A. Torralba. Interpreting deep visual representations via network dissection. IEEE transactions on pattern analysis and machine intelligence, 41(9):2131–2145, 2018.
  • [163] B. Zhou, A. Khosla, À. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 2921–2929, 2016.
  • [164] X.-P. Zhu, J.-M. Dai, C.-J. Bian, Y. Chen, S. Chen, and C. Hu. Galaxy morphology classification with deep convolutional neural networks. Astrophysics and Space Science, 364(4):55, 2019.
  • [165] L. M. Zintgraf, T. S. Cohen, T. Adel, and M. Welling. Visualizing deep neural network decisions: Prediction difference analysis. In International Conference on Learning Representations (ICLR), 2017.