Deep neural networks (DNNs) are increasingly used in place of traditionally engineered software in many areas. DNNs are complex non-linear functions with algorithmically generated (and not engineered) coefficients, and therefore are effectively “black boxes”. They are given an input and produce output, but the functional processes that generate these outputs are difficult to explain. The goal of explainable AI is to create artifacts that provide a rationale for why a neural network generates a particular output for a particular input. This is argued to enable stakeholders to understand and appropriately trust neural networks.
A typical use-case of DNNs is to classify high-dimensional inputs such as images. DNNs are multi-layered networks with a predefined structure that consists of layers of neurons. The coefficients for the neurons are determined by a training process on a data set with given classification labels. The standard criterion for the adequacy of training is the accuracy of the network on a separate validation data set. This criterion is clearly only as comprehensive as the validation data set. In particular, this approach suffers from the risk that the validation data set is lacking an important instance [90, 39].
Explanations have been claimed to address this problem by providing additional insight into the decision process of a neural network [26, 56]. Explanations can be used to guide the training process to the missing inputs and to signal when the decisions are sufficiently accurate.
State of the art
The state of the art in producing explanations for image classifiers is an approach called SHapley Additive exPlanations (SHAP), which assigns an “importance value” to each pixel. The algorithm treats the classification of a multi-dimensional input as a multi-player collaborative game, where each player represents a dimension. The importance value of a pixel is the contribution it makes to the classification. This method provides a reasonable, and accurate, explanation based on game theory. In practice, owing to the computational complexity of the algorithm, the implementation approximates the solution, leading to inaccuracies. In addition, since SHAP combines multiple techniques that are conceptually different, including [61, 66, 16], it is intrinsically complex.
In traditional software development, statistical fault localization measures have a substantial track record of finding causes of errors to aid in the debugging process of sequential programs [54, 47, 49, 41]. These measures rank program locations by counting the number of times a particular location is visited in passing and in failing executions for a given test suite and by applying a statistical formula. Hence, they are comparatively inexpensive to compute. Their precision depends on the quality of the test suite [82, 86, 59, 34, 64, 17]. There are more than a hundred measures mentioned in the literature. Some of the most widely used measures are Zoltar, Ochiai, Tarantula, and Wong-II [20, 55, 35, 84], which have been shown to outperform other measures in experimental comparisons.
We present Protozoa, which provides explanations for DNNs that classify images. Our explanations are synthesized from the ranking of image pixels by statistical fault localization measures using test suites constructed by randomly mutating a given input image. Our tool integrates four well-known measures (Zoltar, Ochiai, Tarantula and Wong-II). Experimental results on the ImageNet data set show that our explanations are visually better than those generated by SHAP. As “visually better” is not an objective metric, we measure the efficiency of the generation of adversarial examples as a proxy for the quality of our explanations. While clearly not identical to “visually better”, this metric has the advantage that it is objective and algorithmically computable. Our experimental results show that the explanations produced by Protozoa yield more adversarial examples than those produced by SHAP. An additional advantage of Protozoa is that it treats the DNN as a black box, and that it is highly scalable.
The tool and the data for the experiments described in this paper, together with the scripts and the experimental setup, can be downloaded online at https://github.com/theyoucheng/protozoa.
II-A Deep neural networks (DNNs)
We briefly review the relevant definitions of deep neural networks. Let $f: \mathcal{I} \rightarrow \mathcal{O}$ be a deep neural network $\mathcal{N}$ with $N$ layers. For a given input $x \in \mathcal{I}$, $f(x) \in \mathcal{O}$ calculates the output of the DNN, which could be, for instance, a classification label. Images are still the most popular inputs for DNNs, and in this paper we focus on DNNs that classify images. Specifically, we have
$$f(x) = f_N(\ldots f_2(f_1(x; W_1, b_1); W_2, b_2) \ldots ; W_N, b_N) \qquad (1)$$
where $W_i$ and $b_i$ for $i = 1, 2, \ldots, N$ are learnable parameters, and $f_i(z_{i-1}; W_i, b_i)$ is the layer function that maps the output of layer $i-1$, i.e., $z_{i-1}$, to the input of layer $i$. The combination of the layer functions yields highly complex behavior, and the analysis of the information flow within a DNN is challenging. There are a variety of layer functions for DNNs, including, e.g., fully connected layers, convolutional layers and max-pooling layers.
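To make the layered structure concrete, the following minimal sketch composes two fully connected layer functions into a network. The helper names (`dense_relu`, `compose`, `toy_net`) and the hand-picked parameters are hypothetical and serve only as an illustration; real image classifiers use many more layers and learned parameters.

```python
from functools import reduce


def dense_relu(weights, bias):
    """Return a layer function z -> relu(W z + b) for list-based W and b."""
    def layer(z):
        return [
            max(0.0, sum(w * v for w, v in zip(row, z)) + b)
            for row, b in zip(weights, bias)
        ]
    return layer


def compose(layers):
    """Chain layer functions so that the output of one layer feeds the next."""
    return lambda x: reduce(lambda z, f: f(z), layers, x)


# Hypothetical two-layer network with hand-picked (not learned) parameters.
toy_net = compose([
    dense_relu([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    dense_relu([[1.0, 1.0]], [-0.5]),
])

output = toy_net([2.0, 1.0])
```

The explanation technique discussed in this paper never inspects the individual layers; it only calls the composed function on whole inputs.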
Our algorithm is independent of the specific internals of the DNN. Given a particular input image $x$ and $\mathcal{N}$’s output $y$, we present to the user a subset of the pixels of $x$ that explain why $\mathcal{N}$ outputs $y$ when given $x$. In the following, we use $\mathcal{N}[x]$ to denote the output of $\mathcal{N}$ for an input image $x$.
II-B Spectrum-based fault localization (SBFL)
Our work is inspired by spectrum-based fault localization [54, 47, 49, 91, 82, 86, 59, 34, 64, 17, 83, 20, 55, 35, 84, 41], which has been widely used as an efficient approach to automatically locate root causes of failures of programs. SBFL techniques rank program elements (e.g., statements or assignments) based on their suspiciousness scores. Intuitively, a program element is more suspicious if it appears in failed executions more frequently than in correct executions (the exact formulas for ranking differ between the measures). Diagnosis of the faulty program can then be conducted by examining the ranked list of elements in descending order of their suspiciousness until the culprit for the fault is found.
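The SBFL procedure first executes the program under test using a set of inputs. It records the program executions as program spectra, meaning that the execution is instrumented to modify a set of Boolean flags that indicate whether a particular program element was executed. The task of a fault localization tool is to compute a ranking of the program elements based on the program spectra. Following the notation commonly used in the SBFL literature, the suspiciousness score of each program statement $s$ is calculated from a set of parameters $\langle a^s_{ep}, a^s_{ef}, a^s_{np}, a^s_{nf} \rangle$ that give the number of times the statement $s$ is executed ($e$) or not executed ($n$) on passing ($p$) and on failing ($f$) tests. For instance, $a^s_{ep}$ is the number of tests that have passed and that have executed $s$.
A large number of measures have been proposed to calculate the suspiciousness score of each program element. We list below some of the most widely used measures; those are also the measures that we use in our ranking procedure.
Ochiai: $\dfrac{a^s_{ef}}{\sqrt{(a^s_{ef} + a^s_{nf})(a^s_{ef} + a^s_{ep})}}$ (2a)
Tarantula: $\dfrac{\frac{a^s_{ef}}{a^s_{ef} + a^s_{nf}}}{\frac{a^s_{ef}}{a^s_{ef} + a^s_{nf}} + \frac{a^s_{ep}}{a^s_{ep} + a^s_{np}}}$ (2b)
Zoltar: $\dfrac{a^s_{ef}}{a^s_{ef} + a^s_{nf} + a^s_{ep} + \frac{10000 \, a^s_{nf} \, a^s_{ep}}{a^s_{ef}}}$ (2c)
Wong-II: $a^s_{ef} - a^s_{ep}$ (2d)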
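As a sketch, the four measures named above can be written as plain Python functions over the four counts. The formulas are the standard definitions from the SBFL literature; the function names and the convention of returning 0.0 whenever a denominator vanishes are assumptions made for this illustration and are not taken from the tool.

```python
import math


def ochiai(a_ep, a_ef, a_np, a_nf):
    """a_ef / sqrt((a_ef + a_nf) * (a_ef + a_ep)); 0.0 if undefined."""
    denom = math.sqrt((a_ef + a_nf) * (a_ef + a_ep))
    return a_ef / denom if denom else 0.0


def tarantula(a_ep, a_ef, a_np, a_nf):
    """Failing-execution ratio normalised by failing plus passing ratio."""
    fail = a_ef / (a_ef + a_nf) if a_ef + a_nf else 0.0
    passed = a_ep / (a_ep + a_np) if a_ep + a_np else 0.0
    return fail / (fail + passed) if fail + passed else 0.0


def zoltar(a_ep, a_ef, a_np, a_nf):
    """a_ef / (a_ef + a_nf + a_ep + 10000 * a_nf * a_ep / a_ef)."""
    if a_ef == 0:
        return 0.0  # assumed convention: no failing execution, score 0
    return a_ef / (a_ef + a_nf + a_ep + 10000.0 * a_nf * a_ep / a_ef)


def wong_ii(a_ep, a_ef, a_np, a_nf):
    """a_ef - a_ep."""
    return float(a_ef - a_ep)
```

All four functions take the counts in the same order, so a ranking procedure can switch between measures by passing a different function.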
The tool then presents to the user the list of program elements in the order of descending suspiciousness score. As reported in surveys [83, 49, 41], there is no single best measure for fault localization. Different measures perform better on different types of applications.
III What is an explanation?
An adequate explanation of an output of an automated procedure is essential in many areas, including verification, planning, diagnosis, etc. It is clear that an explanation is essential in order to increase a user’s confidence in the result or to determine whether there is a fault in the automated procedure (if the explanation does not make sense). It is less clear how to define what a “useful” explanation is. There have been a number of definitions of explanations over the years in various domains of computer science [10, 18, 57], philosophy and statistics, and more recently in explainable AI, which is advocated, among others, by DARPA to promote understanding, trust, and adoption of future autonomous systems based on learning algorithms (and, in particular, image classification DNNs).
DARPA provides a list of questions that a good explanation should answer and an epistemic state of the user after receiving a good explanation. The description of this epistemic state boils down to adding useful information about the output of the algorithm and increasing trust of the user in the algorithm.
In this paper, we are going to loosely adopt the definition of explanations by Halpern and Pearl, which is based on their definition of actual causality. Roughly speaking, Halpern and Pearl state that a good explanation gives an answer to the question “why did this outcome occur”, and is similar in spirit to DARPA’s informal description. As we are not defining our setting in terms of actual causality, we are omitting the parts of the definition that refer to causal models and causal settings. The remaining parts of the definition of explanation are:
an explanation is a sufficient cause of the outcome;
an explanation is a minimal such cause (that is, it does not contain irrelevant or redundant elements);
an explanation is not obvious; in other words, before being given the explanation, the user could conceivably imagine other explanations for the outcome.
In image classification using DNNs, the non-obviousness holds for all but extremely trivial images. Translating the first two conditions (sufficiency and minimality) into our setting, we get the following definition:
Definition 1. An explanation in image classification is a minimal subset of pixels of a given input image that is sufficient for the DNN to classify the image.
A straightforward approach to computing an explanation consistent with Definition 1 would be to check all subsets of pixels of a given image for minimality and sufficiency for the DNN to classify the image. The run-time complexity of this approach, however, is exponential in the size of the image, and is hence infeasible for all but very tiny images. In Section IV we describe a different approach to computing an explanation and argue that it produces a good approximation for a precise explanation.
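The straightforward approach can be sketched as an exhaustive search over pixel subsets in order of increasing size; the first sufficient subset found is then also minimal. The function name, the flat-list image representation, the background value and the toy classifier in the usage lines are all hypothetical. The nested loops visit up to $2^n$ subsets for $n$ pixels, which is why this only works for very tiny inputs.

```python
from itertools import combinations


def brute_force_explanation(pixels, classify, target, background=0):
    """Return a smallest set of pixel indices that alone yields `target`.

    All pixels outside the candidate set are replaced by `background`.
    Subsets are tried by increasing size, so the result is minimal.
    """
    n = len(pixels)
    for size in range(n + 1):
        for keep in combinations(range(n), size):
            kept = set(keep)
            masked = [p if i in kept else background
                      for i, p in enumerate(pixels)]
            if classify(masked) == target:
                return kept
    return None


# Toy stand-in for a DNN: the label depends only on total brightness.
def toy_classify(img):
    return "bright" if sum(img) >= 10 else "dark"


image = [9, 1, 5, 5]
explanation = brute_force_explanation(image, toy_classify, toy_classify(image))
```

For this toy input no single pixel is sufficient, and the first sufficient pair in enumeration order is returned.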
There exist other definitions of explanations for decisions of DNNs in the literature. They are not used for presenting an explanation to the user, but rather as a theoretical model for ranking the pixels. We argue that our definition suits its purpose of explaining the DNN’s decisions to the user, as it matches the intuition of what would constitute a good explanation and is consistent with the body of work on explanations in AI.
IV Spectrum-Based Explanation (SBE) for DNNs
We propose a lightweight black-box explanation technique based on spectrum-based fault localization. In traditional software development, SBFL measures are used for ranking program elements that cause a failure. In our setup, the goal is different: we are searching for an explanation of why a particular input to a given DNN yields a particular output; our technique is agnostic to whether the output is correct.
Constructing the test suite
SBFL requires test inputs. Given an input image $x$ that is classified by the DNN $\mathcal{N}$ as $y = \mathcal{N}[x]$, we generate a set $T(x)$ of images by randomly mutating $x$. A legal mutation masks a subset of the pixels of $x$, i.e., sets these pixels to the background color. The DNN computes an output for each mutant; we annotate it with “$y$” if that output matches that of $x$, and with “$\neg y$” to indicate that the output differs. The resulting test suite $T(x)$ of annotated mutants is an input to the Protozoa algorithm.
We assume that the original input $x$ consists of $n$ pixels $\mathcal{P} = \{p_1, \ldots, p_n\}$. Each test input $t \in T(x)$ exhibits a particular spectrum for the pixel set, in which some pixels are the same as in the original input $x$ and others are masked. The presence or masking of a pixel in $x$ may affect the DNN’s output. In the following, we will use SBFL measures to find a set of pixels that constitute an explanation of the DNN’s output for $x$.
We use SBFL measures to rank the set of pixels of $x$ by slightly abusing the notions of passing and failing tests. For a pixel $p_i$ of $x$, we compute the vector $\langle a^i_{ep}, a^i_{ef}, a^i_{np}, a^i_{nf} \rangle$ as follows:
$a^i_{ep}$ stands for the number of mutants in $T(x)$ annotated with “$y$” in which $p_i$ is not masked;
$a^i_{ef}$ stands for the number of mutants in $T(x)$ annotated with “$\neg y$” in which $p_i$ is not masked;
$a^i_{np}$ stands for the number of mutants in $T(x)$ annotated with “$y$” in which $p_i$ is masked;
$a^i_{nf}$ stands for the number of mutants in $T(x)$ annotated with “$\neg y$” in which $p_i$ is masked.
Once we construct the vector $\langle a^i_{ep}, a^i_{ef}, a^i_{np}, a^i_{nf} \rangle$ for every pixel, we can apply the SBFL measures discussed in Section II-B to rank the pixels in $x$ for their importance regarding the DNN’s output (the importance corresponds to the suspiciousness score computed by SBFL measures). A set of top-ranked pixels (a fraction of the pixels for most images) is provided to the user as a heuristic explanation of the decision of the DNN. This set is chosen by iteratively adding pixels to the set in the descending order of their ranking (that is, we start with the highest-ranked pixels) until the set becomes sufficient for the DNN to classify the image.
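The counting step described above can be sketched as follows. The representation is an assumption made for this illustration: each mutant is a pair of a per-pixel mask (a list of Booleans, `True` meaning the pixel is masked) and a flag that records whether the DNN's output on the mutant matches its output on the original image. The function name `pixel_spectra` is hypothetical.

```python
def pixel_spectra(mutants):
    """Compute the four counts per pixel from annotated mutants.

    `mutants` is a list of (masked, same_label) pairs, where `masked[i]`
    is True if pixel i is masked in that mutant and `same_label` is True
    if the mutant keeps the original label.
    """
    n = len(mutants[0][0])
    spectra = [{"a_ep": 0, "a_ef": 0, "a_np": 0, "a_nf": 0}
               for _ in range(n)]
    for masked, same_label in mutants:
        for i, is_masked in enumerate(masked):
            if is_masked:
                key = "a_np" if same_label else "a_nf"
            else:
                key = "a_ep" if same_label else "a_ef"
            spectra[i][key] += 1
    return spectra


# Three hypothetical mutants of a two-pixel image.
example = [
    ([False, True], True),
    ([True, False], False),
    ([False, False], True),
]
counts = pixel_spectra(example)
```

Each mutant contributes exactly one count to every pixel, so the four counts of a pixel always sum to the number of mutants.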
IV-A Spectrum-based explanation (SBE) algorithm
We now present our algorithms for generating test suites and computing explanations. The computation of an SBE for a given DNN is described in detail in Algorithm 1. Given the DNN $\mathcal{N}$, a particular input $x$ and a particular fault localization measure $M$, it synthesizes the subset of pixels $\mathcal{P}^{exp}$ that presents an approximation of an explanation according to Definition 1.
Algorithm 1 starts by calling the test-generation procedure to generate the set $T(x)$ of test inputs. It then computes the vector $\langle a^i_{ep}, a^i_{ef}, a^i_{np}, a^i_{nf} \rangle$ for each pixel $p_i \in \mathcal{P}$ using the set $T(x)$. Then, the algorithm computes the ranking of each pixel according to the specified measure $M$. Formulas for the measures are as in Equations (2a)–(2d). The pixels are listed in descending order of ranking (from high to low).
Next, Algorithm 1 constructs a subset of pixels $\mathcal{P}^{exp}$ to explain $\mathcal{N}$’s output on this particular input $x$ as follows. We add pixels to $\mathcal{P}^{exp}$, in ranking order, while $\mathcal{N}$’s output on $\mathcal{P}^{exp}$ does not match $\mathcal{N}[x]$. This process terminates when $\mathcal{N}$’s output is the same as on the whole image $x$. Finally, $\mathcal{P}^{exp}$ is returned as the explanation. At the end of this section we discuss why $\mathcal{P}^{exp}$ is not a precise explanation according to Definition 1 and argue that it is a good approximation (coinciding with a precise explanation in most cases).
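The second half of Algorithm 1, the greedy construction of the explanation from a ranking, can be sketched as below. This is an illustration under stated assumptions and not the tool's implementation: the image is a flat list of pixel values, `scores` holds one importance value per pixel (from any measure), `classify` is a black-box stand-in for the DNN, and all names are hypothetical.

```python
def explanation_from_ranking(image, scores, classify, background=0):
    """Add pixels in descending score order until the label is restored."""
    target = classify(image)
    order = sorted(range(len(image)), key=lambda i: scores[i], reverse=True)
    candidate = [background] * len(image)  # start from a fully masked image
    explanation = []
    for idx in order:
        if classify(candidate) == target:
            break  # the pixels added so far are already sufficient
        candidate[idx] = image[idx]
        explanation.append(idx)
    return explanation


# Toy stand-in for a DNN: the label depends only on total brightness.
def toy_classify(img):
    return "bright" if sum(img) >= 6 else "dark"


image = [5, 1, 4, 0]
scores = [0.9, 0.1, 0.8, 0.2]
explanation = explanation_from_ranking(image, scores, toy_classify)
```

Because the function only needs one score per pixel, the same loop can be driven by a ranking from any source, which is what makes a like-for-like comparison between ranking methods possible.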
As the quality of the ranked list computed by SBFL measures inherently depends on the quality of the test suite, the choice of the set $T(x)$ of mutant images plays an important role in our spectrum-based explanation algorithm for DNNs. While it is beyond the scope of this paper to identify the best set $T(x)$, we propose an effective method for generating $T(x)$ in Algorithm 2. The core idea of Algorithm 2 is to balance the number of test inputs annotated with “$y$” (that play the role of the passing traces) with the number of test inputs annotated with “$\neg y$” (that play the role of the failing traces).
The fraction $\sigma$ of the set of pixels of $x$ that are going to be masked in a mutant is initialized by a random or selected number between $0$ and $1$ and is later updated at each iteration according to the decision of $\mathcal{N}$ on the previously constructed mutant. In each iteration of the algorithm, a randomly chosen set of $\sigma \cdot n$ pixels in $x$ is masked and the resulting new input is added to $T(x)$. Roughly speaking, if a current mutant is not classified as being the same as $x$, we decrease the fraction of masked pixels (by a pre-defined small number $\epsilon$); if a current mutant is classified as $x$, we increase the fraction of masked pixels (by the same $\epsilon$).
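The feedback loop of the mutant generation can be sketched as follows. The function name, the flat-list image representation, the clamping of the fraction to the interval from 0 to 1 and the default values `sigma=0.2` and `epsilon=0.1` are assumptions chosen for this illustration; they are not the settings used in the experiments.

```python
import random


def generate_mutants(image, classify, m, sigma=0.2, epsilon=0.1,
                     background=0, seed=0):
    """Generate m annotated mutants, adapting the masked fraction."""
    rng = random.Random(seed)
    n = len(image)
    target = classify(image)
    fraction = sigma
    mutants = []
    for _ in range(m):
        k = min(n, max(0, round(fraction * n)))
        masked_idx = set(rng.sample(range(n), k))
        mutant = [background if i in masked_idx else p
                  for i, p in enumerate(image)]
        same_label = classify(mutant) == target
        mutants.append(([i in masked_idx for i in range(n)], same_label))
        # Label kept: mask more next time. Label changed: mask less.
        step = epsilon if same_label else -epsilon
        fraction = min(1.0, max(0.0, fraction + step))
    return mutants


# Toy stand-in for a DNN: the label is kept while at least half the
# pixels of an all-ones image remain visible.
def toy_classify(img):
    return sum(img) >= 10


mutants = generate_mutants([1] * 20, toy_classify, m=10)
```

On the toy classifier the masked fraction climbs until the label flips and then oscillates around that threshold, which is how the procedure yields both kinds of annotations in the test suite.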
IV-B Relationship between SBE and Definition 1
Ranking the pixels using SBFL measures and then selecting the top-ranked pixels benefits from running time complexity that is linear in the size of the set $T(x)$ and the size of the image, and is hence much more efficient than the straightforward computation of all possible explanations as described in Section III. One of the reasons for this low complexity is that there can be many possible explanations (exponentially many, as any subset of pixels of the image can be an explanation), whereas we only need to provide one explanation to the user. It is also easy to see that the heuristic explanation that we provide is a sufficient set of pixels, since this is a stopping condition for adding pixels back to the image.
However, the set $\mathcal{P}^{exp}$ might not be minimal, and thus does not, strictly speaking, satisfy all conditions of Definition 1. The reason for possible non-minimality is that the pixels of $x$ are added to the explanation in the order of their ranking, with the highest-ranking pixels being added first. It is, therefore, possible that there is a high-ranked pixel that was added in one of the previous iterations, but is now not necessary for the correct classification of the image (note that the process of adding pixels to the explanation stops when the DNN successfully classifies the image; this, however, shows minimality only with respect to the order of addition of pixels). Based on empirical evidence, we believe that this is unlikely, as higher-ranked pixels tend to be more important to the correct classification than lower-ranked ones when using a good SBFL measure.
Finally, the SBEs we provide are clearly not obvious (defined in Section III), as the users do not know them in advance, thus fulfilling the condition of enriching the user’s knowledge of the DNN’s decisions.
V Experimental Evaluation
In this section we describe the experimental evaluation of Protozoa. We start with showing that explanations generated by Protozoa match the human intuition about an explanation for a given classification of an image (Section V-B). We continue the evaluation by comparing Protozoa to SHAP (Section V-C). Since the degree of the alignment of our explanations with human intuition is expensive to quantify, we introduce two proxies for comparison and demonstrate experimental results on large sets of images. As SBFL measures inherently depend on the quality of the test suite, in Section V-D we present an evaluation demonstrating the impact of the size of the set of mutants and the balance between passing and failing mutants in this set on the quality of explanations generated by Protozoa. Finally, in Section V-E we show that SBE can also be used to assess the quality of training of a given DNN: a non-intuitive explanation indicates that the training was not sufficient.
We implement the spectrum-based explanation (SBE) algorithm for DNNs presented in Section IV in the tool Protozoa. The tool supports four fault localization measures, which are Tarantula, Zoltar, Ochiai and Wong-II. We evaluate the quality of explanations provided by Protozoa on image inputs from the ImageNet Large Scale Visual Recognition Challenge, which is the most sophisticated and comprehensive benchmark for the DNN recognition problem. Popular neural networks for ImageNet such as Xception, MobileNet, VGG16 and InceptionV3 have been integrated into the tool.
As SBE is very lightweight, we were able to conduct all experiments on a laptop with an Intel i7-7820HQ (8) running at 3.9 GHz and with 16 GB of memory. None of the experiments require a stronger machine.
We configure the heuristic tests generation in Algorithm 2 with and , and the size of the test set equal to . These values have been chosen empirically and remain the same through all experiments. It is quite possible that they are not fine tuned to all image inputs, and that for some inputs increasing or tuning and would produce a better explanation.
Tarantula is used as the default measure in Protozoa. Again, this was chosen empirically, as it seems that the explanations provided by Tarantula are the most intuitive for the majority of the images.
V-B What does a well-trained DNN see?
We apply the SBE approach to Google’s Xception model to illustrate how a state-of-the-art DNN makes decisions. A recent comparison of DNN models for ImageNet suggests that Xception is one of the best models for this data set.
For each input image, the Xception DNN outputs a classification of the image out of 1,000 classes, and we apply Protozoa to compute an explanation of this classification, that is, a subset of top-ranked pixels of the original image that explain why Xception classifies the image in the way it does.
Figure 1 exhibits a set of images and their corresponding explanations found by Protozoa. More results can be found at https://github.com/theyoucheng/protozoa, and we encourage the reader to try Protozoa with different input images and neural network models.
Overall, the explanations computed by Protozoa match human intuition. It is straightforward for a human to identify which part of the image is supposed to trigger the DNN’s decision. As we show later, the explanations can also be used to assess the quality of the training of the DNN. As Xception is a high-quality, well-trained DNN, the explanations for its decisions are highly consistent with our intuition, and, in particular, do not contain significant parts of the background, which should be irrelevant for the DNN’s decision.
V-C Quantitative evaluation
As this is the first paper that applies spectrum-based fault localization to explaining the outputs of deep neural networks, we focus on the feasibility and usefulness of the explanations, and less on possible performance optimizations.
We compare Protozoa with SHAP (https://github.com/slundberg/shap), the state-of-the-art machine learning tool to explain DNN outputs. Given a particular input image, SHAP assigns each of its pixels an importance value; higher values correspond to pixels that are more important for the DNN’s output. The explanation then can be constructed by identifying the pixels that are top ranked. For the comparison between the tools, we replace the pixel ranking in Algorithm 1 with the importance ranking computed by SHAP.
We use MobileNet as the DNN model that is to be explained: it is nearly as accurate as VGG16, while being substantially faster. Moreover, we have observed that the explanations generated for several mainstream ImageNet models, including Xception, MobileNet and VGG16, are largely consistent.
It is challenging to evaluate the quality of DNN explanations, owing to the lack of an objective measure. As we saw in Section V-B, the quality of explanations is a matter of perception by humans. To compare several explanations for DNN outputs automatically at a large scale, we need computable metrics. We design two proxies for this purpose: (1) the size of generated explanations, and (2) the generation of adversarial examples.
Size of explanations
An explanation computed by Algorithm 1 is a subset $\mathcal{P}^{exp}$ of top-ranked pixels out of the set $\mathcal{P}$ of all pixels that is sufficient for the DNN to classify the image correctly. When comparing explanations, the ranking for Protozoa is computed as described in the algorithm; for SHAP, we use the importance values of the pixels. We define the size of the explanation as $|\mathcal{P}^{exp}| / |\mathcal{P}|$. Intuitively, the smaller this size is, the more accurately we captured the decision process of the DNN, hence smaller explanations are considered better.
Figure 2 gives the comparison with respect to the size of generated explanations between our SBE approach and SHAP. For each point in Figure 2, the position on the $x$-axis indicates the size of the explanation, and the position on the $y$-axis gives the accumulated percentage of explanations: that is, all generated explanations with smaller or equal sizes. Figure 2 contains the SBE results for four SBE measures (Ochiai, Zoltar, Tarantula and Wong-II) that are used for ranking; the blue line for Protozoa represents the explanation with smallest size among the four measures.
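The two quantities behind such a plot can be sketched in a few lines. The sketch assumes that the size of an explanation is measured as the fraction of the image's pixels it contains; the function names and the sample sizes are hypothetical and are not experimental data.

```python
def explanation_size(explanation, n_pixels):
    """Size of an explanation as a fraction of all pixels in the image."""
    return len(explanation) / n_pixels


def cumulative_percentage(sizes, threshold):
    """Percentage of explanations whose size is at most `threshold`."""
    within = sum(1 for s in sizes if s <= threshold)
    return 100.0 * within / len(sizes)


# Made-up sizes for four images, for illustration only.
sizes = [0.1, 0.2, 0.5, 0.9]
share = cumulative_percentage(sizes, 0.5)
```

Evaluating `cumulative_percentage` over a grid of thresholds gives one curve per ranking method; a curve that rises earlier corresponds to smaller explanations.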
The data in Figure 2 allows us to make the following observations.
Using spectrum-based ranking for explanations is significantly better in terms of the size of the explanation compared to SHAP on the images in ImageNet.
Except for Wong-II, the results produced by spectrum-based measures are very close to each other; on the other hand, no single measure consistently outperforms the others on all input images; hence Protozoa, which chooses the smallest explanation for each image, outperforms all individual measures.
Figure 3 gives an example of an input image (“original image”, depicting a raccoon) and the explanations produced when using four SBFL measures and when using SHAP. We can see that the explanation based on SHAP’s importance values classifies many background pixels as important, hence resulting in a large explanation. By contrast, Tarantula top-ranks the pixels that belong to the raccoon’s head (and are presumably the most important for correct classification), resulting in a much smaller explanation. On this image, Ochiai and Zoltar produce similar explanations (better than SHAP, but worse than Tarantula), and Wong-II, while localizing a part of the raccoon’s image, gives a high ranking to more background pixels than any of the other SBFL measures.
Another observation that is illustrated well by Figure 3 and that holds for almost all images in our evaluation, is that explanations based on SHAP’s importance values tend to resemble low-resolution variants of the original images. They consist of sets of pixels spread across the entire image, and include a lot of background. By contrast, our explanations focus on one area that is crucial for classifying the image.
Generation of adversarial examples
Adversarial examples are a major safety concern for DNNs. An adversarial example is defined to be a perturbed input image that is a) very close to an image in the data set and that is b) classified with a different label by the DNN. In this section, we use adversarial examples as a proxy to compare the effectiveness of spectrum-based explanations and SHAP. In particular, following the ordering of pixels according to their ranking (from the SBE approach or SHAP), we change the original image pixel by pixel starting from the top-ranked ones until the DNN changes its classification, i.e., an adversarial example is found. We limit the number of changed pixels to a fixed budget. We then record the number of pixels changed (normalized over the total number of pixels) in the adversarial example. In our setup, changing a pixel means assigning it black color.
There is a significant body of research dedicated to the efficient generation of adversarial examples [22, 23, 9], and we do not attempt to compete with the existing specialised methods. Notably, adversarial examples can be generated by changing a single pixel only. In our setup, the changes to pixels are inherently pessimistic (in other words, there might be another color that leads to a more efficient generation of adversarial examples). We remind the reader that our framework for generating adversarial examples is solely used as a proxy to assess the quality of explanations of Protozoa and SHAP.
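The ranking-guided search for an adversarial example used as a proxy here can be sketched as below. The flat-list image representation, the function name, the `max_changes` budget parameter and the toy classifier are assumptions made for this illustration; blackening is modeled by writing the value 0.

```python
def adversarial_by_ranking(image, ranked_pixels, classify, max_changes,
                           black=0):
    """Blacken pixels in ranking order until the label changes.

    Returns the perturbed image and the fraction of pixels changed,
    or (None, None) if the budget is exhausted without a label change.
    """
    original_label = classify(image)
    perturbed = list(image)
    for changed, idx in enumerate(ranked_pixels[:max_changes], start=1):
        perturbed[idx] = black
        if classify(perturbed) != original_label:
            return perturbed, changed / len(image)
    return None, None


# Toy stand-in for a DNN: the label depends only on total brightness.
def toy_classify(img):
    return "bright" if sum(img) >= 6 else "dark"


image = [5, 1, 4, 0]
adv, fraction_changed = adversarial_by_ranking(
    image, [0, 2, 1, 3], toy_classify, max_changes=2)
```

A ranking that places the truly decisive pixels first flips the label after fewer changes, which is the sense in which this procedure serves as a proxy for ranking quality.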
Figure 4 provides the comparison between the SBE measures and SHAP for guiding the generation of adversarial examples. Figure 5 gives an example of this method: for the input image of a salt shaker, we provide the adversarial examples generated by the SBFL measures and by SHAP. It is easy to see that, while all generated images still look like a salt shaker, the one generated using the Tarantula ranking produces an adversarial example that is closest to the original image.
On average, a DNN’s output can be changed by modifying a very small number of pixels in the image, much smaller than the number of pixels needed for a correct classification of the image. This explains why the gap between different approaches in Figure 4 is significantly smaller than in Figure 2. Yet, it is still clear that SHAP’s ranking yields a much lower number of adversarial examples than almost all SBFL measures, and hence also Protozoa. Moreover, no single measure consistently outperforms the others on all input images (note that the performance of Wong-II significantly degrades after changing more than of pixels); hence Protozoa, which chooses the smallest modification for each image, outperforms all individual measures significantly.
The experiment shows that the ranking computed by the SBFL measures is more efficient than the one computed by SHAP for guiding the generation of adversarial examples. This result is consistent with the results in Figure 2.
V-D Tuning the parameters in Algorithm 2
In this section we study the effect of changing the parameters in Algorithm 2, and, specifically, the size of the set $T(x)$ of mutant images and the parameters $\sigma$ and $\epsilon$ that are used for generating passing and failing mutants. We show that, as expected, the quality of explanations improves with a bigger set of tests $T(x)$; however, changing the balance between the passing and the failing mutants in $T(x)$ does not seem to have a significant effect on the results.
We conduct two experiments. In the first experiment, we study the effect of changing the size of by computing the ranking using the different mutant sets. In the original setup, . We generate a smaller set of size , and we compare the explanations obtained when using to the ones obtained when using . In Figure 6, we show the average size of the explanations for different SBFL measures and sets of mutant images of size and .
As expected, the quality of SBEs improves, meaning they have fewer pixels, when more test inputs are used as spectra in Algorithm 1. This suggests that the effort of using a large set of test inputs is rewarded with a high quality of the generated explanations for the decisions of the DNN. We remark that this observation is hardly surprising, and is consistent with prior experience applying spectrum-based fault localization measures to traditional software.
In Figure 7 we record the running time of Protozoa for different and compare it to the running time of SHAP. The running time of Protozoa is separated into two parts: the time taken for the execution of the test set (Algorithm 2) and the time taken for the subsequent computation of the ranked list and extracting an explanation (Algorithm 1). It is easy to see that almost the whole execution time of Protozoa is dedicated to the execution of . When comparing the explanation extraction only, Protozoa is more efficient than SHAP. Hence, if the set is computed in advance or is given to Protozoa as an input, the computation of SBE is very lightweight. Another alternative for improving the running time is to first execute Protozoa with a small set (of tests), and to generate a large only if the explanation is low quality.
When SBFL measures are applied to software, the quality of the ranking is known to depend on the balance between passing and failing traces in the test suite. In our setting, this is the balance between the tests labeled with “$y$” and with “$\neg y$” in $T(x)$. That balance is controlled by the parameters $\sigma$ and $\epsilon$. We test the dependence of the quality of SBEs on this balance between the tests directly by designing the following two types of test suites (both of the same size):
the “Type-$y$” kind of $T(x)$ is generated by adding an additional set of tests annotated with “$y$”; and
the “Type-not-$y$” kind of $T(x)$ is generated by adding an additional set of tests annotated with “$\neg y$”.
Thus, instead of relying on $\sigma$ and $\epsilon$ to provide a balanced set of tests, we tip the balance off intentionally. We then run Protozoa with these two types of biased sets of tests.
Figure 8 gives the sizes of explanations for the two types of sets of tests. It is easy to see that the Protozoa algorithm is remarkably robust with respect to the balance between the different types of tests in $T(x)$ (as the columns are of roughly equal height). Again, Wong-II stands out and appears to be more sensitive to the ratio of failing/passing tests in $T(x)$.
V-E Using explanations to assess the progress of training of DNNs
An important use-case of explanations of DNN outputs is assessing the adequacy of training of the DNN. To demonstrate this, we have trained a DNN on the CIFAR-10 data set. We apply Protozoa after each iteration of the training process to the intermediate DNN model. In Figure 9 we showcase some representative results at different stages of the training.
Overall, as the training procedure progresses, explanations of the DNN’s decisions focus more on the “meaningful” part of the input image, e.g., those pixels contributing to the image (see, for example, the progress of the training reflected in the explanations of DNN’s classification of the first image as a “cat”). This result reflects that the DNN is being trained to learn features of different classes of inputs. Interestingly, we also observed that the DNN’s feature learning is not always monotonic, as demonstrated in the bottom row of Figure 9: after the th iteration, explanations for the DNN’s classification of an input image as an “airplane” drift from intuitive parts of the input towards pixels that may not fit human interpretation (we repeated the experiments multiple times to minimize the uncertainty because of the randomization in our SBE algorithm).
The explanations generated by Protozoa may thus be useful for assessing the adequacy of the DNN training; they may make it possible to check whether the DNN is aligned with the developer’s intent when training the neural network. The explanations can also be used as a stopping condition for the training process: training is finished when the explanations align with our intuition.
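One way such a stopping condition could be made operational is sketched below. It is an assumption for illustration only: the paper judges alignment by human intuition, whereas this sketch substitutes a hand-annotated region of "meaningful" pixels and an overlap score. The names `alignment` and `should_stop`, the threshold and the patience window are all hypothetical. The patience window is included because explanation quality was observed not to improve monotonically during training.

```python
def alignment(explanation, region):
    """Fraction of explanation pixels that lie inside the annotated region.

    Both arguments are iterables of (row, col) pixel coordinates.
    """
    explanation = set(explanation)
    if not explanation:
        return 0.0
    return len(explanation & set(region)) / len(explanation)


def should_stop(scores, threshold=0.8, patience=2):
    """Stop once the most recent `patience` scores all reach the threshold."""
    if len(scores) < patience:
        return False
    return all(score >= threshold for score in scores[-patience:])


# Hypothetical annotated object region: a 4x4 block of an image.
region = {(r, c) for r in range(2, 6) for c in range(2, 6)}

# Hypothetical explanations produced after successive training iterations.
explanations = [
    {(0, 0), (0, 1), (2, 2), (7, 7)},   # mostly background
    {(2, 2), (3, 3), (4, 4), (0, 0)},   # mostly object
    {(2, 2), (3, 3), (4, 4), (5, 5)},   # entirely object
    {(2, 3), (3, 4), (4, 5), (5, 2)},   # entirely object
]

history = []
for iteration, explanation in enumerate(explanations, start=1):
    history.append(alignment(explanation, region))
    if should_stop(history):
        print("stop after iteration", iteration)
        break
```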
[Figure 9: explanations at different stages of training. Columns: Original, It. 1, It. 5, It. 10, It. 20.]
VI Threats to Validity
Lack of ground truth
When evaluating the generated explanations, there is no ground truth to compare against. Instead, we use two proxies: the size of the explanation and the effort required for generating adversarial examples.
Selection of the dataset
In this paper, we focus on the image recognition problem for high-resolution color images and collect most of the experimental results using the ImageNet data set. Small benchmarks and other problems may have their own features that differ from what we report in this paper. It is known that, in traditional software, the performance of different spectrum-based measures can vary dramatically depending on the benchmark used. SHAP has also been applied to DNNs with non-image input, which we do not evaluate here.
Selection of SBFL measures
We have only evaluated four spectrum-based measures (Ochiai, Zoltar, Tarantula and Wong-II). There are hundreds more such measures, which may reveal new observations.
Selection of parameters when generating test inputs
When generating the test suite , we empirically configure the parameters in the test generation algorithm. The choice of parameters affects the results of the evaluation and they may be overfitted.
Adversarial example generation algorithm
There is a variety of methods to generate adversarial examples, including sophisticated optimization algorithms. Instead, as a proxy to evaluate the effectiveness of explanations from Protozoa and SHAP, we adopt a simple method that blacks out selected pixels of the original image. A more sophisticated algorithm might yield different results, and might favor the explanations generated by SHAP.
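The blackout proxy can be sketched as follows: pixels are set to black in the order given by a ranking until the classifier changes its output, and the number of pixels needed is reported. This is a minimal illustration under stated assumptions, not the evaluation code. The function `pixels_to_flip`, the toy threshold classifier `classify`, the 3x3 image and both rankings are hypothetical stand-ins for a DNN, an input image and the rankings produced by an explanation method.

```python
def pixels_to_flip(image, ranking, classify, black=0):
    """Black out pixels in ranked order until the label changes.

    Returns the number of pixels blacked out, or None if the label never
    changes. The input image is left unmodified.
    """
    original_label = classify(image)
    masked = [row[:] for row in image]  # work on a copy
    for count, (r, c) in enumerate(ranking, start=1):
        masked[r][c] = black
        if classify(masked) != original_label:
            return count
    return None


def classify(image):
    """Toy stand-in for a DNN: a brightness threshold on the pixel sum."""
    return "bright" if sum(map(sum, image)) >= 10 else "dark"


image = [[9, 1, 1],
         [1, 8, 1],
         [1, 1, 1]]

# A ranking that puts the two decisive pixels first, and one that does not.
good_ranking = [(0, 0), (1, 1), (0, 1), (0, 2)]
poor_ranking = [(2, 2), (2, 1), (2, 0), (1, 2), (0, 0), (1, 1)]

print(pixels_to_flip(image, good_ranking, classify))
print(pixels_to_flip(image, poor_ranking, classify))
```

A ranking that places the decisive pixels first needs fewer blackouts to change the label, which is the sense in which this count serves as a proxy for explanation quality.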
VII Related Work
This work connects two seemingly distinct research topics: spectrum-based fault localization and explainable AI. We briefly summarize related research from software engineering and machine learning.
Explanation or interpretation of deep learning is a very active area in machine learning. Trained models are explained by visualising hidden neurons, ranking input dimensions, and other methods. LIME  interprets model predictions by locally approximating the model around a given prediction. Based on LIME and a few other methods [73, 5, 67, 65], SHAP  suggests a general additive model and defines the importance value of each pixel as its Shapley value in a multi-player cooperative game, where each player represents a pixel. SHAP is the most up-to-date ranking method, and we compare it with our method in this paper. The core of our proposed SBE method is also pixel ranking; however, our ranking uses the suspiciousness score computed by SBFL measures.
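For readers unfamiliar with Shapley values, the sketch below computes them exactly for a tiny cooperative game by averaging each player's marginal contribution over all coalitions of the other players. It illustrates the game-theoretic definition only and is not the SHAP implementation, which approximates these values. The function `shapley_values`, the three "pixel" players and the toy value function are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values; `value` maps a frozenset of players to a number."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):  # k = size of the coalition without p
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                coalition = frozenset(subset)
                total += weight * (value(coalition | {p}) - value(coalition))
        phi[p] = total
    return phi


def toy_value(coalition):
    """Toy game: the 'classification' succeeds only if pixels a and b are both present."""
    return 1.0 if {"a", "b"} <= coalition else 0.0


phi = shapley_values(["a", "b", "c"], toy_value)
print(phi)
```

The enumeration visits every subset of the other players, so the cost grows exponentially with the number of players. This is why exact computation is infeasible for images with many pixels.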
Our SBE approach is closely related to fault localization. Besides the spectrum-based fault localization [54, 47, 49, 82, 86, 59, 34, 64, 17, 83, 20, 55, 35, 84, 41] discussed earlier in this paper, there are a large number of further fault localization methods, e.g., model based [6, 15], slice based , interaction driven [45, 31], similarity aware , semantic fault localization  and a number of others [37, 85, 89, 81]. As indicated in work like [91, 48, 80, 43], there may be merit in combining different fault localization methods. Meanwhile, work like [29, 42, 8, 14] focuses on better constructing or optimizing the test suite for fault localization. Fault localization is related to automating fixes for programs [7, 46]. Genetic programming can be used to improve SBFL [69, 11].
Software engineering for AI
There is a broad body of work on applying software engineering research to deep learning, and this paper is aligned with this idea. Among them, DeepFault  is closest to ours. It applies spectrum-based fault localization to identify suspicious neurons that are responsible for inadequate DNN results; it is a white-box method and it was only tested on small benchmarks such as MNIST  and CIFAR-10 . In , symbolic execution is used to find important pixels for the MNIST data set.
In [58, 51, 71], multiple structural test coverage criteria have been proposed for DNNs with the goal to support testing. The growing safety concern in deep learning based autonomous systems drives the work on testing methods for the DNN component [76, 87, 40]. A number of testing approaches have been adopted for testing DNN models [53, 52, 72, 77, 79, 78, 36]. The most popular library to implement DNNs is TensorFlow and an empirical study of TensorFlow program bugs can be found in .
This paper advocates the application of spectrum-based fault localization for the generation of explanations of the output of neural networks. We have implemented this idea in the tool Protozoa. Experimental results using advanced deep learning models and comparing our tool with SHAP confirm that our spectrum-based approach to explanations is able to deliver good explanations for DNN’s outputs at low computational cost.
This work can be extended in numerous directions. We have demonstrated that applying well-known spectrum-based fault localization measures is useful for providing explanations of the output of a DNN. As these measures are based on statistical correlation, it may be the case that measures that work well for fault localization in general software are inadequate for DNNs; it may be worthwhile to investigate new measures that are specialized for DNNs. Our work uses random mutation to produce the set of test inputs. While this is an efficient approach to generate the test inputs, it may be possible to obtain better explanations by using a more sophisticated method for generating the test inputs; one option is a coverage-guided fuzzer such as AFL.
Another direction is to extend the explanation-based approach to other areas where DNNs are used. While the type of explanations we construct in this paper works well for images, it might be that for other input types other techniques will be more suitable. Finally, as our algorithm is agnostic to the structure of the DNN, it applies immediately to DNNs with state, say recurrent neural networks for video processing. Future work could benchmark the quality of explanations generated for this use case.
-  External Links: Cited by: §V-B.
-  (2016) TensorFlow: a system for large-scale machine learning. In OSDI, Vol. 16, pp. 265–283. Cited by: §VII.
-  Ethics guidelines for trustworthy AI. Cited by: §I.
-  (2011) Fault-localization using dynamic slicing and change impact analysis. In Proceedings of the 2011 26th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 520–523. Cited by: §VII.
-  (2015) On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE 10 (7). Cited by: §VII.
-  (2017) Fast test suite-driven model-based fault localisation with application to pinpointing defects in student programs. Software & Systems Modeling, pp. 1–27. Cited by: §VII.
-  (2017) Where is the bug and how is it fixed? An experiment with practitioners. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, pp. 117–128. Cited by: §VII.
-  (2013) Entropy-based test generation for improved fault localization. In Proceedings of the 28th IEEE/ACM International Conference on Automated Software Engineering, pp. 257–267. Cited by: §VII.
-  (2017) Towards evaluating the robustness of neural networks. In Security and Privacy (SP), IEEE Symposium on, pp. 39–57. Cited by: §V-C.
-  Defining explanation in probabilistic systems. In UAI ’97: Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence, pp. 62–71. Cited by: §III.
-  (2018) Learning fault localisation for both humans and machines using Multi-Objective GP. In Proceedings of the 10th International Symposium on Search Based Software Engineering, SSBSE 2018, pp. 349–355. Cited by: §VII.
-  (2017) Xception: deep learning with depthwise separable convolutions. In , pp. 1251–1258. Cited by: §V-A.
-  (2019) Semantic fault localization and suspiciousness ranking. In 25th International Conference on Tools and Algorithms for the Construction and Analysis of Systems, Vol. 25. Cited by: §VII.
-  (2018) Reduce before you localize: delta-debugging and spectrum-based fault localization. In 2018 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW), pp. 184–191. Cited by: §VII.
-  (2017) A method to localize faults in concurrent C programs. Journal of Systems and Software 132, pp. 336–352. Cited by: §VII.
-  (2016) Algorithmic transparency via quantitative input influence: theory and experiments with learning systems. In 2016 IEEE symposium on security and privacy (SP), pp. 598–617. Cited by: §I.
-  (2016) Test set diameter: quantifying the diversity of sets of test cases. In International Conference on Software Testing, Verification and Validation (ICST), pp. 223–233. Cited by: §I, §II-B, §VII.
-  (1988) Knowledge in flux. MIT Press. Cited by: §III.
-  (2019) DeepFault: fault localization for deep neural networks. In 22nd International Conference on Fundamental Approaches to Software Engineering. Cited by: §VII.
-  (2007) Automatic error detection techniques based on dynamic invariants. M.S. Thesis, Delft University of Technology, The Netherlands. Cited by: §I, 2b, 2b, §II-B, §VII.
-  (2016) Deep learning. MIT Press. Cited by: §I.
-  (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. Cited by: §V-C.
-  (2017-03) Attacking machine learning with adversarial examples. OpenAI Blog. Cited by: §V-C.
-  (2018) Symbolic execution for deep neural networks. arXiv preprint arXiv:1807.10439. Cited by: §VII.
-  (2017) Explainable artificial intelligence (XAI) – program information. Defense Advanced Research Projects Agency. Note: https://www.darpa.mil/program/explainable-artificial-intelligence Cited by: §III.
-  (2017) Explainable artificial intelligence (XAI). Defense Advanced Research Projects Agency (DARPA), nd Web. Cited by: §I.
-  (2005) Causes and explanations: a structural-model approach. Part I: causes. 56 (4). Cited by: §III.
-  (2005) Causes and explanations: a structural-model approach. Part II: explanations. 56 (4). Cited by: §III.
-  (2010) Test input reduction for result inspection to facilitate fault localization. Automated software engineering 17 (1), pp. 5. Cited by: §VII.
-  (2008) On similarity-awareness in testing-based fault localization. Automated Software Engineering 15 (2), pp. 207–249. Cited by: §VII.
-  (2009) Interactive fault localization using test information. Journal of Computer Science and Technology 24 (5), pp. 962–974. Cited by: §VII.
-  (1965) Aspects of scientific explanation. Free Press. Cited by: §III.
-  Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §V-A, §V-C.
-  (2011) On practical adequate test suites for integrated test case prioritization and fault localization. In 11th International Conference on Quality Software, pp. 21–30. Cited by: §I, §II-B, §VII.
-  (2005) Empirical evaluation of the Tarantula automatic fault-localization technique. In Proceedings of the 20th IEEE/ACM international Conference on Automated software engineering, pp. 273–282. Cited by: §I, 2c, 2c, §II-B, §VII.
-  (2018) Guiding deep learning system testing using surprise adequacy. arXiv preprint arXiv:1808.08444. Cited by: §VII.
-  (2016) Practitioners’ expectations on automated fault localization. In Proceedings of the 25th International Symposium on Software Testing and Analysis, pp. 165–176. Cited by: §VII.
-  (2009) Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: §V-E, §VII.
-  (2016) Understanding the fatal Tesla accident on autopilot and the NHTSA probe. Electrek, July 1. Cited by: §I.
-  (2018) Design automation for intelligent automotive systems. In 2018 IEEE International Test Conference (ITC), pp. 1–10. Cited by: §VII.
-  (2015) Evaluation of measures for statistical fault localisation and an optimising scheme. In International Conference on Fundamental Approaches to Software Engineering, pp. 115–129. Cited by: §I, §II-B, §II-B, §VII.
-  (2018) Optimising spectrum based fault localisation for single fault programs using specifications.. In FASE, pp. 246–263. Cited by: §VII.
-  (2015) Information retrieval and spectrum based bug localization: better together. In Foundations of Software Engineering, pp. 579–590. Cited by: §VII.
-  (2010) MNIST handwritten digit database. Note: http://yann.lecun.com/exdb/mnist/ Cited by: §VII.
-  (2016) Iterative user-driven fault localization. In Haifa Verification Conference, pp. 82–98. Cited by: §VII.
-  (2019) You cannot fix what you cannot find! An investigation of fault localization bias in benchmarking automated program repair systems. In Proceedings of the 12th IEEE International Conference on Software Testing, Verification and Validation, Cited by: §VII.
-  (2010) Comprehensive evaluation of association measures for fault localization. In 2010 IEEE International Conference on Software Maintenance, pp. 1–10. Cited by: §I, §II-B, §VII.
-  (2014) Fusion fault localizers. In Proceedings of the 29th ACM/IEEE International Conference on Automated Software Engineering, pp. 127–138. Cited by: §VII.
-  (2014) Extended comprehensive study of association measures for fault localization. Journal of software: Evolution and Process 26 (2), pp. 172–219. Cited by: §I, §II-B, §II-B, §VII.
-  (2017) A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 4765–4774. Cited by: §I, §III, §VII.
-  (2018) DeepGauge: comprehensive and multi-granularity testing criteria for gauging the robustness of deep learning systems. In Automated Software Engineering (ASE), pp. 120–131. Cited by: §VII.
-  (2018) DeepMutation: mutation testing of deep learning systems. In Software Reliability Engineering, IEEE 29th International Symposium on, Cited by: §VII.
-  (2018) Combinatorial testing for deep learning systems. arXiv preprint arXiv:1806.07723. Cited by: §VII.
-  (2011) A model for spectra-based software diagnosis. ACM Transactions on software engineering and methodology (TOSEM) 20 (3), pp. 11. Cited by: §I, §II-B, §II-B, §VII.
-  (1957) Zoogeographic studies on the soleoid fishes found in Japan and its neighbouring regions. Bulletin of Japanese Society of Scientific Fisheries 22, pp. 526–530. Cited by: §I, 2a, 2a, §II-B, §VII.
-  (2018) The building blocks of interpretability. Distill. Note: https://distill.pub/2018/building-blocks Cited by: §I.
-  (1988) Probabilistic reasoning in intelligent systems. Morgan Kaufmann. Cited by: §III.
-  (2017) DeepXplore: automated whitebox testing of deep learning systems. In Proceedings of the 26th Symposium on Operating Systems Principles, pp. 1–18. Cited by: §VII.
-  (2017) A test-suite diagnosability metric for spectrum-based fault localization approaches. In Proceedings of the 39th International Conference on Software Engineering, pp. 654–664. Cited by: §I, §II-B, §VII.
-  (2019) Machine behaviour. Nature 568 (7753), pp. 477. Cited by: §I.
-  (2016) Why should I trust you? Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144. Cited by: §I, §VII.
-  (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252. Cited by: §V-A.
-  (1989) Four decades of scientific explanation. University of Minnesota Press. Cited by: §III.
-  (2009) Lightweight fault-localization using multiple coverage types. In Proceedings of the 31st International Conference on Software Engineering, pp. 56–66. Cited by: §I, §II-B, §VII.
-  (2016) Grad-CAM: why did you say that? Visual explanations from deep networks via gradient-based localization. In NIPS 2016 Workshop on Interpretable Machine Learning in Complex Systems, Cited by: §VII.
-  (2017) Learning important features through propagating activation differences. In Proceedings of the 34th International Conference on Machine Learning, Vol. 70, pp. 3145–3153. Cited by: §I.
-  (2017) Learning important features through propagating activation differences. In Proceedings of Machine Learning Research 70:3145-3153, Cited by: §VII.
-  (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §V-A.
-  (2017-07) FLUCCS: using code and change metrics to improve fault localisation. In Proceedings of International Symposium on Software Testing and Analysis, ISSTA 2017, pp. 273–283. Cited by: §VII.
-  One pixel attack for fooling deep neural networks. IEEE Transactions on Evolutionary Computation. Cited by: §V-C.
-  (2018) Testing deep neural networks. arXiv preprint arXiv:1803.04792. Cited by: §VII.
-  (2018) Concolic testing for deep neural networks. In Automated Software Engineering (ASE), 33rd IEEE/ACM International Conference on, Cited by: §VII.
-  (2017) Axiomatic attribution for deep networks. In International Conference on Machine Learning, Cited by: §VII.
-  (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826. Cited by: §V-A.
-  (2014) Intriguing properties of neural networks. In ICLR, Cited by: §V-C.
-  (2018) DeepTest: automated testing of deep-neural-network-driven autonomous cars. In Proceedings of the 40th International Conference on Software Engineering, pp. 303–314. Cited by: §VII.
-  (2018) Automated directed fairness testing. In Automated Software Engineering (ASE), 33rd IEEE/ACM International Conference on, Cited by: §VII.
-  (2019) Adversarial sample detection for deep neural network through model mutation testing. In Proceedings of the 41st International Conference on Software Engineering, Cited by: §VII.
-  (2018) Detecting adversarial samples for deep neural networks through mutation testing. arXiv preprint arXiv:1805.05010. Cited by: §VII.
-  (2014) Version history, similar report, and structure: putting them together for improved bug localization. In Proceedings of the 22nd International Conference on Program Comprehension, pp. 53–63. Cited by: §VII.
-  (2014) Boosting bug-report-oriented fault localization with segmentation and stack-trace analysis. In 2014 IEEE International Conference on Software Maintenance and Evolution, pp. 181–190. Cited by: §VII.
-  (2010) A family of code coverage-based heuristics for effective fault localization. Journal of Systems and Software 83 (2), pp. 188–208. Cited by: §I, §II-B, §VII.
-  (2016) A survey on software fault localization. IEEE Transactions on Software Engineering 42 (8), pp. 707–740. Cited by: §I, §II-B, §II-B, §VII.
-  (2007) Effective fault localization using code coverage. In 31st Annual International Computer Software and Applications Conference (COMPSAC 2007), Vol. 1, pp. 449–456. Cited by: §I, 2d, 2d, §II-B, §VII.
-  (2016) Automated debugging considered harmful: a user study revisiting the usefulness of spectra-based fault localization techniques with professionals using real bugs from large systems. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 267–278. Cited by: §VII.
-  (2017) A theoretical analysis on cloning the failed test cases to improve spectrum-based fault localization. Journal of Systems and Software 129, pp. 35–57. Cited by: §I, §II-B, §VII.
-  (2018) DeepRoad: GAN-based metamorphic autonomous driving system testing. In Automated Software Engineering (ASE), 33rd IEEE/ACM International Conference on, Cited by: §VII.
-  (2018) An empirical study on TensorFlow program bugs. In Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis, Cited by: §VII.
-  (2012) Where should the bugs be fixed? more accurate information retrieval-based bug localization based on bug reports. In 2012 34th International Conference on Software Engineering (ICSE), pp. 14–24. Cited by: §VII.
-  (2016) A Google self-driving car caused a crash for the first time. The Verge. Cited by: §I.
-  (2019) An empirical study of fault localization families and their combinations. IEEE Transactions on Software Engineering. Cited by: §II-B, §VII.