Self-explaining AI as an alternative to interpretable AI

02/12/2020
by   Daniel C. Elton, et al.

The ability to explain decisions made by AI systems is highly sought after, especially in domains where human lives are at stake, such as medicine or autonomous vehicles. While it is always possible to approximate the input-output relations of deep neural networks with human-understandable rules, the discovery of the double descent phenomenon suggests that no such approximation will ever map onto the actual functioning of deep neural networks. Double descent indicates that deep neural networks typically operate by smoothly interpolating between data points rather than by extracting a few high level rules. As a result, neural networks trained on complex real world data are inherently hard to interpret and prone to failure if used outside their domain of applicability. To show how we might be able to trust AI despite these problems, we introduce the concept of self-explaining AI. Self-explaining AIs are capable of providing a human-understandable explanation of each decision along with confidence levels for both the decision and the explanation. Some difficulties with this approach, along with possible solutions, are sketched. Finally, we argue it is also important that AI systems warn their users when they are asked to perform outside their domain of applicability.


1 Introduction

There is growing interest in developing methods to explain deep neural network function, especially in high risk areas such as medicine and driverless cars. Such explanations would be useful for ensuring that deep neural networks follow known rules and for troubleshooting failures. Despite much work in the area of model interpretation, the techniques that have been developed all have major flaws, often leading to much confusion regarding their use [35, 26]. Even more troubling is the emerging understanding that deep neural networks function through the interpolation of data points rather than extrapolation [18]. This calls into question long-held narratives about deep neural networks “extracting” high level features and rules, and suggests that current methods of explanation will always fall short of explaining how deep neural networks actually work.

In response to the difficulties raised by explaining black box models, Rudin argues for developing better interpretable models instead, claiming that the “interpretability-accuracy” trade-off is a myth. While it is true that the notion of such a trade-off is not rigorously grounded, empirically in many domains the state-of-the-art systems are all deep neural networks. For instance, most state-of-the-art AI systems for computer vision are not interpretable in the sense required by Rudin. Even highly distilled and/or compressed models which achieve good performance on ImageNet require at least 100,000 free parameters [24]. Moreover, the human brain also appears to be an overfit “black box” which performs interpolation, which means that how we understand brain function also needs to change [18]. If evolution settled on a model (the brain) which is uninterpretable, then we should expect advanced AIs to be of that type as well. Interestingly, although the human brain is a “black box”, we are able to trust each other. Part of this trust comes from our ability to “explain” our decision making in terms which make sense to us. Crucially, for trust to occur we must believe that a person is not being deliberately deceptive, and that their verbal explanations actually map onto the processes used in their brain to arrive at their decisions.

Motivated by how trust works between humans, in this work we explore the idea of self-explaining AIs. Self-explaining AIs yield two outputs - the decision and an explanation of that decision. This idea is not new; it was pursued in expert systems research in the 1980s [40]. More recently, Kulesza et al. introduced a model which offers explanations and studied how such models allow for “explainable debugging” and iterative refinement [21]. However, in that work they restrict themselves to a simple interpretable model (a multinomial naive Bayes classifier). Here we explore how to create trustworthy self-explaining AI for deep neural networks of arbitrary complexity.

After defining key terms, we discuss the challenge of interpreting deep neural networks raised by recent studies on generalization in deep learning. Then we discuss how self-explaining AIs might be built. We argue that they should include at least three components - a measure of mutual information between the explanation and the decision, uncertainty estimates for both the explanation and the decision, and a “warning system” which alerts the user when a decision falls outside the system’s domain of applicability. We hope this work will inspire further research in this area which will ultimately lead to more trustworthy AI.
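To make these three components concrete, the sketch below shows, in Python, the kind of output bundle a self-explaining AI might return. The class and field names are our own illustration, not part of any existing API.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SelfExplainingOutput:
    """Illustrative (hypothetical) output of a self-explaining AI."""
    decision: str                  # e.g. "malignant"
    decision_confidence: float     # calibrated probability or credible interval
    explanation: List[str]         # human-understandable attributes behind the decision
    explanation_confidence: float  # uncertainty on the explanation itself
    in_domain: bool                # False -> input falls outside the applicability domain
```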

1.1 Interpretation, explanation, and self-explanation

As has been discussed at length elsewhere, different practitioners understand the term “interpretability” in different ways, leading to much confusion on the subject (for detailed reviews, see [26], [1], or [29]). The related term “explainability” is typically used synonymously [35], although some have tried to draw a distinction between the two [22]. Here we take explanation/explainability and interpretation/interpretability to be synonymous. Murdoch et al. define an explanation as a verbal account of neural network function which is descriptively accurate and relevant [29]. By “descriptively accurate” they mean that the interpretation reproduces a large number of the input-output mappings of the model. The explanation may or may not map onto how the model works internally, and any explanation will be an approximation, with the acceptable degree of approximation varying by application. By “relevant” they mean that what counts as a relevant explanation is domain specific - it must be cast in terminology that is both understandable and relevant to users. For deep neural networks, the two desiderata of accuracy and relevance appear to be in tension - as we try to accurately explain the details of how a deep neural network interpolates, we move further from what may be considered relevant to the user.

This definition of explanation in terms of input-output mappings contrasts with a second meaning of the term, which we may call mechanistic explanation. Mechanistic explanations abstract faithfully (but approximately) the actual data transformations occurring in the model. To see why mechanistic explanations can be useful, consider a deep learning model we trained recently to segment the L1 vertebra [9]. The way a radiologist identifies the L1 vertebra is by scanning down from the top of the body and finding the last vertebra that has ribs attached to it, which is T12; L1 is directly below T12. In our experience our models for identifying L1 tend to be brittle, indicating they probably use a different approach. For instance, they may do something like “locate the bright object in the middle of the image” or “locate the bright object which is just above the kidneys”. Such heuristics would not be as robust as the technique used by radiologists. If a self-explaining AI or AGI could describe human anatomy at a high level as a radiologist can, this would go a long way towards engendering trust.

There is another type of explanation we wish to discuss which we may call meta-level explanation. Richard P. Feynman said “What I cannot create, I do not understand”. Since we can create deep neural networks, we do understand them, in the sense of Feynman, and therefore we can explain them in terms of how we build them. More specifically, we can explain neural network function in terms of four components necessary for creating them - data, network architecture, learning rules, and objective functions [33]. The way one explains deep neural network function from data, architecture, and training is analogous to how one explains animal behaviour using the theory of evolution. The evolution of architectures by “graduate student descent” and the explicit addition of inductive biases mirrors the evolution of organisms. Similarly, the training of architectures mirrors classical conditioning of animals as they get older. The explanation of animal behaviour in terms of meta-level theories like evolution and classical conditioning has proven to be enormously successful and stands in contrast to attempts to seek detailed mechanistic accounts.

Finally, the oft-used term black box also warrants discussion. The term is technically a misnomer, since the precise workings of deep networks are fully transparent from their source code and network weights, and therefore for the sake of rigor it should not be used. A further point is that even if we did not have access to the source code or weights (for instance for intellectual property reasons, or because the relevant technical expertise is missing), it is likely that a large amount of information about the network’s function could be gleaned through careful study of its input-output relations. Developing mathematically rigorous techniques for “shining lights” into “black boxes” was a popular topic in early cybernetics research [2], and this subject is attracting renewed interest in the era of deep learning. As an example of what is achievable, it has recently been shown that the weights of ReLU networks can be inferred through careful analysis of their input-output relations [34]. One way of designing a “self-explaining AI” would be to imbue the AI with the power to probe its own input-output relations so it can warn its user when it may be making an error and (ideally) also distill its functioning into a human-understandable format.

1.2 Why deep neural networks are inherently non-interpretable

Many methods for interpretation of deep neural networks have been developed, such as sensitivity analysis (also called “saliency maps”), iterative mapping [6], “distilling” a neural network into a simpler model [11], exploring failure modes and adversarial examples [13, 15], visualizing filters in CNNs [43], activation maximization based visualizations [10], influence functions [20], Shapley values [27], Local Interpretable Model-agnostic Explanations (LIME) [32], DeepLIFT [38], explanatory graphs [45], and layerwise relevance propagation [3]. Yet all of these methods capture only particular aspects of neural network function, and their outputs are very easy to misinterpret [35, 23, 42]. Many of these methods are also unstable and not robust to small changes [8, 42]. Deep neural networks are here to stay, however, and we expect them to become even more complex and inscrutable as time goes on. As explained in detail by Lillicrap & Kording [24], attempts to compress deep neural networks into simpler interpretable models with equivalent accuracy are doomed to fail when working with complex real world data such as images or human language. If the world is messy and complex, then neural networks trained on real world data will also be messy and complex.

On top of these issues, there is a more fundamental problem with explaining deep neural network function. For some years now it has been noted that deep neural networks have enormous capacity and seem to be vastly underdetermined, yet they still generalize. This was shown very starkly in 2016, when Zhang et al. showed that deep neural networks can memorize random labels on ImageNet images [44]. More recently it has been shown that deep neural networks operate in a regime where the bias-variance trade-off no longer applies [4]. As network capacity increases, test error first bottoms out and then starts to increase, but then (surprisingly) starts to decrease again after a particular capacity threshold is reached. Belkin et al. call this the “double descent phenomenon” [4]; it was also noted in an earlier paper by Spigler et al. [39], who argue the phenomenon is analogous to the “jamming transition” found in the physics of granular materials. Double descent appears to be universal across machine learning [4, 5], although its presence can be masked by common practices such as early stopping [4, 30], which may explain why it took so long to be discovered.

In the regime where deep neural networks operate, they not only interpolate each training data point, but do so in a “direct” or “robust” way [18]. This means that the interpolation does not exhibit the overshoot or undershoot typical of overfit models; rather, it is a smooth, almost piecewise interpolation between the data points. Interpolation also brings with it a corollary - such networks cannot extrapolate. The fact that deep neural networks cannot extrapolate calls into question popular ideas that they “extract” high level features and “discover” regularities in the world. In reality, deep neural networks are “dumb” - any regularities that they appear to have captured internally are solely due to the data that was fed to them, rather than the result of a self-directed “regularity extraction” process.

1.3 How can we trust a self-explaining AI’s explanation?

In his landmark 2014 book Superintelligence: Paths, Dangers, Strategies, Nick Bostrom notes that highly advanced AIs may be incentivized to deceive their creators until a point where they exhibit a “treacherous turn” against them [7]. In the case of superintelligent or otherwise highly advanced AI, the possibility of deception is a highly non-trivial concern. Here, however, we suggest some methods by which we can trust the explanations given by present day deep neural networks, such as typical convolutional neural networks or transformer language models. Whether these methods will retain their utility for future AI systems is an open question.

To show how we might create trust, we focus on an explicit and relatively simple example. Shen et al. [37] and later LaLonde et al. [22] have both proposed deep neural networks for lung nodule classification which offer “explanations”. Both make use of a dataset where clinicians have labeled lung nodules not only by severity (cancerous vs. non-cancerous) but have also rated them (on a scale of 1-5) on a set of visual attributes deemed relevant for diagnosis (subtlety, sphericity, margin, lobulation, spiculation, and texture). While the details of the proposed networks differ greatly, both output predictions for severity as well as scores for each of the visual attributes. Both authors claim that the visual attribute predictions “explain” the diagnostic prediction, since the diagnostic branch and the visual attribute prediction branch(es) are connected near the base of the network. However, no evidence is presented that the visual attribute predictions are in any way related to the diagnosis prediction. While it may seem intuitive that the two output branches must be related, this must be rigorously shown for trustworthiness to hold. Non-intuitive behaviours have repeatedly been demonstrated in deep neural networks; for instance, it has been shown that commonly used networks based on rectified linear units (ReLUs) contain large “linear regions” in which the power of the network’s non-linearity is not utilized [17, 16]. Indeed, even state-of-the-art models likely contain many unused or redundant connections, as evidenced by the repeated success of model compression techniques applied to state-of-the-art image classifiers, so the output activations of the last layer shared by both branches could be computed in a largely independent manner. Additionally, even if the visual attributes were used, no weights are provided for the importance of each attribute to the prediction, and there may be other attributes or features of equal or greater importance that are used but not among those outputted. This weakness is acknowledged by Shen et al., who point out that there are a multitude of other relevant features which are not outputted, most notably location in the body, which is strongly associated with malignancy [37].
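As a rough illustration of the shared-base, two-branch design described above (not the actual architecture of Shen et al. or LaLonde et al.), consider the following PyTorch sketch. Both heads read the same latent vector, yet nothing in the architecture itself guarantees that the attribute scores inform the diagnosis; the layer sizes are arbitrary placeholders.

```python
import torch
import torch.nn as nn

class DualHeadNoduleNet(nn.Module):
    """Toy two-headed classifier: a shared encoder feeds a diagnosis head
    and a visual-attribute head."""
    def __init__(self, n_attributes: int = 6):
        super().__init__()
        self.encoder = nn.Sequential(            # shared base producing latent vector z
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(16 * 4 * 4, 64), nn.ReLU(),
        )
        self.diagnosis_head = nn.Linear(64, 1)             # malignancy score d
        self.attribute_head = nn.Linear(64, n_attributes)  # attribute scores a

    def forward(self, x):
        z = self.encoder(x)                      # latent vector shared by both branches
        return self.diagnosis_head(z), self.attribute_head(z), z
```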

Figure 1: Sketch of a simple self-explaining AI system. Optional components are shown with dashed lines.

We would like to show that the attributes used in the explanation and the diagnosis output are related. This may be done by looking at the layer where the diagnosis and explanation branches diverge. There are many ways of quantifying the relatedness of two variables, the Pearson correlation being one of the simplest, but also one of the least useful in this context since it is only sensitive to linear relationships. A measure which is sensitive to non-linear relationships and which has a nice theoretical interpretation is the mutual information. For two random variables X and Y it is defined as:

    I(X;Y) = H(X) + H(Y) - H(X,Y)     (1)

where H(·) is the Shannon entropy. One can also define a mutual information correlation coefficient [25]:

    r_MI(X,Y) = sqrt(1 - exp(-2 I(X;Y)))     (2)

This coefficient has the nice property that it reduces to the Pearson correlation in the case that p(X,Y) is a Gaussian with non-zero covariance. The chief difficulty of applying mutual information is that the underlying probability distributions p(X,Y), p(X), and p(Y) all have to be estimated. Various techniques exist for doing this, however, such as kernel density estimation with Parzen windows [41]. Note that this sort of approach should not be taken as quantifying “information flows” in the network: since the output of units is continuous, the amount of information which can flow through the network is infinite (for a discussion of this point and of how to recover the concept of “information flow” in neural networks, see [14]). What we are doing here is measuring the mutual information over the particular data distribution used.
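As a minimal sketch, the coefficient in equation (2) can be estimated with an off-the-shelf estimator; here we use scikit-learn's k-nearest-neighbour mutual information estimator rather than a Parzen-window density estimate.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def mi_correlation(x: np.ndarray, y: np.ndarray) -> float:
    """Estimate r_MI(X,Y) = sqrt(1 - exp(-2 I(X;Y))) for two 1-D variables."""
    mi = mutual_info_regression(x.reshape(-1, 1), y)[0]  # I(X;Y) in nats (k-NN estimate)
    return float(np.sqrt(1.0 - np.exp(-2.0 * mi)))

rng = np.random.default_rng(0)
x = rng.normal(size=2000)
print(mi_correlation(x, x**2 + 0.1 * rng.normal(size=2000)))  # high: non-linear dependence Pearson would miss
print(mi_correlation(x, rng.normal(size=2000)))               # near zero: independent variables
```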

Suppose the latent vector at this layer is denoted by z and has length N. Denote the diagnosis output of the network as d and the vector of attributes as a. Then for a particular attribute a_i in our explanation word set we calculate the following to obtain a “relatedness” score between the attribute and the diagnosis:

(3)
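As a minimal sketch of how such a score could be computed (a simplified stand-in built from the mutual information correlation coefficient above, not necessarily the exact form of equation (3)), one could average over the latent units the product of the attribute-unit and diagnosis-unit correlations:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def relatedness(z: np.ndarray, a_i: np.ndarray, d: np.ndarray) -> float:
    """Illustrative relatedness score between attribute a_i and diagnosis d
    computed through the shared latent vectors z (shape: n_samples x N)."""
    rho = lambda mi: np.sqrt(1.0 - np.exp(-2.0 * mi))  # MI correlation coefficient
    mi_a = mutual_info_regression(z, a_i)  # I(z_j; a_i) for each latent unit j
    mi_d = mutual_info_regression(z, d)    # I(z_j; d) for each latent unit j
    return float(np.mean(rho(mi_a) * rho(mi_d)))
```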

An alternative (and perhaps complementary) method would be to train a surrogate (“post-hoc”) model to predict the diagnosis from the attributes (also shown in Figure 1). We can learn two things from this surrogate model. First, if the surrogate is not as accurate as the diagnosis branch of the main model, then we know the main model is using additional features. Second, we can change or scramble a particular attribute and see whether the surrogate’s output changes, on average.
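A minimal sketch of this surrogate check, assuming binary diagnosis labels, a gradient-boosted surrogate, and attribute scores collected from the network on some evaluation set (in practice one would fit and evaluate the surrogate on separate splits):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

def surrogate_check(attrs: np.ndarray, diag: np.ndarray) -> None:
    """Fit a post-hoc surrogate predicting the diagnosis from the attribute
    outputs, then scramble each attribute to gauge its influence."""
    rng = np.random.default_rng(0)
    surrogate = GradientBoostingClassifier().fit(attrs, diag)
    base_auc = roc_auc_score(diag, surrogate.predict_proba(attrs)[:, 1])
    for j in range(attrs.shape[1]):
        scrambled = attrs.copy()
        scrambled[:, j] = rng.permutation(scrambled[:, j])
        auc = roc_auc_score(diag, surrogate.predict_proba(scrambled)[:, 1])
        print(f"attribute {j}: AUC {base_auc:.3f} -> {auc:.3f}")  # large drop = influential attribute
```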

1.4 Ensuring robustness through applicability domains and uncertainty analysis

The concept of an “applicability domain”, or the domain where a model makes good predictions, is well known in the area of molecular modeling known as quantitative structure-property relationships (QSPR), and a number of techniques have been developed for quantifying it (for a review, see [36] or [31]). However, the practice of quantifying the applicability domain of models has not become widespread in other areas where machine learning is applied. A simple way of defining the applicability domain is to calculate the convex hull of the latent vectors of all training data points. If the latent vector of a test data point falls on or outside the convex hull, then the model should send an alert saying that the test point falls outside the domain it was trained on.
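A minimal sketch of such a check: membership in the convex hull can be tested with a small feasibility linear program (does some convex combination of the training latents reproduce the test latent?), which avoids constructing the hull explicitly in high dimensions. With very many training points or very high-dimensional latents, approximate methods may be preferable.

```python
import numpy as np
from scipy.optimize import linprog

def in_applicability_domain(x: np.ndarray, train_latents: np.ndarray) -> bool:
    """Return True if latent vector x lies within the convex hull of the
    training latents (train_latents has shape n_points x latent_dim)."""
    n = train_latents.shape[0]
    # Feasibility LP: find lambda >= 0 with sum(lambda) = 1 and
    # train_latents.T @ lambda = x.
    A_eq = np.vstack([train_latents.T, np.ones((1, n))])
    b_eq = np.concatenate([x, [1.0]])
    res = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * n)
    return bool(res.success)
```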

Finally, models should contain measures of uncertainty for both their decisions and their explanations. Ideally, this should be done in a Bayesian way using a Bayesian neural network [19]. With the continued progress of Moore’s law, training Bayesian CNNs [28] is becoming feasible, and in our view this is a worthwhile use of additional CPU/GPU cycles. There are also approximate methods; for instance, it has been shown that random dropout during inference can be used to estimate uncertainties at little extra computational cost [12]. Just as reporting experimental error bars is standard throughout science, and just as we would not trust a doctor who could not also give a confidence level for their diagnosis, uncertainty quantification should be standard practice in AI research.
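A minimal sketch of the dropout-based approach [12] for a generic PyTorch model containing dropout layers (this assumes the model’s other layers tolerate train mode; in practice one would enable only the dropout modules):

```python
import torch
import torch.nn as nn

def mc_dropout_predict(model: nn.Module, x: torch.Tensor, n_samples: int = 50):
    """Approximate the predictive mean and uncertainty by keeping dropout
    active at inference time and averaging repeated stochastic forward passes."""
    model.train()  # keeps nn.Dropout stochastic (note: also affects BatchNorm)
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])
    return preds.mean(dim=0), preds.std(dim=0)  # per-output mean and std. dev.
```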

1.5 Conclusion

We argued that deep neural networks trained on complex real world data are very difficult to interpret because their power arises from brute-force interpolation over big data rather than from the extraction of high level generalizable rules. Motivated by this, and by the need for trust in AI systems, we introduced the concept of self-explaining AI. We described how a simple self-explaining AI would function for diagnosing medical images such as chest X-rays or CT scans. To build trust, we showed how a mutual information metric can be used to verify that the explanation given is related to the diagnostic output. Crucially, in addition to an explanation, a self-explaining AI outputs confidence levels for both the decision and the explanation, further aiding our ability to gauge the trustworthiness of any given diagnosis. Finally, we argued that an applicability domain analysis should be performed for AI systems where robustness and trust are important, and that systems should alert the user when they are asked to work outside their domain of applicability.

2 Funding & disclaimer

No funding sources were used in the creation of this work. The author (Dr. Daniel C. Elton) wrote this article in his personal capacity. The views expressed are his own and do not necessarily represent the views of the National Institutes of Health or the United States Government.

References