With the growing success of complex predictors, and their resulting expanding reach into high-stakes and decision-critical applications, wringing explanations out of these models has become a central problem in artificial intelligence (AI). Countless methods have recently been proposed to produce such explanations (simonyan2013deep; selvaraju2016grad; Ribeiro2016Why; lundberg2017unified; alvarez-melis2017causal), yet there is no consensus on what precisely makes an explanation of an algorithmic prediction good or useful. Meanwhile, what it means to explain and how humans do it are questions that have long been studied in philosophy and cognitive science. Since the end goal of explainable AI is to explain to humans, this literature seems an appropriate starting point when looking for principles upon which a theory of machine interpretability might rest.
While the debate on the nature of (human) explanation is far from settled, various fundamental principles arise across theoretical frameworks. For example, at the core of Van Fraassen’s (vanfraassen1988pragmatic) and Lipton’s (lipton1990contrastive) theories of explanation is the hypothesis that we tend to explain in contrastive terms (e.g., “fever is more consistent with pneumonia than with a common cold”), focusing on both factual and counterfactual explanations (e.g., “had this patient had chest pressure too, the diagnosis would have instead been bronchitis”). On the other hand, both of Hempel’s models of explanation (hempel1962deductive) are characterized by sequences of simple premises, reflecting the fact that humans usually explain using multiple simple accumulative statements, each one addressing a few aspects of the evidence (e.g., “presence of fever rules out cold in favor of bronchitis or pneumonia, and among these two, the presence of chills suggests the latter”). These and other fundamental principles have been observed across disciplines in the social sciences. In a recent survey of over 250 papers in philosophy, psychology and cognitive science on explanation, miller2019explanation identifies contrastiveness and selectivity (i.e., that only a few possible cases are presented) as two major properties of the way humans explain things, which he argues are important for explainable AI yet currently under-appreciated.
These principles are often missing in popular explainable AI frameworks. Their explanations consist of saliency or attribution scores that are absolute (i.e., non-contrastive, focused only on the predicted outcome), purely factual (i.e., based only on aspects present in the input, ignoring counterfactuals), and monolithic (i.e., simultaneously explicating all input features). On the other hand, those that provide probabilistic explanations often present only posterior probabilities, which conflate class priors (also known as base rates) with per-class likelihoods, and which humans are notoriously bad at reasoning about (tversky1974judgment; bar-hillel1980base; eddy1982probabilistic; koehler1996base). Furthermore, as Miller (miller2019explanation) and others argue, attribution is an important but incomplete part of the entire process of human explanation. This view of explanation as a process rather than (only) a product is crucial for understanding the discrepancy between current approaches to automated interpretability and the way humans explain.
In this work, we lay out a general framework for interpretability that aims to reconcile this discrepancy. The starting point of this approach, and our first contribution, is a set of intuitive desiderata that we argue are crucial for bringing machine explanations closer to their human counterparts. With these considerations at hand, we develop a mathematical framework to realize them. At the core of this framework is the concept of weight of evidence from information theory, which we show provides a suitable theoretical foundation to the often elusive notion of model interpretability [1]. After introducing this concept, we extend it beyond its original formulation to account for the type of settings in machine learning where interpretability is most needed (e.g., high-dimensional, multi-class prediction). We provide a generic meta-algorithm to produce explanations based on the weight of evidence, and show its instantiation on simple proof-of-concept experimental settings.
[1] While the use of weight of evidence for algorithmic explainability has previously been advocated by David Spiegelhalter (e.g., in his keynote talk at NeurIPS 2018 (spiegelhalter2018neurips)), to the best of our knowledge it has not yet been instantiated or investigated in the context of complex machine learning models.
Some of the shortcomings of machine explanations highlighted here have been individually tackled in prior work. For example, recent work seeks to move from absolute to contrastive or counterfactual explanations (wachter2017counterfactual; miller2018contrastive; vanderwaa2018contrastive), partly inspired by earlier approaches on contrast set mining (Azevedo2010Rules; Bay1999Detecting; webb2003detecting; Novak2009Supervised). On the other hand, while most saliency-based methods produce dense high-dimensional attributions, explanations supported on a sparse set of input features are a much-touted benefit of classic (model-based) interpretability, such as decision trees (quinlan1986induction) and decision sets (Lakkaraju2016Interpretable). Recent work has also explored improving interpretability by explaining on higher-level concepts (e.g., super-pixels or patterns in an image) rather than raw inputs (kim2018interpretability; alvarez-melis2018towards). Our approach shares motivation with many of these works, but differs substantially in how the salient features are selected and scored.
2 Desiderata for Human-Oriented Explanations
The first step towards defining any method for explainable machine learning should be to define its goal with precision, i.e., what is an explanation? For this, we draw on basic principles and terminology from epistemology and philosophy of science. In its most abstract form, an explanation is an answer to a why-question (hempel1948studies; vanfraassen1988pragmatic) consisting of two main components: the explanandum, the description of a phenomenon to be explained; and the explanans, that which gives the explanation of the phenomenon (hempel1948studies). Different ways to formalize the explanans have given rise to various theories of explanation; an excellent historical overview of these can be found in the surveys by pitt1988theories and miller2019explanation. For the purposes of this work, our definition of interpretability follows that of biran2017explanation and miller2019explanation: the degree to which an observer can understand the cause of a decision.
In the context of machine learning, we are usually interested in explanations of predictive models. For a predictor $f$ that takes inputs $x$ drawn according to some distribution and produces outputs $y$, we seek an explanation for $f(x) = y$, that is, “why did model $f$ predict $y$ on input $x$?” Here, we are primarily interested in probabilistic (or more generally, soft) predictors, which cover a wide range of machine learning methods. Namely, we consider models that, rather than producing a single prediction $y$, return a predictive posterior distribution $P(y \mid x)$. Furthermore, to allow for more general explananda (e.g., why was this subset of outcomes ruled out?), we take inspiration from hypothesis testing and consider complex hypotheses of the form $h: y \in S$ for a subset $S$ of the possible outcomes, and, slightly abusing notation, denote the posterior as $P(h \mid x)$.
Having described what type of explanandum we consider in this work, we must now characterize the explanans we seek. The first such consideration pertains to the causes or evidence that define the “vocabulary” from which the explanans is constructed. Popular explanation-based interpretability methods rely directly on the raw inputs $x$. Likewise, we initially consider evidence of the form $X_i = x_i$ (the value of an individual input feature), but will later generalize to broader attributes (e.g., subsets of features). Including higher-level representations of the input (kim2018interpretability; alvarez-melis2018towards) or even aspects of the model itself (e.g., parameters) as evidence are natural extensions that we leave for future work.
Having formalized its ingredients, we now discuss what properties the explanans should have. Recall that our objective is to devise machine interpretability methods that are intelligible to humans. Our survey of literature on explanations above highlighted various aspects that characterize human explanations, but which most current machine explanations lack. Based on these, we propose a set of desiderata for bridging the gap between the former and the latter. Namely, explanations should:
be contrastive, i.e., answer the question “why did the model predict $y$ instead of $y'$?”.
be modular and compositional, which is particularly important whenever the relations between inputs and outputs/predictions are complex – precisely when interpretability is most needed.
not confound base rates with input likelihood, i.e., while important for fully understanding a classifier, base rates should be presented separately from input relevance towards the predictions.
be exhaustive, i.e., they should explicate why every other alternative was not predicted.
be minimal, i.e., all things being equal the simpler of two explanations should be preferred.
Next, we propose an interpretability framework based on the weight of evidence —a basic but fundamental concept from information theory— that satisfies all of the desiderata above.
3 Explaining with the Weight of Evidence
3.1 Weight of Evidence: from Information Theory to Bayesian Statistics
The weight of evidence (WoE) is an information-theoretic approach to analyze variable effects in prediction models (good1950probability; good1968corroboration; good1985weight). Although originally defined in terms of log-odds (see supplement), the weight of evidence for a hypothesis $h$ in the presence of evidence $e$ can be conveniently defined as $\operatorname{woe}(h : e) = \log \frac{P(e \mid h)}{P(e \mid \bar{h})}$, where $\bar{h}$ denotes the complement of $h$. The interpretation of this quantity is simple. If $\operatorname{woe}(h : e) > 0$, then $h$ is more likely under $e$ than marginally, i.e., the evidence speaks in favor of hypothesis $h$. Analogously, $\operatorname{woe}(h : e) < 0$ indicates that $h$ is less likely when taking into account the evidence than without it.
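The WoE can be conditioned on additional information $c$: $\operatorname{woe}(h : e \mid c) = \log \frac{P(e \mid h, c)}{P(e \mid \bar{h}, c)}$, and can be computed relative to an arbitrary alternative hypothesis $h'$ (i.e., not necessarily the complement): $\operatorname{woe}(h / h' : e) = \log \frac{P(e \mid h)}{P(e \mid h')}$. Thus, we can in general talk about the evidence in favor of $h$ and against $h'$ provided by $e$ (and perhaps conditioned on $c$). Further properties and an axiomatic derivation of WoE are provided in the supplement. An appealing aspect of the WoE is its immediate connection to Bayes’ rule. For this, consider the binary classification setting, i.e., $h: y = 1$, $\bar{h}: y = 0$, and $e: x$. Simple algebraic manipulation of the definition of WoE yields:
$$\log \frac{P(y = 1 \mid x)}{P(y = 0 \mid x)} = \log \frac{P(y = 1)}{P(y = 0)} + \operatorname{woe}(y = 1 : x) \tag{1}$$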
This provides another useful interpretation of the WoE in classification: a positive (negative, resp.) WoE implies that the posterior log-odds (of $y = 1$ over $y = 0$) are higher (lower) than the base log-odds, showing that $x$ (the evidence) speaks in favor of (against) the hypothesis $y = 1$.
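The algebra behind this decomposition is short enough to spell out. Writing $h$ for the hypothesis, $\bar{h}$ for its complement, and $e$ for the evidence (symbols chosen here for illustration), applying Bayes' rule to the numerator and denominator of the posterior odds gives:

```latex
\begin{align*}
\log\frac{P(h \mid e)}{P(\bar{h} \mid e)}
  &= \log\frac{P(e \mid h)\,P(h)\,/\,P(e)}{P(e \mid \bar{h})\,P(\bar{h})\,/\,P(e)} \\
  &= \underbrace{\log\frac{P(h)}{P(\bar{h})}}_{\text{base log-odds}}
   + \underbrace{\log\frac{P(e \mid h)}{P(e \mid \bar{h})}}_{\text{weight of evidence}}
\end{align*}
```

The marginal $P(e)$ cancels, so the posterior log-odds split into a term that depends only on the class priors and a term that depends only on the likelihoods.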
Besides being intuitive and well-understood, the WoE provides an appealing framework for machine interpretability because it immediately satisfies three of the interpretability desiderata introduced in the previous section: it is naturally contrastive ($\operatorname{woe}(h / h' : e)$ quantifies the evidence in favor of $h$ against $h'$), it decouples base log-odds from variable importance (Eq. (1)), and it admits a modular decomposition (Eq. (5) in the supplement). We later show how the remaining two desiderata (exhaustiveness and minimality) can be met.
3.2 Sequential Explanations: Explaining High-Dimensional Multi-Class Classifiers
The weight of evidence has mostly been used in simple settings, such as a single binary outcome variable $y$ and a single input variable $x$. Its use in the (typically more complex) settings considered in machine learning poses various challenges. First, in multi-class classification one must choose the contrast hypotheses $h$ and $h'$. The trivial choice of letting $h$ be the predicted class and $h'$ its complement is unlikely to yield interpretable explanations when the number of classes is very large (e.g., explaining the evidence in favor of one disease against 999 other possibilities). To address this, we take inspiration from Hempel’s model (hempel1962deductive) and propose to cast explanation as a sequential process, whereby a subset of the possible outcomes is ruled out in each step. For example, in medical diagnosis this could correspond to first explaining why bacterial diseases were ruled out in favor of viral ones, then contrasting between viral families, and finally between the predicted disease and similar alternatives. In general, we consider explanantia consisting of nested hypotheses $h_1 \supset h_2 \supset \dots \supset h_T$ (sets of classes, ending with the predicted class alone), which imply contrastive tests of each $h_t$ against the classes $h_{t-1} \setminus h_t$ ruled out at that step.
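To make the contrast between sets of classes concrete, the sketch below computes the WoE of one class set against another using only the predictive posterior and the class priors. It relies on the identity that the WoE equals the posterior log-odds minus the prior log-odds; the function name `woe_sets`, the three-class diagnosis setting, and all probabilities are made up for illustration and are not taken from the experiments.

```python
import math

def woe_sets(posterior, prior, keep, against):
    """WoE of the class set `keep` against the class set `against`.

    Uses woe(h/h' : x) = log[P(h|x) / P(h'|x)] - log[P(h) / P(h')],
    i.e. posterior log-odds minus prior log-odds.
    """
    post_odds = sum(posterior[c] for c in keep) / sum(posterior[c] for c in against)
    prior_odds = sum(prior[c] for c in keep) / sum(prior[c] for c in against)
    return math.log(post_odds) - math.log(prior_odds)

# Hypothetical class priors and model posterior for one patient.
prior = {"cold": 0.6, "bronchitis": 0.3, "pneumonia": 0.1}
posterior = {"cold": 0.05, "bronchitis": 0.25, "pneumonia": 0.70}

# Step 1: why bronchitis-or-pneumonia rather than cold?
step1 = woe_sets(posterior, prior, ["bronchitis", "pneumonia"], ["cold"])
# Step 2: among the two remaining classes, why pneumonia rather than bronchitis?
step2 = woe_sets(posterior, prior, ["pneumonia"], ["bronchitis"])

print(f"step 1 WoE: {step1:.3f}")
print(f"step 2 WoE: {step2:.3f}")
```

Both scores are positive here, meaning the input speaks in favor of the kept classes at each step, independently of how rare those classes are a priori.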
A second challenge in using WoE for complex prediction tasks arises from the size of the input. While the decomposition formula (Eq. (5)) allows us to produce individual scores for each feature, for high-dimensional inputs (such as images or detailed health records), providing a WoE score for every single feature simultaneously will rarely be informative. Thus, we propose grouping the inputs into attributes (e.g., super-pixels for images or groups of related symptoms for medical diagnosis). Formally, we partition the set of $n$ input features into $m$ subsets: $\{1, \dots, n\} = A_1 \cup \dots \cup A_m$.
Given these two extensions of the WoE, we propose a simple meta-algorithm for generating explanations for classifiers. At every step, a subset of the classes is selected to keep (the rest are ruled out), and the WoE of the kept classes against the ruled-out ones is computed using the decomposition formula (5). The user is presented with only the most relevant attributes (cf. desideratum 5) according to their WoE (e.g., using the rule-of-thumb threshold of good1985weight), in addition to the base log-odds. This process continues until all classes except the predicted one have been “ruled out” (desideratum 4). It is important to note that unless the predictor is generative, and not black-box, this process requires estimating the conditional likelihoods of the attributes given the classes. A discussion on estimation, in addition to pseudo-code for this method (Algo. 1) and details about its implementation, is provided in the supplement.
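The sequential procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Algorithm 1 of the supplement: it only selects the contrast sets and reports total WoE and base log-odds (no per-attribute scores), it enumerates candidate subsets exhaustively (feasible only for a handful of classes), and the penalty `lam * |len(keep) - len(against)|` is a hypothetical stand-in for the cardinality-based regularizer. All probabilities are made up and assumed strictly positive.

```python
import math
from itertools import combinations

def woe_sets(posterior, prior, keep, against):
    """WoE of class set `keep` against `against`: posterior minus prior log-odds."""
    post_odds = sum(posterior[c] for c in keep) / sum(posterior[c] for c in against)
    prior_odds = sum(prior[c] for c in keep) / sum(prior[c] for c in against)
    return math.log(post_odds) - math.log(prior_odds)

def sequential_explanation(posterior, prior, lam=0.5):
    """Rule out classes step by step until only the predicted class remains."""
    predicted = max(posterior, key=posterior.get)
    remaining = sorted(posterior)
    steps = []
    while len(remaining) > 1:
        others = [c for c in remaining if c != predicted]
        best = None
        # Candidate sets to keep: the predicted class plus any proper subset
        # of the other remaining classes, so at least one class is ruled out.
        for k in range(len(others)):
            for extra in combinations(others, k):
                keep = (predicted,) + extra
                against = tuple(c for c in remaining if c not in keep)
                woe = woe_sets(posterior, prior, keep, against)
                # Hypothetical regularizer favoring even splits.
                score = woe - lam * abs(len(keep) - len(against))
                if best is None or score > best[0]:
                    best = (score, keep, against, woe)
        _, keep, against, woe = best
        base = math.log(sum(prior[c] for c in keep) / sum(prior[c] for c in against))
        steps.append({"kept": keep, "ruled_out": against,
                      "base_log_odds": base, "woe": woe})
        remaining = list(keep)
    return steps

prior = {"cold": 0.6, "bronchitis": 0.3, "pneumonia": 0.1}
posterior = {"cold": 0.05, "bronchitis": 0.25, "pneumonia": 0.70}
steps = sequential_explanation(posterior, prior)
for s in steps:
    print(s)
```

On this toy input the procedure first rules out cold and then bronchitis, mirroring the nested structure of the running diagnosis example. Each step reports the base log-odds separately from the WoE, so the two are never conflated.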
4 Experiments
We first illustrate our framework in a simplified setting with exact WoE computation (i.e., without estimation) using a Gaussian Naive Bayes classifier, which intrinsically computes the per-feature class-conditional likelihoods as part of its prediction rule. We use the Wisconsin Breast Cancer dataset, grouping the 30 scalar-valued features into 10 attributes according to their type (e.g., the mean, standard error, and worst value of area form the attribute area, etc.). In the example in Fig. 1 (left), the model predicts malignant despite initial log-odds, radius, and area speaking slightly against it, because the cell’s concavity and compactness attributes speak very strongly in favor of malignancy.
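To illustrate why a Gaussian Naive Bayes model gives exact WoE scores, the sketch below computes per-attribute WoE from per-feature Gaussian parameters. Under the Naive Bayes independence assumption, the WoE of an attribute is simply the sum of per-feature log-likelihood ratios, regardless of attribute order. The feature values, Gaussian parameters, priors, and attribute names are invented for illustration; this is not the Wisconsin Breast Cancer data.

```python
import math

def gaussian_logpdf(x, mean, std):
    """Log-density of a univariate Gaussian."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

def attribute_woe(x, params_h, params_alt, attributes):
    """Per-attribute WoE of hypothesis h against the alternative.

    params_h / params_alt: one (mean, std) pair per feature, for each class.
    attributes: mapping from attribute name to the feature indices it groups.
    """
    return {
        name: sum(gaussian_logpdf(x[i], *params_h[i])
                  - gaussian_logpdf(x[i], *params_alt[i]) for i in idxs)
        for name, idxs in attributes.items()
    }

# Hypothetical fitted parameters for two classes over four features.
params_malignant = [(1.2, 0.5), (2.5, 1.0), (0.0, 1.0), (3.0, 0.8)]
params_benign = [(0.5, 0.5), (1.0, 1.0), (0.4, 1.0), (1.0, 0.8)]
prior_malignant, prior_benign = 0.4, 0.6
attributes = {"size": [0, 1], "shape": [2, 3]}

x = [1.0, 2.0, 0.5, 3.0]
scores = attribute_woe(x, params_malignant, params_benign, attributes)
base_log_odds = math.log(prior_malignant / prior_benign)

# Posterior log-odds computed directly from the Naive Bayes joint.
joint_m = math.log(prior_malignant) + sum(
    gaussian_logpdf(xi, *p) for xi, p in zip(x, params_malignant))
joint_b = math.log(prior_benign) + sum(
    gaussian_logpdf(xi, *p) for xi, p in zip(x, params_benign))
posterior_log_odds = joint_m - joint_b

print("base log-odds:", round(base_log_odds, 3))
for name, value in scores.items():
    print(f"woe[{name}]:", round(value, 3))
print("posterior log-odds:", round(posterior_log_odds, 3))
```

The base log-odds plus the attribute scores recover the posterior log-odds exactly, which is the additive reading that the explanation presents to the user.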
In our second experiment, we use our framework to explain the predictions of a black-box neural-net MNIST classifier, estimating conditional probabilities via a masked autoregressive flow (MAF) model (papamakarios2017masked), using square super-pixels as attributes. For the example explanation in Fig. 1 (right), the strong evidence in favor of classes 9 and 4 (against 3 or 7) clearly corresponds to parts of the image which would be uncharacteristic for examples of the latter classes.
5 Discussion and Extensions
We have proposed a set of desiderata for bridging the gap between the type of explanations provided by humans and current interpretability methods, and a promising framework to realize them based on the weight of evidence. The application of this concept to complex machine learning problems brings about various challenges, some of which we addressed here (high-dimensional inputs and multi-class classification), but many of which remain, such as estimation of WoE scores for black-box models, selection of contrast hypotheses, and attribute design. Furthermore, since the ultimate beneficiary of these explanations is a human, the effect of the proposed solutions to these challenges, and of all algorithmic choices, should be validated and compared through human evaluation.
Appendix A Desiderata for Interpretability in Further Detail
Our driving motivation in this work is to devise machine interpretability methods that emulate the way humans explain. In the introduction, we listed some aspects that characterize human explanations (and which most current machine explanations lack). Based on these, we proposed in Section 2 a set of desiderata for interpretability aimed at bridging the gap between the former and the latter. We discuss them here in more detail.
D1. Explanations should be contrastive.
As mentioned before, several authors have proposed (and validated) that humans tend to explain in contrastive terms. To more faithfully emulate human cognition, machine explanations should be contrastive too. That is, the prototypical explanandum should not be “why did the model predict $y$?”, but rather, “why did the model predict $y$ instead of $y'$?”. Despite how self-evident this might be, note that most current explanation methods are not contrastive. Instead, they explain the model’s prediction absolutely, leaving the contrast case undetermined or implicitly assuming it to be the complement of the predicted class [2].
[2] For example, an explanation for a prediction of “9” by a digit classifier is to be interpreted as “why 9 and not any other digit?”
D2. Explanations should be modular and compositional.
Interpretability is most needed in applications where the inputs, outputs or the causal relations between them are complex (and therefore so is any interesting statistical model whose goal is to predict these). Yet, in these settings most explainable AI methods produce a single, high-dimensional static explanation for any given prediction (e.g., a heatmap for an image classifier). These are often hard to analyze and draw conclusions from, particularly for non-expert users. In addition, this form factor again differs from the manner in which humans tend to explain (hempel1962deductive): using various simple premises. Thus, instead of a single monolithic explanans, we seek a set of simple sub-clauses, each explicating a different aspect of the input-output predictive phenomenon. Clearly, this modularity introduces a trade-off between the number of clauses in the explanans (too many clauses might be difficult to coherently analyze simultaneously) and their relative complexity (small clauses are easier to reason about, but more of these might be required to explain a complex predictor). At a higher level, breaking up an explanation into a sequence of small components responds to our goals of moving from explanation as a product towards explanation as a process (lombrozo2012explanation) and of emulating the selective aspect of human explanation (hilton2017social; miller2019explanation).
D3. Explanations should not confound base rates with input likelihoods.
When explaining probabilistic models, any human-oriented framework for interpretability should take into account how humans understand and interpret probabilities. The psychological and cognitive science communities have long studied this topic (tversky1974judgment), showing, for example, that humans are notoriously bad at incorporating class priors when thinking about probabilities. The classic breast cancer diagnosis example due to eddy1982probabilistic showed that the majority of subjects (doctors) tended to provide estimates of posterior probabilities roughly one order of magnitude higher than the true values. This phenomenon has been attributed to a neglect of base rates during reasoning (the base-rate fallacy (bar-hillel1980base)) or, instead, to a confusion of the inverse conditional probabilities $P(A \mid B)$ and $P(B \mid A)$, one of which needs to be estimated while the other is provided (the inverse fallacy (koehler1996base)). Whatever the cause, we argue here that its effect (i.e., that humans often struggle to reason about posterior probabilities) should be taken into account. Thus, explanations should clearly separate the contribution of base rates from that of the per-class likelihoods linking inputs and predictions. We argue that while both of these are important for understanding a prediction, their very different natures (one of them dependent on the input, the other one not) necessitate different treatment. To the best of our knowledge, no currently available off-the-shelf interpretability framework provides this.
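A small numeric illustration shows how a posterior probability can hide the separate roles of base rate and evidence. The numbers below are hypothetical (they are not the figures from the cited study): a rare condition, a test that is fairly sensitive, and a moderate false-positive rate.

```python
import math

# Hypothetical diagnostic setting (illustrative numbers only).
prior = 0.01            # P(disease): the base rate
sensitivity = 0.80      # P(positive test | disease)
false_positive = 0.10   # P(positive test | no disease)

# Posterior after a positive test, by Bayes' rule.
posterior = (prior * sensitivity) / (
    prior * sensitivity + (1 - prior) * false_positive)

# The same information, split into base log-odds and weight of evidence.
base_log_odds = math.log(prior / (1 - prior))
woe = math.log(sensitivity / false_positive)
posterior_log_odds = math.log(posterior / (1 - posterior))

print(f"posterior probability: {posterior:.3f}")
print(f"base log-odds:         {base_log_odds:.3f}")
print(f"weight of evidence:    {woe:.3f}")
print(f"posterior log-odds:    {posterior_log_odds:.3f}")
```

The positive test is strong evidence for the disease (its WoE is clearly positive), yet the posterior probability remains below 10% because the base log-odds are strongly negative. Presenting the two terms separately makes both facts visible, whereas the posterior alone invites the misjudgment described above.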
D4. Explanations should be exhaustive.
A conclusive justification for a predicted hypothesis should explicate why no other alternative hypothesis was predicted. In Hempel’s terminology, we seek explanations that are complete (not to be confused with complete in the sense of goodman2006intuitive, i.e., where all variables are explained). For example, an explanation for a pneumonia diagnosis simply stating the presence of cough is non-exhaustive, since cough by itself could be indicative of various other conditions beyond pneumonia, so their being non-predicted should be justified.
D5. Explanations should be minimal.
Following the original purpose of Occam’s Razor, if two explanations are of different complexity but otherwise identical (in particular, both sufficiently explicate the prediction), the simpler of these should be preferred. Furthermore, if omitting less-relevant aspects of an explanation makes the whole more intelligible, while remaining equally faithful to the prediction being explained, then the trimmed explanation should be preferred.
Appendix B The Weight of Evidence: Properties and Axiomatic Derivation
The weight of evidence is a fundamental concept that has been introduced in many contexts [3], although it is primarily associated with I.J. Good, who popularized it through a long sequence of works (good1950probability; good1968corroboration; good1985weight). Good originally defined it as follows. For a hypothesis $h$ in the presence of evidence $e$, the weight of evidence of $e$ for $h$ is defined as
$$\operatorname{woe}(h : e) = O(h \mid e) - O(h) \tag{2}$$
where $O(\cdot)$ are the log-odds, i.e.,
$$O(h) = \log \frac{P(h)}{P(\bar{h})}, \qquad O(h \mid e) = \log \frac{P(h \mid e)}{P(\bar{h} \mid e)} \tag{3}$$
The interpretation of (2) is simple. If $\operatorname{woe}(h : e) > 0$, then $h$ is more likely under $e$ than marginally, i.e., the evidence speaks in favor of hypothesis $h$. Analogously, $\operatorname{woe}(h : e) < 0$ indicates that $h$ is less likely when taking into account the evidence than without it.
[3] The basic principle behind the weight of evidence appears in the work of both Alan Turing and Claude Shannon. However, good1985weight claims that ideas like it go back to at least peirce1878probability.
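The WoE has various desirable theoretical properties. For example, good1985weight provides an axiomatic derivation for Definition 2, showing that it is (up to a constant) the only function of $h$ and $e$ that satisfies the following properties:
$\operatorname{woe}(h : e)$ is a function of the likelihoods, i.e., $\operatorname{woe}(h : e) = f\big(P(e \mid h), P(e \mid \bar{h})\big)$
The posterior is a function of the prior and $\operatorname{woe}(h : e)$, i.e., $P(h \mid e) = g\big(P(h), \operatorname{woe}(h : e)\big)$
$\operatorname{woe}$ is additive (on the evidence) (indeed, $\operatorname{woe}(h : e_1, e_2) = \operatorname{woe}(h : e_1) + \operatorname{woe}(h : e_2 \mid e_1)$)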
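The following two properties, which are easy to prove, are crucial for our extension into complex models in Section 3.2:
$$\operatorname{woe}(h : e) = \log \frac{P(e \mid h)}{P(e \mid \bar{h})} \tag{4}$$
$$\operatorname{woe}(h : e_1, e_2) = \operatorname{woe}(h : e_1) + \operatorname{woe}(h : e_2 \mid e_1) \tag{5}$$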
The first of these provides a simple expression to compute WoE scores. The second one will prove consequential to defining an intelligible extension of WoE to high dimensional inputs.
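Both properties can be checked numerically on a small discrete example. The sketch below uses a hypothetical joint distribution over a binary hypothesis and two binary pieces of evidence (the probabilities are made up), computes WoE as a log-likelihood ratio, and verifies that the total WoE equals the chained sum for either ordering of the evidence.

```python
import math

# Hypothetical joint distribution P(h, e1, e2) over binary variables.
joint = {
    (1, 0, 0): 0.02, (1, 0, 1): 0.08, (1, 1, 0): 0.05, (1, 1, 1): 0.15,
    (0, 0, 0): 0.30, (0, 0, 1): 0.10, (0, 1, 0): 0.20, (0, 1, 1): 0.10,
}
NAMES = ("h", "e1", "e2")

def prob(**fixed):
    """Probability of the event that fixes the given variables to the given values."""
    return sum(p for key, p in joint.items()
               if all(key[NAMES.index(n)] == v for n, v in fixed.items()))

def woe(h, evidence, given=None):
    """woe(h : evidence | given) as a log-likelihood ratio of h against not-h."""
    given = given or {}
    def likelihood(h_value):
        return prob(h=h_value, **evidence, **given) / prob(h=h_value, **given)
    return math.log(likelihood(h) / likelihood(1 - h))

total = woe(1, {"e1": 1, "e2": 1})
# Chain the evidence in the order e1, then e2 ...
order_a = woe(1, {"e1": 1}) + woe(1, {"e2": 1}, given={"e1": 1})
# ... and in the reverse order e2, then e1.
order_b = woe(1, {"e2": 1}) + woe(1, {"e1": 1}, given={"e2": 1})

print(round(total, 4), round(order_a, 4), round(order_b, 4))
print("woe of e1 alone:    ", round(woe(1, {"e1": 1}), 4))
print("woe of e1 given e2: ", round(woe(1, {"e1": 1}, given={"e2": 1}), 4))
```

The total is the same for both orderings, but the score credited to each individual piece of evidence changes with its position in the chain whenever the pieces are dependent. This is why attribute ordering needs care when the decomposition is applied to high-dimensional inputs.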
Appendix C Explanations of Complex Models via the Weight of Evidence
As mentioned in the main text, using the WoE framework for complex machine learning models brings about the challenge of keeping WoE scores interpretable despite (i) high-dimensional inputs and (ii) not necessarily binary output (e.g., multi-class classification) settings.
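We address (ii) by sequentially contrasting (increasingly smaller) sets of classes (i.e., complex hypotheses), as described in the main text. Our solution for (i) in turn involves grouping the inputs into attributes, e.g., super-pixels in an image or groups of related symptoms in our running medical diagnosis example. Consider an input space of dimension $n$, i.e., the evidence $e$ corresponds now to a multivariate random variable $X = (X_1, \dots, X_n)$ taking values in $\mathcal{X} \subseteq \mathbb{R}^n$. Property (5) (or alternatively, the chain rule of probability) allows for chaining of the conditional probabilities; hence, for any partition $A_1, \dots, A_m$ of the $n$ input features into attributes, we can express the WoE of hypothesis $h$ against $h'$ as:
$$\operatorname{woe}(h / h' : x) = \sum_{i=1}^{m} \operatorname{woe}\big(h / h' : x_{A_i} \mid x_{A_1}, \dots, x_{A_{i-1}}\big) = \sum_{i=1}^{m} \log \frac{P(x_{A_i} \mid x_{A_1}, \dots, x_{A_{i-1}}, h)}{P(x_{A_i} \mid x_{A_1}, \dots, x_{A_{i-1}}, h')} \tag{6}$$
where $X_{A_i}$ denotes the subset of random variables with indices in $A_i$, i.e., $X_{A_i} = \{X_j\}_{j \in A_i}$. The full Bayes-odds explanation model now has the form
$$\log \frac{P(h \mid x)}{P(h' \mid x)} = \log \frac{P(h)}{P(h')} + \sum_{i=1}^{m} \operatorname{woe}\big(h / h' : x_{A_i} \mid x_{A_1}, \dots, x_{A_{i-1}}\big) \tag{7}$$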
Note that, in general, the order of the attributes matters in this sum. We discuss how to minimize the impact of this ordering in the next section. These two extensions lead to the meta-algorithm for WoE-based explanation shown here as Algorithm 1, which selects at each step the contrast set with maximal WoE plus a cardinality-based regularizing term to prevent too small or too large subsets from being selected. We use to encourage even partitions.
Appendix D Estimation of Weight of Evidence Scores
For any pair of entailed and contrast classes, and any partition of the input into attributes, equation (5) provides an exact method to compute the conditional weight of evidence as a sum of per-attribute WoE scores. In order to use this expression, we need to be able to compute the conditional likelihood of each attribute given the preceding attributes and the outcome set, for any order of attributes and any outcome set. In an ideal scenario (such as the Gaussian Naive Bayes classifier used in Section 4, or an auto-regressive generative model for sequential data), the prediction model itself would compute and store these values.
Unfortunately, most prediction models do not compute such probabilities explicitly, so any realistic application of the WoE methodology to interpretability must provide a fallback method to estimate these, independently, from data. Let us consider the worst-case scenario: a black-box prediction model $f$, for which we assume we only have oracle access (i.e., queries of $f(x)$), in addition to access to additional training data. In such a case, the problem essentially turns into a conditional density estimation problem, where we seek to learn models of $P(x \mid y)$ from training data in the form of pairs $(x, \hat{y})$, where $x$ is a sample from the input distribution and $\hat{y} = f(x)$ is a class label prediction obtained with the prediction model.
We propose to tackle this problem by training, as an off-line preliminary step, an auto-regressive conditional likelihood estimation model. For simple data, this could be done with classic (e.g., kernel or spectral) density estimation methods. For more complex data such as images, many recent methods have been proposed based on normalizing flows and autoregressive models (rezende2015variational; dinh2016density; papamakarios2017masked). For sequential data, the ordering of the attributes in Eq. (6) is implied by the data. For non-sequential inputs, the likelihood model should be trained to minimize the impact of ordering on the WoE scores (e.g., by training on random orderings). For the experiments on MNIST, we train a conditional Masked Autoregressive Flow (MAF) model (papamakarios2017masked), randomizing the order in which the pixel blocks (the attributes) are traversed, but keeping the order within each of these fixed (left-to-right, top-to-bottom).
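The estimation pipeline for a black-box model can be sketched in a few lines. The example below is a deliberately simplified stand-in: instead of a conditional MAF it fits one diagonal Gaussian per predicted class, which ignores dependencies between features and therefore is not autoregressive. The black-box function, the synthetic inputs, and all names are assumptions made for illustration.

```python
import math
import random

def black_box(x):
    """Stand-in for an opaque classifier; only its predictions are observed."""
    return 1 if x[0] + 0.5 * x[1] > 0 else 0

def fit_diag_gaussian(rows):
    """Per-feature (mean, std) estimates: a very simple class-conditional density."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [max(math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n), 1e-6)
            for j in range(d)]
    return list(zip(means, stds))

def log_density(x, params):
    """Log-density of x under a diagonal Gaussian with the given parameters."""
    return sum(-0.5 * math.log(2 * math.pi * s ** 2) - (xi - m) ** 2 / (2 * s ** 2)
               for xi, (m, s) in zip(x, params))

# Offline step: sample inputs, query the oracle, fit one density per predicted label.
random.seed(0)
inputs = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(2000)]
labels = [black_box(x) for x in inputs]  # model predictions, not ground truth
models = {c: fit_diag_gaussian([x for x, y in zip(inputs, labels) if y == c])
          for c in (0, 1)}

def estimated_woe(x):
    """Estimated woe(y=1 : x) as a log-ratio of the fitted class-conditional densities."""
    return log_density(x, models[1]) - log_density(x, models[0])

print(estimated_woe([2.0, 0.0]))
print(estimated_woe([-2.0, 0.0]))
```

Because the densities are fitted to the model's own predictions, the resulting scores explain the classifier rather than the underlying data. A flow-based or autoregressive estimator would replace `fit_diag_gaussian` and `log_density` while leaving the rest of the pipeline unchanged.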