A Survey Of Methods For Explaining Black Box Models

by   Riccardo Guidotti, et al.

In the last years many accurate decision support systems have been constructed as black boxes, that is as systems that hide their internal logic to the user. This lack of explanation constitutes both a practical and an ethical issue. The literature reports many approaches aimed at overcoming this crucial weakness sometimes at the cost of scarifying accuracy for interpretability. The applications in which black box decision systems can be used are various, and each approach is typically developed to provide a solution for a specific problem and, as a consequence, delineating explicitly or implicitly its own definition of interpretability and explanation. The aim of this paper is to provide a classification of the main problems addressed in the literature with respect to the notion of explanation and the type of black box system. Given a problem definition, a black box type, and a desired explanation this survey should help the researcher to find the proposals more useful for his own work. The proposed classification of approaches to open black box models should also be useful for putting the many research open questions in perspective.


Fairwashing: the risk of rationalization

Black-box explanation is the problem of explaining how a machine learnin...

TSXplain: Demystification of DNN Decisions for Time-Series using Natural Language and Statistical Features

Neural networks (NN) are considered as black-boxes due to the lack of ex...

Human Understandable Explanation Extraction for Black-box Classification Models Based on Matrix Factorization

In recent years, a number of artificial intelligent services have been d...

A Theory of Diagnostic Interpretation in Supervised Classification

Interpretable deep learning is a fundamental building block towards safe...

An Ethical Black Box for Social Robots: a draft Open Standard

This paper introduces a draft open standard for the robot equivalent of ...

Explaining Image Enhancement Black-Box Methods through a Path Planning Based Algorithm

Nowadays, image-to-image translation methods, are the state of the art f...

Polynomial Regression As an Alternative to Neural Nets

Despite the success of neural networks (NNs), there is still a concern a...

1 Introduction

The last decade has witnessed the rise of ubiquitous opaque decision systems. These black box systems exploit sophisticated machine learning models to predict individual information that may also be sensitive. We can consider credit score, insurance risk, health status, as examples. Machine learning algorithms build predictive models which are able to map user features into a class (outcome or decision) thanks to a learning phase. This learning process is made possible by the digital traces that people leave behind them while performing everyday activities (e.g., movements, purchases, comments in social networks, etc.). This enormous amount of data may contain human biases and prejudices. Thus, decision models learned on them may inherit such biases, possibly leading to unfair and wrong decisions.

The European Parliament recently adopted the General Data Protection Regulation (GDPR), which will become law in May 2018. An innovative aspect of the GDPR, which has been much debated, are the clauses on automated (algorithmic) individual decision-making, including profiling, which for the first time introduce, to some extent, a right of explanation for all individuals to obtain “meaningful explanations of the logic involved” when automated decision making takes place. Despite divergent opinions among legal scholars regarding the real scope of these clauses [goodman2016eu, wachter2017right, comande2017regulating], everybody agrees that the need for the implementation of such a principle is urgent and that it represents today a huge open scientific challenge. Without an enabling technology capable of explaining the logic of black boxes, the right to an explanation will remain a “dead letter”.

By relying on sophisticated machine learning models trained on massive datasets thanks to scalable, high-performance infrastructures, we risk to create and use decision systems that we do not really understand. This impacts not only information on ethics, but also on safety and on industrial liability. Companies increasingly market services and products by embedding machine learning components, often in safety-critical industries such as self-driving cars, robotic assistants, and personalized medicine. Another inherent risk of these components is the possibility of inadvertently making wrong decisions, learned from artifacts or spurious correlations in the training data, such as recognizing an object in a picture by the properties of the background or lighting, due to a systematic bias in training data collection. How can companies trust their products without understanding and validating the underlying rationale of their machine learning components? Gartner predicts that “by 2018 half of business ethics violations will occur through the improper use of Big Data analytics”. Explanation technologies are an immense help to companies for creating safer, more trustable products, and better managing any possible liability they may have. Likewise, the use of machine learning models in scientific research, for example in medicine, biology, socio-economic sciences, requires an explanation not only for trust and acceptance of results, but also for the sake of the openness of scientific discovery and the progress of research.

As a consequence, explanation is at the heart of a responsible, open data science, across multiple industry sectors and scientific disciplines. Different scientific communities studied the problem of explaining machine learning decision models. However, each community addresses the problem from a different perspective and provides a different meaning to

explanation. Most of the works in the literature come from the machine learning and data mining communities. The first one is mostly focused on describing how black boxes work, while the second one is more interested in explaining the decisions even without understanding the details on how the opaque decision systems work in general.

Despite the fact that interpretable machine learning has been a topic for quite some time and received recently much attention, today there are many ad-hoc scattered results, and a systematic organization and classification of these methodologies is missing. Many questions feed the papers in the literature proposing methodologies for interpreting black box systems [weller2017challenges, hofman2017prediction]: What does it mean that a model is interpretable or transparent? What is an explanation? When a model or an explanation is comprehensible? Which is the best way to provide an explanation and which kind of model is more interpretable? Which are the problems requiring interpretable models/predictions? What kind of decision data are affected? Which type of data records is more comprehensible? How much are we willing to lose in prediction accuracy to gain any form of interpretability?

We believe that a clear classification considering simultaneously all these aspects is needed to organize the body of knowledge about research investigating methodologies for opening and understanding the black box. Existing works tend to provide just a general overview of the problem [lipton2016mythos] highlighting unanswered questions and problems [doshi2017towards]. On the other hand, other works focus on particular aspects like the impact of representation formats on comprehensibility [huysmans2011empirical], or the interpretability issues in term of advantages and disadvantages of selected predictive models [freitas2014comprehensible]. Consequently, after recognizing four categories of problems and a set of ways to provide an explanation, we have chosen to group the methodologies for opening and understanding black box predictors by considering simultaneously the problem they are facing, the class of solutions proposed for the explanation, the kind of data analyzed and the type of predictor explained.

The rest of the paper is organized as follows. Firstly, in Section 3 we discuss what interpretability is. Section 2 show which are the motivations for requiring explanation for black box systems by illustrating some real cases. In Section 4 we formalize our problem definitions used to categorize the state of the art works. Details of the classification and crucial points distinguishing the various approaches and papers are discussed in Section 5. Sections 6, 7, 8 and 9 present the details of the solutions proposed. Finally, Section 10 summarizes the crucial aspects emerged from the analysis of the state of the art and discusses which are the open research questions and future research directions.

2 Needs for Interpretable Models

Which are the real problems requiring interpretable models and explainable predictions? In this section, we briefly report some cases showing how and why black boxes can be dangerous. Indeed, delegating decisions to black boxes without the possibility of an interpretation may be critical, can create discrimination and trust issues.

Training a classifier on historical datasets, reporting human decisions, could lead to the discovery of endemic preconceptions

[pedreshi2008discrimination]. Moreover, since these rules can be deeply concealed within the trained classifier, we risk to consider, maybe unconsciously, such practices and prejudices as general rules. We are warned about a growing “black box society” [pasquale2015black], governed by “secret algorithms protected by industrial secrecy, legal protections, obfuscation, so that intentional or unintentional discrimination becomes invisible and mitigation becomes impossible.”

Automated discrimination is not new and is not necessarily due to “black box” models. A computer program for screening job applicants were used during the 1970s and 1980s in St. George’s Hospital Medical School (London). The program used information from applicants’ forms, without any reference to ethnicity. However, the program was found to unfairly discriminate against ethnic minorities and women by inferring this information from surnames and place of birth, and lowering their chances of being selected for interview [lowry1988blot].

More recently, the journalists of propublica.org have shown that the COMPAS score, a predictive model for the “risk of crime recidivism” (proprietary secret of Northpointe), has a strong ethnic bias. Indeed, according to this score, a black who did not re-offend were classified as high risk twice as much as whites who did not re-offend, and white repeat offenders were classified as low risk twice as much as black repeat offenders111http://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing.

Similarly, a study at Princeton [caliskan2016semantics] shows how text and web corpora contain human biases: names that are associated with black people are found to be significantly more associated with unpleasant than with pleasant terms, compared to names associated with whites. As a consequence, the models learned on such text data for opinion or sentiment mining have a possibility of inheriting the prejudices reflected in the data.

Another example is related to Amazon.com. In 2016, the software used to determine the areas of the US to which Amazon would offer free same-day delivery, unintentionally restricted minority neighborhoods from participating in the program (often when every surrounding neighborhood was allowed)222http://www.techinsider.io/how-algorithms-can-be-racist-2016-4.

With respect to credit bureaus, it is shown in [carter2006credit] that banks providing credit scoring for millions of individuals, are often discordant: in a study of records, of consumers received credit scores that differed by at least fifty points among three major US banks (Experian, TransUnion, and Equifax). Such a difference might mean tens of thousands of dollars over the life of a mortgage. So much variability implies that the three scoring systems either have a very different and undisclosed bias, or are highly arbitrary.

As example of bias we can consider [freitas2014comprehensible] and [ribeiro2016should]. In these works, the authors show how accurate black box classifiers may result from an accidental artifact in the training data. In [freitas2014comprehensible] the military trained a classifier to recognize enemy tanks from friendly tanks. The classifier resulted in a high accuracy on the test set, but when it was used in the field had very poor performance. Later was discovered that enemy photos were taken on overcast days, while friendly photos on sunny days. Similarly, in [ribeiro2016should] is shown that a classifier trained to recognize wolves and husky dogs were basing its predictions to classify a wolf solely on the presence of snow in the background.


Deep Neural Networks (DNNs)

have been reaching very good performances on different pattern-recognition tasks such as visual and text classification which are easily performed by humans: e.g., saying that a tomato is displaced in a picture or that a text is about a certain topic. Thus, what differences remain between DNNs and humans? Despite the excellent performance of DNNs it seems to be a lot. In

[szegedy2013intriguing] it is shown the alteration of an image (e.g. of a tomato) such that the change is undetectable for humans can lead a DNN to tag the image as something else (e.g., labeling a tomato as a dog). In [nguyen2015deep] a related result is shown. It is easy to produce images that DNNs believe to be recognizable with 99.99% confidence, but which are completely unrecognizable to humans (e.g., labeling white static noise as a tomato). Similarly in [koh2017understanding] visually-indistinguishable training-set are created using DNNs and linear models. With respect to text, in [liang2017deep] effective methods to attack DNN text classifiers are presented. Experiments show that the perturbations introduced in the text are difficult to be perceived by a human but are still able to fool a state-of-the-art DNN to misclassify a text as any desirable class. These results show interesting differences between humans and DNNs, and raise reasonable doubts about trusting such black boxes. In [zhang2016understanding] it is shown how conventional regularization and small generalization error fail to explain why DNNs generalize well in practice. Specifically, they prove that established state-of-the-art CNN trained for image classification easily fits a random labeling of the training data. This phenomenon occurs even if the true images are replaced by unstructured random noise.

3 Interpretable, Explainable and Comprehensible Models

Before presenting the classification of the problems addressed in the literature with respect to black box predictors, and the corresponding solutions and models categorization, it is crucial to understand what interpretability is. Thus, in this section, we discuss what an interpretable model is, and we analyze the various dimensions of interpretability as well as the desiderata for an interpretable model. Moreover, we also discuss the meaning of words like interpretability, explainability and comprehensibility which are largely used in the literature.

To interpret means to give or provide the meaning or to explain and present in understandable terms some concept333https://www.merriam-webster.com/. Therefore, in data mining and machine learning, interpretability is defined as the ability to explain or to provide the meaning in understandable terms to a human [doshi2017towards]. These definitions assume implicitly that the concepts expressed in the understandable terms composing an explanation are self-contained and do not need further explanations. Essentially, an explanation is an “interface” between humans and a decision maker that is at the same time both an accurate proxy of the decision maker and comprehensible to humans.

As shown in the previous section another significant aspect to mention about interpretability is the reason why a system, a service or a method should be interpretable. On the other hand, an explanation could be not required if there are no decisions that have to be made on the outcome of the prediction. For example, if we want to know if an image contains a cat or not and this information is not required to take any sort of crucial decision, or there are no consequences for unacceptable results, then we do not need an interpretable model, and we can accept any black box.

3.1 Dimensions of Interpretability

In the analysis of the interpretability of predictive models, we can identify a set of dimensions to be taken into consideration, and that characterize the interpretability of the model [doshi2017towards].

Global and Local Interpretability: A model may be completely interpretable, i.e., we are able to understand the whole logic of a model and follow the entire reasoning leading to all the different possible outcomes. In this case, we are speaking about global interpretability. Instead, we indicate with local interpretability the situation in which it is possible to understand only the reasons for a specific decision: only the single prediction/decision is interpretable.

Time Limitation: An important aspect is the time that the user is available or is allowed to spend on understanding an explanation. The user time availability is strictly related to the scenario where the predictive model has to be used. Therefore, in some contexts where the user needs to quickly take the decision (e.g., a disaster is imminent), it is preferable to have an explanation simple to understand. While in contexts where the decision time is not a constraint (e.g., during a procedure to release a loan) one might prefer a more complex and exhaustive explanation.

Nature of User Expertise: Users of a predictive model may have different background knowledge and experience in the task: decision-makers, scientists, compliance and safety engineers, data scientists, etc. Knowing the user experience in the task is a key aspect of the perception of interpretability of a model. Domain experts may prefer a larger and more sophisticated model over a smaller and sometimes more opaque one.

The works reviewed in the literature only implicitly specify if their proposal is global or local. Just a few of them take into account the nature of user expertise [gibbons2013cad, ribeiro2016should, schetinin2007confident], and no one provides real experiments about the time required to understand an explanation. Instead, some of the works consider the “complexity” of an explanation through an approximation. For example, they define the model complexity as the model’s size (e.g. tree depth, number of rules, number of conjunctive terms) [deng2014interpreting, hara2016making, johansson2004accuracy, ribeiro2016should]. In the following, we further discuss issues related to the complexity of an explanation.

3.2 Desiderata of an Interpretable Model

An interpretable model is required to provide an explanation. Thus, to realize an interpretable model it is necessary to take into account the following list of desiderata which are mentioned by a set of papers in the state of art [andrews1995survey, doshi2017towards, freitas2014comprehensible, johansson2004truth]:

  • Interpretability: to which extent the model and or the prediction are human understandable. The most addressed discussion is related to how the interpretability can be measured. In [freitas2014comprehensible] a component for measuring the interpretability is the complexity of the predictive model in terms of the model size. According to the literature, we refer to interpretability also with the name comprehensibility.

  • Accuracy: to which extent the model accurately predict unseen instances. The accuracy of a model can be measured using various evaluation measures like the accuracy score, the F1-score [tan2006introduction], etc. Producing an interpretable model maintaining competitive levels of accuracy is the most common target among the papers in the literature.

  • Fidelity: to which extent the model is able to accurately imitate a black-box predictor. The fidelity captures how much is good an interpretable model in the mimic of the behavior of a black-box. Similarly to the accuracy, the fidelity is measured in terms of accuracy score, F1-score, etc. but with respect to the outcome of the black box which is considered as an oracle.

Moreover, besides these features strictly related to interpretability, yet according to [andrews1995survey, doshi2017towards, freitas2014comprehensible, johansson2004truth] a data mining and machine learning model should have other important desiderata. Some of these desiderata are related to ethical aspects such as fairness and privacy. The first principle requires that the model guarantees the protection of groups against (implicit or explicit) discrimination [romei2014multidisciplinary]; while the second one requires that the model does not reveal sensitive information about people [aldeen2015comprehensive]. The level of interpretability of a model together with the standards of privacy and non-discrimination which are guaranteed may impact on how much human users trust that model. The degree of trust on a model increases if the model is built by respecting constraints of monotonicity given by the users [martens2011performance, pazzani2001acceptance, verbeke2011building]. A predictor respecting the monotonicity

principle is, for example, a predictor where the increase of the values of a numerical attribute tends to either increase or decrease in a monotonic way the probability of a record of being member of a class

[freitas2014comprehensible]. Another property that influences the trust level of a model is usability: people tend to trust more models providing information that assist them to accomplish a task with awareness. In this line, an interactive and queryable explanation results to be more usable than a textual and fixed explanation.

Data mining and machine learning models should have other important desiderata such as reliability, robustness, causality, scalability and generality. This means that a model should have the ability to maintain certain levels of performance independently from the parameters or from the input data (reliability/robustness) and that controlled changes in the input due to a perturbation affect the model behavior (causality). Moreover, since we are in the Big Data era, it is opportune to have models able to scale to large input data with large input spaces. Finally, since often in different application scenarios one might use the same model with different data, it is preferable to have portable models that do not require special training regimes or restrictions (generality).

3.3 Recognized Interpretable Models

In the state of the art a small set of existing interpretable models is recognized: decision tree, rules, linear models [freitas2014comprehensible, huysmans2011empirical, ribeiro2016should]. These models are considered easily understandable and interpretable for humans.

Figure 1: Example of decision tree.

A decision system based on a decision tree exploits a graph structured like a tree and composed of internal nodes representing tests on features or attributes (e.g., whether a variable has a value lower than, equals to or grater than a threshold, see Figure 1), and leaf nodes representing a class label. Each branch represents a possible outcome [quinlan1993c4]. The paths from the root to the leaves represent the classification rules. Indeed, a decision tree can be linearized into a set of decision rules with the if-then form [quinlan1987generating, quinlan1987simplifying, frank1998generating]:

Here, the outcome corresponds to the class label of a leaf node while the conjunctions of conditions in the if clause correspond to the different conditions in the path from the root to that leaf node.

More generally, a decision rule is a function which maps an observation to an appropriate action. Decision rules can be extracted by generating the so-called classification rules, i.e., association rules that in the consequence have the class label [agrawal1994fast]. The most common rules are if-then rules where the if clause is a combination of conditions on the input variables. In particular, it may be formed by conjunctions, negations and disjunctions. However, methods for rule extraction typically take into consideration only rules with conjunctions. Other types of rules are: m-of-n rules where given a set of conditions if of them are verified then the consequence of the rule is considered true [murphy1991id2]; list of rules where given an ordered set of rules is considered true the consequent of the first rule which is verified [yin2003cpar]; falling rule lists consists of a list of if-then rules ordered with respect to the probability of a specific outcome and the order identifies the example to be classified by that rule [wang2015falling]; decision sets where an unordered set of classification rules is provided such that the rules are not connected by else statements, but each rule is an independent classifier that can assign its label without regard for any other rules [lakkaraju2016interpretable].

The interpretation of rules and decision trees is different with respect to different aspects [freitas2014comprehensible]. Decision trees are widely adopted for their graphical representation, while rules have a textual representation. The main difference is that textual representation does not provide immediately information about the more relevant attributes of a rule. On the other hand, the hierarchical position of the features in a tree gives this kind of clue.

Attributes’ relative importance could be added to rules by means of positional information. Specifically, rule conditions are shown by following the order in which the rule extraction algorithm added them to the rule. Even though the representation of rules causes some difficulties in understanding the whole model, it enables the study of single rules representing partial parts of the whole knowledge (“local patterns”) which are composable. Also in a decision tree, the analysis of each path separately from the leaf node to the root, enables users to focus on such local patterns. However, if the tree is very deep in this case it is a much more complex task. A further crucial difference between rules and decision trees is that in a decision tree each record is classified by only one leaf node, i.e., the class predicted are represented in a mutually exclusive and exhaustive way by the set of leaves and their paths to the root node. On the other hand, a certain record can satisfy the antecedent of rules having as consequent a different class for that record. Indeed, rule based classifiers have the disadvantage of requiring an additional approach for resolving such situations of conflicting outcome [wettschereck1997review]. Many rule based classifiers deal with this issue by returning an ordered rule list, instead of an unordered rule set. In this way it is returned the outcome corresponding to the first rule matching the test record and ignoring the other rules in the list. We notice that ordered rule lists may be harder to interpret than classical rules. In fact, in this model a given rule cannot be considered independently from the precedent rules in the list [wettschereck1997review]. Another widely used approach consists in considering the top-k rules satisfying the test record where the ordering is given by a certain weight (e.g. accuracy, Laplace accuracy, etc.). Then, the outcome of the rules with the average highest weight among the top-k is returned as predicted class [yin2003cpar].

Figure 2: Example of feature importance for a linear model.

Finally, explanations can also be provided through linear models [kononenko2010efficient, ribeiro2016should]. This can be done by considering and visualizing both the sign and the magnitude of the contribution of the attributes for a given prediction (see Figure 2). If the contribution of an attribute-value is positive, then it contributes by increasing the model’s output. Instead, if the sign is negative then the attribute-value decreases the output of the model. If an attribute-value has an higher contribution than another, then it means that it has an higher influence on the prediction of the model. The produced contributions summarize the performance of the model, thus the difference between the predictions of the model and expected predictions, providing the opportunity of quantifying the changes of the model prediction for each test record. In particular, it is possible to identify the attributes leading to this change and for each attribute how much it contributed to the change.

As last remark we point out that in general, when an explanation for a prediction is provided, it is often useful to analyze besides the explanation (satisfied rules, branch of the tree, set of weights, etc.), also instances which are exceptions with respect to the “boundaries” provided by the explanation, or with very few differences with respect to the prototypes returned as explanation. For example, instances covered by the rule body but with an outcome label different from the class of the outcome predicted. Even though this sort of exception analysis is hardly performed, it can be more informative than the direct explanation, and it can also provide clues about the application domain [pappa2005predicting].

3.4 Explanations and Interpretable Models Complexity

In the literature, very little space is dedicated to a crucial aspect: the model complexity. The evaluation of the model complexity is generally tied to the model comprehensibility, and this is a very hard task to address. As a consequence, this evaluation is generally estimated with a rough approximation related to the size of the model. Moreover, complexity is often used as an opposed term to interpretability.

In [hara2016making] the complexity is identified by the number of regions, i.e., the parts of the model, for which the boundaries are defined. In [ribeiro2016should] as complexity for linear models is adopted the number of non-zero weights, while for decision trees the depth of the tree. In [deng2014interpreting] the complexity of a rule (and thus of an explanation) is measured by the length of the rule condition, defined as the number of attribute-value pairs in the condition. Given two rules with similar frequency and accuracy, the rule with a smaller length may be preferred as it is more interpretable. Similarly, in case of lists of rules the complexity is typically measured considering the total number of attribute-value pairs in the whole set of rules. However, this could be a suitable way for measuring the model complexity, since in an ordered rule list different test records need distinct numbers of rules to be evaluated [freitas2014comprehensible]. In this kind of model, a more honest measure could be the average number of conditions evaluated to classify a set of test records [otero2013improving]. However, this is more a “measure of the explanation” of a list of rules.

Differently from the not flexible representation of decision tree where the prediction of a single record is mutually exhaustive and exclusive, rules characterization contains only significant clauses. As a consequence, an optimal set of rules does not contain any duplicated information, given the fact that an outcome label can appear only one time in the consequent of a set of rules, while in a decision tree it typically comes out more than once. Moreover, rules do not capture insignificant clauses, while decision trees can also have insignificant branches. This happens because rule based classifier generally select one attribute-value while expanding a rule, whereas decision tree algorithms usually select one attribute while expanding the tree [freitas2014comprehensible]. Considering these aspects to estimate the complexity is very difficult. Consequently, even though a model equivalence exists, the estimation of the fact that a different representation for the same model (or explanation) is more complex than another when using decision trees or rules can be very subjective with respect to the interpreter.

3.5 Interpretable Data for Interpretable Models

The types of data used for classification may have diverse nature. Different types of data present a different level of interpretability for a human. The most understandable data format for humans is the table [huysmans2011empirical]

. Since matrices and vectors are the typical data representation used by the vast majority of data mining and machine learning techniques, tables are also easily managed by these algorithms without requiring specific transformations.

Other forms of data which are very common in human daily life are images and texts

. They are perhaps for human brain even more easily understandable than tables. On the other hand, the processing of these data for predictive models requires their transformation into vectors that make them easier to process by algorithms but less interpretable for humans. Indeed, on images and texts, the state of art techniques typically apply predictive models based on super vector machine, neural networks or deep neural networks that are usually hard to be interpreted. As a consequence, certain recognized interpretable models cannot be directly employed for this type of data in order to obtain an interpretable model or a human understandable explanation. Transformations using equivalences, approximations or heuristics are required in such a way that images and texts can be employed by prediction systems and used for providing the interpretation of the model and/or the prediction at the same time.

Finally, there exist other forms of data such as sequence data, spatio-temporal data and complex network data that may be used by data mining and machine learning algorithms. However in the literature, to the best of our knowledge, there is no work addressing the interpretability of models for data different from images, texts, and tabular data.

4 Open The Black Box Problems

An accurate analysis and review of the literature lead to the identification of different categories of problems.

At a very high level, we can distinguish between reverse engineering and design of explanations. In the first case, given the decision records produced by a black box decision maker the problem consists in reconstructing an explanation for it. The original dataset upon which the black box is trained is generally not known in real life. Details about reverse engineering approaches are discussed at the end of this section. On the other hand, it is used and exploited to build the explanations by most of the works presented in this survey. In the second case, given a dataset of training decision records the task consists in developing an interpretable predictor model together with its explanations.

Through a deep analysis of the state of the art we are able to further refine the first category obtaining three different problems. We name them black box model explanation problem, black box outcome explanation problem, and black box inspection problem. We name the second category transparent box design problem. All these problems can be formalized as specific cases of the general classification problems with the common target of providing an interpretable and accurate predictive model. Details of the formalization are provided in the following sections. Other important variants are generally not treated in the literature making the problem of discovering an explanation increasingly difficult: (i) Is it allowed to query the black box at will to obtain new decision examples, or only a fixed dataset of decision records is available? (ii) Is the complete set of features used by the decision model known, or instead only part of these features is known? In this survey we do not address these issues as in the literature there is not sufficient material.

4.1 Problem Formulation

In the following, we generalize the classification problem (see Figure 3).

A predictor, also named model or classifier, is a function where is the feature space with corresponding to the number of features, and is the target space. The feature space can correspond to any basic data type like the set of integers , reals , booleans , and strings , where and is the alphabet (a finite non-empty set of symbols). The feature space can also be a complex data type composed of different basic data type. For example, contains an integer feature, two real features and a string feature. On the other hand, the target space (with dimensionality equals to one) contains the different labels (classes or outcomes) and identifies a semantic concept where can be a set of booleans, integers or strings.

A predictor is the output of a learner function such that . The learner takes as input a dataset with samples where and and returns the predictor . Given a data record in the feature space , the predictor can be employed to predict the target value , i.e., .

Figure 3: Classification Problem.

Typically, in supervised learning

[tan2006introduction], a training dataset is used for training the learner which builds the predictor , and a test dataset is used for evaluating the performance of . Given , the evaluation is performed by observing for each couple of data record and target value the number of correspondences between and .

In the following we indicate with a black box predictor belonging to the set of uninterpretable data mining and machine learning models. According to Section 3, is a black box because the reasoning behind the function is not understandable by humans and the outcome returned does not provide any clue for its choice. In real-world applications, is an opaque classifier resulting from a learning . Similarly, we indicate with a comprehensible predictor for which is available a global or a local explanation.

The performance of the comprehensible predictor is generally evaluated by two measures. The accuracy is used to evaluate how good are the performance of both the black box predictor and the comprehensible predictor . The fidelity is employed to evaluate how good is the comprehensible predictor in mimicking the black box predictor . Indeed, given a data set we can apply to each record both the predictors: (i) for the black box we get the set of predictions , while (ii) for the comprehensible predictor we get the set of predictions .

Thus, we can evaluate the accuracy of the black box and of the comprehensible predictor by comparing the real target values against the predicted target values , and with and , respectively. Moreover, we can evaluate the behavior of the predictor with respect to evaluating the fidelity of by means of the function . Note that the score can be calculated by applying the same calculus of the function where as target value is used the prediction of the black box instead of the real values .

Black Box Model Explanation

Given a black box model solving a classification problem, the black box explanation problem consists in providing an interpretable and transparent model which is able to mimic the behavior of the black box and which is also understandable by humans (see Figure 4). In other words, the interpretable model approximating the black box must be globally interpretable. As consequence, we define the black box model explanation problem as follows:

Definition 1 (Black Box Model Explanation)

Given a black box predictor and a dataset , the black box model explanation problem consists in finding a function which takes as input a black box and a dataset , and returns a comprehensible global predictor , i.e., , such that is able to mimic the behavior of , and exists a global explanator function that can derive from a set of explanations modeling in a human understandable way the logic behind , i.e., .

A large set of the papers reviewed in this survey describe various designs for the function to solve the black box explanation problem. The set of explanations can be modeled for example by a decision tree or by a set of rules [huysmans2011empirical], while the comprehensible global predictor is the predictor returning as global explanation the decision tree or the set of rules.

Figure 4: Black Box Model Explanation Problem.

Black Box Outcome Explanation

Given a black box model solving a classification problem, the black box outcome explanation problem consists in providing an interpretable outcome, that is a method for providing an explanation for the outcome of the black box. In other words, the interpretable model must return the prediction together with an explanation about the reasons for that prediction, i.e., the prediction is only locally interpretable. It is not required to explain the whole logic behind the black box but only the reasons for the choice of a particular instance. Consequently, we define the black box outcome explanation problem as:

Definition 2 (Black Box Outcome Explanation)

Given a black box predictor and a dataset , the black box outcome explanation problem consists in finding a function which takes as input a black box and a dataset , and returns a comprehensible local predictor , i.e., , such that is able to mimic the behavior of , and exists a local explanator function which takes as input the black box , the comprehensible local predictor , and a data record with features in , and returns a human understandable explanation for the data record , i.e., .

We report in this survey recent works describing very diversified approaches to implement function , overcoming the limitations of explaining the whole model (illustrated in Section 6). As an example, in this view of the problem, we can consider that the explanation may be either a path of a decision tree or an association rule [freitas2014comprehensible].

Figure 5: Black Box Outcome Explanation Problem.

Black Box Inspection Problem

Given a black box model solving a classification problem, the black box inspection problem consists in providing a representation (visual or textual) for understanding either how the black box model works or why the black box returns certain predictions more likely than others.

Definition 3 (Black Box Inspection Problem)

Given a black box predictor and a dataset , the black box inspection problem consists in finding a function which takes as input a black box and a dataset , and returns a visual representation of the behavior of the black box, with being the set of all possible representations.

For example, the function may be a technique based on sensitivity analysis that, by observing the changing occurring in the predictions when varying the input of , returns a set of visualizations (e.g, partial dependence plots [krause2016interacting], or variable effect characteristic curve [cortez2013using]) highlighting the feature importance for the predictions.

Figure 6: Black Box Inspection Problem.

Transparent Box Design Problem

Given a classification problem the transparent box design problem consists in providing a model which is locally or globally interpretable on its own.

Definition 4 (Transparent Box Design Problem)

Given a dataset , the transparent box design problem consists in finding a learning function which takes as input the dataset and returns a (locally or globally) comprehensible predictor , i.e., . This implies that there exists a local explanator function or a global explanator function (defined as before) that takes as input the comprehensible predictor and returns a human understandable explanation or a set of explanations .

For example, the functions and may be the decision tree learner and predictor respectively, while the global explanator may return as explanation a system for following the choices taken along the various branches of the tree, and may return a textual representation of the path followed according to the decision suggested by the predictor.

Figure 7: Transparent Box Design Problem.

Thus, according to our problem definitions, in this survey, when we say that a method is able to open the black box, we are referring to one of the following statements: (i) it explains the model, (ii) it explains the outcome, (iii) it can inspect the black box internally, (iv) it provides a transparent solution.

5 Problem And Explanator Based Classification

In this survey, we propose a classification based on the type of problem faced and on the explanator adopted to open the black box. In particular, in our classification we take into account the following features:

  • the type of problem faced (according to the definitions in Section 4);

  • the type of explanator adopted to open the black box;

  • the type of black box model that the explanator is able to open;

  • the type of data used as input by the black box model.

In each section we group together all the papers that share the same problem definition, while the subsections correspond to the different solutions adopted to develop the explanators. In turn, in each subsection, we group the papers that try to explain the same type of black box. Finally, we keep the type of data used by the black box as a feature which is specified for each work analyzed.

We organize the sections discussing the different problems as follows. In Section 6 we analyze the papers presenting approaches to solve the black box model explanation problem. These approaches provide a globally interpretable predictor which is able to mimic the black box. On the other hand, in Section 7 are reviewed the methods solving the black box outcome explanation problem: the predictor returned is locally interpretable and provides an explanation only for a given record. In Section 8 we discuss the papers proposing methodologies for inspecting black boxes, i.e., not providing a comprehensible predictor but a visualization tool for studying how the black box work internally, and what can happen when a certain input is provided. Finally, in Section 9 we report the papers designing a transparent predictor to overcome the “obscure” limitation of black boxes. These approaches try to provide a global or local interpretable model without sacrificing the accuracy of a black box learned to solve the same task.

For each of the sections above, we propose a further categorization with respect to the type of explanator adopted. This categorization reflects on the papers grouped into the various subsections:

  • Decision Tree (DT) or Single Tree. It is commonly recognized that decision tree is one of the more interpretable and easily understandable models, primarily for global, but also for local, explanations. Indeed, a very widespread technique for opening the black box is the so-called “single tree approximation”.

  • Decion Rules (DR) or Rule Based Explanator. Decision rules are among the more human understandable techniques. There exist various types of rules (illustrated in Section 3.3). They are used to explain the model, the outcome and also for the transparent design. We remark the existence of techniques for transforming a tree into a set of rules.

  • Features Importance (FI). A very simple but effective solution acting as either global or local explanation consists in returning as explanation the set of features used by the black box together with their weight.

  • Salient Mask (SM). An efficient way of pointing out what causes a certain outcome, especially when images or texts are treated, consists in using “masks” visually highlighting the determining aspects of the record analyzed. They are generally used to explain deep neural networks.

  • Sensitivity Analysis (SA). It consists of evaluating the uncertainty in the outcome of a black box with respect to different sources of uncertainty in its inputs. It is generally used to develop visual tools for black box inspection.

  • Partial Dependence Plot (PDP). These plots help in visualizing and understanding the relationship between the outcome of a black box and the input in a reduced feature space.

  • Prototype Selection (PS). This explanator consists in returning, together with the outcome, an example very similar to the classified record, in order to make clear which criteria the prediction was returned. A prototype is an object that is representative of a set of similar instances and is part of the observed points, or it is an artifact summarizing a subset of them with similar characteristics.

  • Neurons Activation (NA)

    . The inspection of neural networks and deep neural network can be carried out also by observing which are the fundamental neurons activated with respect to particular input records.

In the following, we list all the black boxes opened in the reviewed papers. These black boxes are all supervised learning algorithm designed to solve a classification problem [tan2006introduction].

  • Neural Network (NN). Inspired by biological neural networks, artificial neural networks learn to do tasks by considering examples. A NN is formed by a set of connected neurons. Each link between neurons can transmit a signal. The receiving neuron can process the signal and then transmit to downstream neurons connected to it. Typically, neurons are organized in layers. Different layers perform different transformations on their inputs. Signals travel from the input layer, to the output layer, passing through the hidden layer(s) in the middle multiple times. Neurons and connections may also have a weight that varies as learning proceeds, which can increase or decrease the strength of the signal that it sends.

  • Tree Ensemble (TE)

    . Ensemble methods combine more than one learning algorithm to improve the predictive power of any of the single learning algorithms that they combines. Random forests, boosted trees and tree bagging are examples of TEs. They combine the predictions of different decision trees each one trained on an independent subset of the input data.

  • Support Vector Machine (SVM)

    . Support Vector Machines utilize a subset of the training data, called support vectors, to represent the decision boundary. A SVN is a classifier that searches for hyperplanes with the largest margin for the decision boundary.

  • Deep Neural Network (DNN)

    . A DNN is a NN that can model complex non-linear relationship with multiple hidden layers. A DNN architecture is formed by a composition of models expressed as a layered combination of basic units. In DNNs the data typically flows from the input to the output layer without looping back. The most used DNN are Recurrent Neural Networks (RNNs). A peculiar component of RNNs are Long Short-Term Memory (LSTM) nodes which are particularly effective for language modeling. On the other hand, in image processing Convolutional Neural Networks (CNNs) are typically used.

Moreover, recently agnostic approaches for explaining black boxes are being developed. An Agnostic Explanator (AGN) is a comprehensible predictor which is not tied to a particular type of black box, explanation or data type. In other words, in theory, an agnostic predictor can explain indifferently a neural network or a tree ensemble using a single tree or a set of rules. Since only a few approaches in the literature describe themselves to be fully agnostic, and since the principal task is to explain a black box predictor, in this paper, if not differently specified, we term agnostic the approaches defined to explain any type of black box.

The types of data used as input of black boxes analyzed in this survey are the following:

  • Tabular (TAB). With tabular data, we indicate any classical dataset in which every record shares the same set of features and each feature is either numerical, categorical or boolean.

  • Image (IMG). Many black boxes work with labeled images. These images can be treated as they are by the black box or can be preprocessed (e.g, re-sized in order to have all the same dimensions).

  • Text (TXT). As language modeling is one of the tasks most widely assessed nowadays together with image recognition, labeled datasets of text are generally used for tasks like spam detection or topic classification.

In data mining and machine learning many other types of data are also used like sequences, networks, mobility trajectories, etc. However, they are not used as input in the methods of the papers proposing a solution for opening the black box.

Table 1 lists the methods for opening and explaining black boxes and summarizes the various fundamental features and characteristics listed so far, together with additional information that we believe could be useful for the reader. The columns Examples, Code and Dataset indicates if any kind of example of explanation is shown in the paper, and if the source code and the dataset used in the experiments are publicly available, respectively. The columns General and Random are discussed in the following section. We point out that Table 1 reports the main references only, while existing extensions or derived works are discussed in the survey. Table 2 reports the legend of Table 1, i.e., the expanded acronym and the meaning of the features in Table 1. Moreover, in order to provide the reader with a useful tool to find a particular set of papers with determined characteristics, Appendix 0.A provides Tables 3, 4 and 5, in which are reported the list of the papers with respect to each problem, explanator and black box, respectively.

-3.5cm Name Ref. Authors Year Problem Explanator Black Box Data Type General Random Examples Code Dataset Trepan [craven1996extracting] Craven et al. 1996 Model Expl. DT NN TAB - [krishnan1999extracting] Krishnan et al. 1999 Model Expl. DT NN TAB DecText [boz2002extracting] Boz 2002 Model Expl. DT NN TAB GPDT [johansson2009evolving] Johansson et al. 2009 Model Expl. DT NN TAB Tree Metrics [chipman1998making] Chipman et al. 1998 Model Expl. DT TE TAB CCM [domingos1998knowledge] Domingos et al. 1998 Model Expl. DT TE TAB - [gibbons2013cad] Gibbons et al. 2013 Model Expl. DT TE TAB STA [zhou2016interpreting] Zhou et al. 2016 Model Expl. DT TE TAB CDT [schetinin2007confident] Schetinin et al. 2007 Model Expl. DT TE TAB - [hara2016making] Hara et al. 2016 Model Expl. DT TE TAB TSP [tan2016tree] Tan et al. 2016 Model Expl. DT TE TAB Conj Rules [craven1994using] Craven et al. 1994 Model Expl. DR NN TAB G-REX [johansson2003rule] Johansson et al. 2003 Model Expl. DR NN TAB REFNE [zhou2003extracting] Zhou et al. 2003 Model Expl. DR NN TAB RxREN [augasta2012reverse] Augasta et al. 2012 Model Expl. DR NN TAB SVM+P [nunez2002rule] Nunez et al. 2002 Model Expl. DR SVM TAB - [fung2005rule] Fung et al. 2005 Model Expl. DR SVM TAB inTrees [deng2014interpreting] Deng 2014 Model Expl. DR TE TAB - [lou2013accurate] Lou et al. 2013 Model Expl. FI AGN TAB GoldenEye [henelius2014peek] Henelius et al. 2014 Model Expl. FI AGN TAB PALM [krishnan2017palm] Krishnan et al. 2017 Model Expl. DT AGN ANY - [tolomei2017interpretable] Tolomei et al. 2017 Model Expl. FI TE TAB - [xu2015show] Xu et al. 2015 Outcome Expl. SM DNN IMG - [fong2017interpretable] Fong et al. 2017 Outcome Expl. SM DNN IMG CAM [zhou2016learning] Zhou et al. 2016 Outcome Expl. SM DNN IMG Grad-CAM [selvaraju2016grad] Selvaraju et al. 2016 Outcome Expl. SM DNN IMG - [lei2016rationalizing] Lei et al. 2016 Outcome Expl. SM DNN TXT LIME [ribeiro2016nothing] Ribeiro et al. 2016 Outcome Expl. FI AGN ANY MES [turner2016model] Turner et al. 2016 Outcome Expl. DR AGN ANY NID [olden2002illuminating] Olden et al. 2002 Inspection SA NN TAB GDP [baehrens2010explain] Baehrens 2010 Inspection SA AGN TAB IG [sundararajan2017axiomatic] Sundararajan 2017 Inspection SA DNN ANY VEC [cortez2011opening] Cortez et al. 2011 Inspection SA AGN TAB VIN [hooker2004discovering] Hooker 2004 Inspection PDP AGN TAB ICE [goldstein2015peeking] Goldstein et al. 2015 Inspection PDP AGN TAB Prospector [krause2016interacting] Krause et al. 2016 Inspection PDP AGN TAB Auditing [adler2016auditing] Adler et al. 2016 Inspection PDP AGN TAB OPIA [adebayo2016iterative] Adebayo et al. 2016 Inspection PDP AGN TAB - [yosinski2015understanding] Yosinski et al. 2015 Inspection NA DNN IMG TreeView [thiagarajan2016treeview] Thiagarajan et al. 2016 Inspection DT DNN TAB IP [shwartz2017opening] Shwartz et al. 2017 Inspection NA DNN TAB - [radford2017learning] Radford 2017 Inspection NA DNN TXT CPAR [yin2003cpar] Yin et al. 2003 Transp. Design DR - TAB FRL [wang2015falling] Wang et al. 2015 Transp. Design DR - TAB BRL [letham2015interpretable] Letham et al. 2015 Transp. Design DR - TAB TLBR [su2015interpretable] Su et al. 2015 Transp. Design DR - TAB IDS [lakkaraju2016interpretable] Lakkaraju et al. 2016 Transp. Design DR - TAB Rule Set [wang2016bayesian] Wang et al. 2016 Transp. Design DR - TAB 1Rule [malioutov2017learning] Malioutov et al. 2017 Transp. Design DR - TAB PS [bien2011prototype] Bien et al. 2011 Transp. Design PS - ANY BCM [kim2014bayesian] Kim et al. 2014 Transp. Design PS - ANY - [mahendran2015understanding] Mahendran et al. 2015 Transp. Design PS - IMG - [kononenko2010efficient] Kononenko et al. 2010 Transp. Design FI - TAB OT-SpAMs [wang2015trading] Wang et al. 2015 Transp. Design DT - TAB

Table 1: Summary of methods for opening and explaining black boxes.
Feature Description
Problem Model Explanation, Outcome Explanation, Black Box Inspection, Transparent Design
Explanator DT - Decision Tree, DR - Decision Rules, FI - Features Importance, SM - Saliency Masks, SA - Sensitivity Analysis, PDP - Partial Dependence Plot, NA - Neurons Activation, PS - Prototype Selection
Black Box NN - Neural Network, TE - Tree Ensemble, SVM - Support Vector Machines, DNN - Deep Neural Network, AGN - AGNostic black box
Data Type TAB - TABular, IMG - IMaGe, TXT - TeXT, ANY - ANY type of data
General Indicates if an explanatory approach can be generalized for every black box, i.e., it does not consider peculiarities of the black box to produce the explanation
Random Indicates if any kind of random perturbation or permutation of the original dataset is required for the explanation
Examples Indicates if example of explanations are shown in the paper
Code Indicates if the source code is available
Dataset Indicates if the datasets used in the experiments are available
Table 2: Legend of Table  1. In the following are described the features reported and the abbreviations adopted.

Reverse Engineering: A Common Approach For Understanding The Black Box

Before proceeding in the detailed analysis and classification of papers proposing method for understanding black boxes , we present in this section the most largely used approach to solve the black box model and outcome explanation problems and the black box inspection problem. We refer to this approach as reverse engineering because the black box predictor is queried with a certain test dataset in order to create an oracle dataset that in turn will be used to train the comprehensible predictor (see Figure 8). The name reverse engineering comes from the fact that we can only observe the input and output of the black box.

Figure 8: Reverse Engineering approach: the learned black box predictor is queried with a test dataset to produce an oracle which associate to each record a label which is not real but assigned by the black box.

With respect to the black box model and outcome explanation problems, the possibility of action tied with this approach relies on the choice of adopting a particular type of comprehensible predictor, and in the possibility of querying the black box with input records created in a controlled way and/or by using random perturbations of the initial train or test dataset. Regarding the random perturbations of the input used to feed the black box, it is important to recall that recent studies discovered that DNN built for classification problems on texts and images can be easily fooled (see Section 2). Not human perceptible changes in an image can lead a DNN to label the record as something else. Thus, according to these discoveries, the methods treating images or text, in theory, should not be enabled to use completely random perturbations of their input. However, this is not always the case in practice [ribeiro2016should].

Such reverse engineering approach can be classified as generalizable or not (or pedagocial vs. decompsitional as described in [martens2007comprehensible]). We say that an approach is generalizable when a purely reverse engineering procedure is followed, i.e., the black box is only queried with different input records to obtain an oracle used for learning the comprehensible predictor (see Figure 9-(left)). In other words, internal peculiarities of the black box are not exploited to build the comprehensible predictor. Thus, if an approach is generalizable, even though it is presented to explain a particular type of black box, in reality, it can be used to interpret any kind of black box predictor. That is, it is an agnostic approach for interpreting black boxes. On the other hand, we say that an approach is not generalizable if it can be used to open only that particular type of black box for which it was designed for (see Figure 9-(right)). For example, if an approach is designed to interpret random forest and internally use a concept of distance between trees, then such approach can not be utilized to explain predictions of a NN. A not generalizable approach can not be black box agnostic.

Figure 9: (Left) Generalizable reverse engineering approach: internal peculiarities of the black box are not exploited to build the comprehensible predictor. (Right) Not Generalizable reverse engineering approach: the comprehensible predictor is the result of a procedure involving internal characteristics of the black box.

In Table 1 we keep track of these aspects with the two features General and Random. With General we indicate if an explanatory approach can be generalized for every black box, while with Random we indicate if any kind of random perturbation or permutation of the original dataset is used by the explanatory approach.

In light of these concepts, as the reader will discover below, a further classification not explicitly indicated emerges from the analysis of these papers. This fact can be at the same time a strong point or a weakness of the current state of the art. Indeed, we highlight that the works for opening the black box are realized for two cases. The first (larger) group contains approaches proposed to tackle a particular problem (e.g., medical cases) or to explain a particular type of black box, that is, the solutions are specific for the problem instance. The second group contains general purpose solutions that try to be general as much as possible and propose agnostic and generalizable solutions.

6 Solving the Black Box Model Explanation Problem

In this section we review the methods for opening the black box facing the black box model explanation problem (see Section 4). That is, the proposed methods provide globally interpretable models which are able to mimic the behavior of black boxes and which are also understandable by humans. We recognized different groups of approaches. In Section 6.1 we analyze the proposals using a decision tree as explanator, while in Section 6.2 they use rules. Section 6.3 describes the methods which are designed to work with any type of black box. Finally, Section 6.4 contains the remaining ones.

6.1 Explanation via Single Tree Approximation

In this section we present a set of works addressing the black box model explanation problem by implementing in different ways the function . All the following works adopt a decision tree as comprehensible global predictor , and consequently represent the explanation with the decision tree itself. Moreover, we point out that all the methods presented in this section work on tabular data.

6.1.1 Explanation of Neural Networks.

The following papers describe the implementation of functions which are able to interpret a black box consisting in a Neural Network (NN) [tan2006introduction] with a comprehensible global predictor consisting in a decision tree. In these works, the NNs are considered black-boxes, i.e., the only interface permitted is presenting an input to the neural network and obtaining the outcome . The final goal is to comprehend how the neural networks behave by submitting to it a large set of instances and analyzing their different predictions.

Single tree approximations for NNs were first presented in 1996 by Craven et al. [craven1996extracting]. The comprehensible representations of the neural network is returned by Trepan which is the implementation of function . Trepan queries the neural network to induce a decision tree approximating the concepts represented by the networks by maximizing the gain ratio [tan2006introduction] together with an estimation of the current model fidelity. Another advantage of Trepan with respect to common tree classifiers like ID3 or C4.5 [tan2006introduction] is that, thanks to the black box , it can use as many instances as desired for each split, so that also the node splits near to the bottom of the tree are realized using a considerable amount of training data.

In [krishnan1999extracting], Krishnan et al. present a three step method . The first step generates a sort of “prototype” for each target class in

by using genetic programming to query the trained neural network

. The input features dataset is exploited for constraining the prototypes. The second step selects the best prototypes for inducing the learning of the decision tree in the third step. This approach leads to get more understandable and smaller decision trees starting from smaller data sets.

In [boz2002extracting], Boz describes DecText, another procedure that uses a decision tree to explain neural network black boxes . The overall procedure recalls Trepan [craven1996extracting] with the innovation of four splitting methods aimed at finding the most relevant features during the tree construction. Moreover, since one of the main purposes of the tree is to maximize the fidelity while keeping the model simple, a fidelity pruning strategy to reduce the tree size is defined. A set of random instances are generated. Then, starting from the bottom of the tree, for each internal node a leaf is created with the majority label using the labeling of the random instances. If the fidelity of the new tree overtakes the old one, than the maximum fidelity and the tree are updated.

In [johansson2009evolving] Johansson et al. use Genetic Programming to evolve Decision Trees (the comprehensible global predictor ), in order to mimic the behavior of a neural network ensemble (the black box ). The dataset used by genetic programming (implementing function ) consists of a lot of different combinations of the original data and oracle data labeled by . The paper shows that trees based only on original training data have the worst performance in terms of accuracy in the test data, while the trees evolved using both the oracle guide and the original data produce significantly more accurate trees .

We underline that, even though these approaches are developed to explain neural networks, since peculiarities of the neural networks are not used by , which uses only as an oracle, these approaches can be potentially adopted as agnostic explanators, i.e., they can be used to open any kind of black box and represent it with a single tree.

6.1.2 Explanation of Tree Ensembles.

Richer collections of trees provide higher performance and less uncertainty in the prediction. On the other hand, it is generally difficult to make sense of the resultant forests. The papers presented in this section describe functions for approximating a black box model consisting in Tree Ensembles (TE) [tan2006introduction] (e.g. random forests) with a global comprehensible predictor in the form of a decision tree, and explanation as a the decision tree as before.

Unlike previous works, the tree ensembles are not only viewed as black boxes, but also some of their internal features are used to derive the global comprehensible model . For example, Chipman et al., in [chipman1998making] observe that although hundreds of distinct trees are identified by random forests, in practice, many of them generally differ only by few nodes. In addition, some trees may differ only in the topology, but use the same partitioning of the feature space . The paper proposes several measures of dissimilarity for trees. Such measures are used to summarize forest of trees through clustering, and finally use archetypes of the associated clusters as model explanation. Here, corresponds to the clustering procedure, and the global comprehensible predictor is the set of tree archetypes minimizing the distance among all the trees in each cluster. In this approach, does not extend the input dataset with random data.

On the other hand, random data enrichment and model combination are the basis for the Combined Multiple Model (CCM) procedure presented in [domingos1998knowledge]. Given the tree ensemble black box , it first modifies times the input dataset and learns a set of black boxes , and then it randomly generates data record which are labeled using a combination (e.g. bagging) of the black boxes , i.e., . In this way, the training dataset is increased. Finally, it builds the global comprehensible model as a decision tree (C4.5 [quinlan1993c4]) on the enriched dataset . Since it is not exploiting particular features of the tree ensemble , also this approach can be generalized with respect to the black box . In line with [domingos1998knowledge], the authors of [gibbons2013cad] generate a very large artificial dataset using the prediction of the random forest , then explain by training a decision tree on this artificial dataset in order to mime the behavior of the random forest. Finally, they improve the comprehensibility of by cutting the decision tree with respect to a human understandable depth (i.e., from 6 to 11 nodes of depth). [zhou2016interpreting] proposes Single Tree Approximation (STA), an extension of [gibbons2013cad] which empowers the construction of the final decision tree by using test hypothesis to understand which are the best splits observing the Gini indexes on the trees of the random forest .

Schetinin et al. in [schetinin2007confident] present an approach for the probabilistic interpretation of the black box Bayesian decision trees ensembles [breiman1984classification] through a quantitative evaluation of uncertainty of a Confident Decision Tree (CDT) . The methodology for interpreting is summarized as follows: (i) the classification confidence for each tree in the ensemble is calculated using the training data , (ii) the decision tree that covers the maximal number of correct training examples is selected, keeping minimal the amount of misclassifications on the remaining examples by sub-sequentially refining the training dataset . Similarly to [chipman1998making], also this explanation method does not extend the input dataset with random data and cannot be generalized to other black boxes but can be used only with Bayesian decision tree ensembles.

In [hara2016making], Hara et al. reinterpret Additive Tree Models (ATM) (the black box ) using a probabilistic generative model interpretable by humans. An interpretable ATM has a sufficiently small number of regions. Therefore, their aim is to reduce the number of regions in an ATM while minimizing the model error. To satisfy these requirements, they propose a post processing method that works as follows. First, it learns an ATM generating a number of regions. Then, it mimics using a simpler model (the comprehensible global predictor

) where the number of regions is fixed as small, e.g., ten. In particular, to obtain the simpler model an Expectation Maximization algorithm is adopted


minimizing the Kullback-Leibler divergence from the ensemble


The authors of [tan2016tree] propose Tree Space Prototype (TSP), an approach for interpretating tree ensembles (the black box ) by finding tree prototypes (the comprehensible global predictor ) in the tree space. The main contributions for are: (i) the definition of the random forest proximity between trees, and (ii) the design of the procedure to extract the tree prototypes used for classification.

6.2 Explanation via Rule Extraction

Another commonly used state of the art interpretable and easily understandable model is the set of rules. When a set of rules describing the logic behind the black box model is returned the interpretability is provided at a global level. In the following, we present a set of reference works solving the black box model explanation problem by implementing in different ways function , and by adopting any kind of decision rules as comprehensible global predictor . Hence, the global explanation change accordingly to the type of rules extracted by . Similarly to the previous section, also all the methods presented in this section work on tabular data.

6.2.1 Explanation of Neural Networks.

The following papers describe the implementation of functions which are able to interpret a black box consisting in a Neural Network (NN) [tan2006introduction]. In the literature already exists a survey specialized on techniques extracting rules from neural networks [andrews1995survey]. It provides an overview of mechanisms designed to (i) insert knowledge into neural networks (knowledge initialization), (ii) extract rules from trained NNs (rule extraction), and (iii) use NNs to refine existing rules (rule refinement). The approaches presented in [andrews1995survey] are strongly dependent on the black box and on the specific type of decision rules . Thus, they are not generalizable and can not be employed to solve other instances of the black box model explanation problem. The survey [andrews1995survey] classifies the methods according to the following criteria:

  • Expressive power of the extracted rules.

  • Translucency: that is decompositional, pedagogical and eclectic properties.

  • Portability of the rule extraction technique.

  • Quality of the rules extracted. Quality includes accuracy, fidelity, consistency, i.e., different training of the NN extract the rules that lead to the same classification of unseen examples.

  • Algorithmic complexity.

A typical paper analyzed in [andrews1995survey] is [craven1994using] where Craven et al. present a method to explain the behavior of a neural network by transforming rule extraction (which is a search problem) into a learning problem. The original training data and a randomized extension of it are provided as input to the black box . If the input with outcome is not covered by the set of rules, then a conjunctive (or m-of-n) rule is formed from considering all the possible antecedents. The procedure ends when all the target classes have been processed.

In [johansson2004accuracy] Johansson et al. exploit G-REX [johansson2003rule], an algorithm for rule extraction, as function to explain a neural network . They use the classical reverse engineering schema where random permutations of the original dataset are annotated by , and such dataset is used as input by G-REX, which corresponds with in this case. In particular, G-REX extracts rules by exploiting genetic programming as a key concept. In subsequent works, the authors show that the proposed methodology can be also employed to interpret trees ensembles. [johansson2004truth] extends G-REX for handling regression problems by generating regression trees, and classification problems by generating fuzzy rules.

In [zhou2003extracting] the authors present REFNE, an approach to explain neural network ensembles . REFNE uses ensembles for generating instances and then, extracts symbolic rules from those instances. REFNE avoids useless discretizations of continuous attributes, by applying a particular discretization leading to discretize different continuous attributes using different intervals. Moreover, REFNE can also be used as a rule learning approach, i.e., it solves the transparent box design problem (see Section 4). Also in [augasta2012reverse] Augasta et al. propose RxREN a rule extraction algorithm which returns the explanation of a trained NN . The method works as follows. First, it prunes the insignificant input neurons from trained NNs and identifies the data range necessary to classify the given test instance with a specific class. Second, using a reverse engineering technique, through RxREN generates the classification rules for each class label exploiting the data ranges previously identified, and improve the initial set of rules by a process that prunes and updates the rules.

6.2.2 Explanation of Support Vector Machines.

The following papers show implementations of functions for explaining a black box consisting in a Support Vector Machine (SVM) [tan2006introduction] still returning a comprehensible global predictor consisting in a set of decision rules.

The authors of [nunez2002rule] propose the SVM+Prototypes (SVM+P) procedure for rule extraction from support vector machines . It works as follows: it first determines the decision function by means of a SVM, then a clustering algorithm is used to find out a prototype vector for each class. By using geometric methods, these points are joined with the support vectors for defining ellipsoids in the input space that can be transformed into if-then rules.

Fung et al., in [fung2005rule], describe as function an algorithm based on constraint programming for converting linear SVM (and other hyperplane-based linear classifiers) into a set of non overlapping and interpretable rules . These rules are asymptotically equivalent to the original linear SVM. Each iteration of the algorithm for extracting the rules is designed to solve a constrained optimization problem having a low computational cost. We underline that this black box explanation solution is not generalizable and can be employed only for Linear SVM-like black boxes.

In [martens2007comprehensible] the authors propose a qualitative comparison of the explanations returned by techniques for extraction of rules from SVM black boxes (e.g. SVM+P [nunez2002rule], Fung method [fung2005rule]) against the redefining of methods designed for explaining neural networks, i.e., C4.5 [tan2006introduction], Trepan [craven1996extracting] and G-REX [johansson2003rule]. How we anticipated in the previous section, the authors delineate the existence of two type of approaches to extract rules: pedagogical and decompositional. Pedagogical techniques directly extract rules which relate the inputs and outputs of the predictor (e.g. [craven1996extracting, johansson2003rule]), while decompositional approaches are closely intertwined with the internal structure of the SVM (e.g. [nunez2002rule, fung2005rule]). We recall that, in Table 1 we identify with the term generalizable the pedagogical approaches.

Figure 10: From [martens2007comprehensible]: pedagogical (a) and decompositional (b) rule extraction techniques.

6.2.3 Explanation of Tree Ensembles.

Finally, in [deng2014interpreting], Deng proposes the inTrees framework to explain black boxes defined as Tree Ensembles (TE) by returning a set of decision rules . In particular, InTrees extracts, measures, prunes and selects rules from tree ensembles, and calculates frequent variable interactions. The set of black boxes that inTrees can explain is represented by any kind of tree ensemble like random forests, regularized random forests and boosted trees. InTrees framework can be used for both classification and regression problems. The technique described by InTrees is also known as Simplified Tree Ensamble Learner (STEL): it extracts the most supported and simplest rules form the trees ensemble.

6.3 Agnostic Explanator

Recent approaches for interpretation are agnostic (AGN) with respect to the black box to be explained. In this section, we present a set of works solving the black box model explanation problem by implementing function such that any type of black box can be explained. These approaches do not return a specific comprehensible global predictor , thus the type of explanation change with respect to and . By definition all these approaches are generalizable.

Probably the first attempt of an agnostic solution was proposed in [lou2012intelligible]. Lou et al. propose a method which exploits Generalized Additive Models (GAMs) and it is able to interpret regression splines (linear and logistics), single trees and tree ensembles (bagged trees, boosted trees, boosted bagged trees and random forests). GAMs are presented as the gold standard for intelligibility when only univariate terms are considered. Indeed, the explanation is returned as the importance of the contribution of the individual features in together with their shape function, such that the impact of each predictor can be quantified. A shape function is the plot of a function capturing the linearities and nonlinearities together with its shape. It works on tabular data. A refinement of the GAM approach is proposed by the same authors in [lou2013accurate]. A case study on health care showing the application of the GAM the refinement is presented in [caruana2015intelligible]. In particular, this approach is used for the prediction of the pneumonia risk and hospital 30-day readmission.

In [henelius2014peek] the authors present an iterative algorithm that allows finding features and dependencies exploited by a classifier when producing a prediction. The attributes and the dependencies among the grouped attributes depict the global explanation . The proposed approach named GoldenEye is based on tabular data randomization (within class permutation, dataset permutation, etc.) and on grouping attributes with interactions have an impact on the predictive power.

In [krishnan2017palm], PALM is presented (Partition Aware Local Model) to implement . In particular, PALM is a method that is able to learn and summarize the structure of the training dataset to help the machine learning debugging. PALM mimes a black box using a meta-model for partitioning the training dataset, and a set of sub-models for approximating and miming the patterns within each partition. As meta-model it uses a decision tree () so that the user can examine its structure and determine if the rules detected follow the intuition or not, and link efficiently problematic test records to the responsible train data. The sub-models linked to the leaves of the tree can be a arbitrarily complex model able to catch elaborate local patterns, but yet interpretable by humans. Thus, with respect to the final sub-models PALM is not only black box agnostic but also explanator agnostic. Moreover, PALM is also data agnostic; i.e., it can work on any kind of data.

6.4 Explanation via Other Approaches

In [tolomei2017interpretable] a solution for the black box model explanation problem is presented. It adopts an approach that can not be classified as one of the previous. The proposed approach uses the internals of a random forest model to produce recommendations on the transformation of true negative examples into positively predicted examples. These recommendations, which are strictly related to the feature importance, corresponds to the comprehensible global predictor . In particular, the function aims at transforming a negative instance into a positive instance by analyzing the path on the trees in the forest predicting such instance as positive or negative. The explanation of is provided by means of the helpfulness of the features in the paths adopted for changing the instance outcome from negative to positive.

7 Solving the Black Box Outcome Explanation Problem

In this section we review the methods solving the black box outcome explanation problem (see Section 4). These methods provide a locally interpretable model which is able to explain the prediction of the black box in understandable terms for humans. This category of approaches using a local point of view with respect to the prediction is becoming the most studied in the last years. Section 7.1 describes the methods providing the salient parts of the record for which a prediction is required using Deep Neural Networks (DNNs), while Section 7.2 analyzes the methods which are able to provide a local explanation for any type of black box.

7.1 Explanation of Deep Neural Network via Saliency Masks

In the following works the opened black box is a DNN and the explanation is provided by using a Saliency Mask (SM), i.e. a subset of the original record which is mainly responsible for the prediction. For example, as salient mask we can consider the part of an image or a sentence in a text. A saliency image summarizes where a DNN looks into an image for recognizing their predictions. The function to extract the local explanation is always not generalizable and often strictly tied with the particular type of network, i.e., convolutional, recursive, etc.

The work [xu2015show] introduces an attention based model which automatically identifies the contents of an image. The black box is a neural network which consists of a combination of a Convolutional NN (CNN)

for the features extraction and a

Recursive NN (RNN) containing Long Short Term Memory (LSTM), nodes producing the image caption by generating a single word for each iteration. The explanation of the prediction is provided through a visualization of the attention (area of an image, see Figure 11-left) for each word in the caption. A similar result is obtained by Fong et al. in [fong2017interpretable]. In this work the authors propose a framework of explanations as meta-predictors. In their view, an explanation , and thus a meta-predictor, is a rule that predicts the response of a black box to certain inputs. Moreover, they propose to use saliency maps as explanations for black boxes to highlight the salient part of the images (see Figure 11-right).

Figure 11: Saliency Masks for explanation of deep neural network. (Left) From [xu2015show] the elements of the image highlighted. (Right) From [fong2017interpretable] the mask and the level of accuracy on the image considering and not considering the learned mask.

Similarly, another set of works produce saliency masks incorporating network activations into their visualizations. This kind of approaches are named Class Activation Mapping (CAM). In [zhou2016learning], global average pooling in CNN (the black box ) is used for generating the CAM. A CAM (the local explanation ) for a particular outcome label indicates the discriminative active region that identifies that label. [selvaraju2016grad] defines its relaxed generalization Grad-CAM which visualizes the linear combination of a late layer’s activations and label-specific weights (or gradients for [zhou2016learning]). All these approaches arbitrarily invoke different back propagation and/or activation, which results in aesthetically pleasing, heuristic explanations of image saliency. Their solution is not black box agnostic limited to NN, but it requires specific architectural modifications [zhou2016learning] or access to intermediate layers [selvaraju2016grad].

With respect to texts, in [lei2016rationalizing] the authors develop an approach which incorporates rationales as part of the learning process of . A rationale is a simple subset of words representing a short and coherent piece of text (e.g., phrases), and alone must be sufficient for the prediction of the original text. A rational is the local explanator and provides the saliency of the text analyzed, i.e., indicates the reason for a certain outcome.

7.2 Agnostic Explanator

In this section we present the agnostic solutions proposed for the black box outcome explanation problem implementing function such that any type of black box can be explained. All these approaches are generalizable by definition and return a comprehensible local predictor . Thus, they can be employed for diversified data types.

In [ribeiro2016should], Ribeiro et al. present the Local Interpretable Model-agnostic Explanations (LIME) approach which does not depend on the type of data, nor on the type of black box to be opened, nor on a particular type of comprehensible local predictor or explanation . In other words, LIME can return an understandable explanation for the prediction obtained by any black box. The main intuition of LIME is that the explanation may be derived locally from the records generated randomly in the neighborhood of the record to be explained, and weighted according to their proximity to it. In their experiments, the authors adopt linear models as comprehensible local predictor returning the importance of the features as explanation . As black box

the following classifiers are tested: decision trees, logistic regression, nearest neighbors, SVM and random forest. A weak point of this approach is the required transformation of any type of data in a binary format which is claimed to be human interpretable.

[ribeiro2016nothing] and [ribeiro2016model] propose extensions of LIME with an analysis of particular aspects and cases.

A similar approach is presented in [turner2016model], where Turner et al. design the Model Explanation System (MES) that augments black box predictions with explanations by using a Monte Carlo algorithm. In practice, they derive a scoring system for finding the best explanation based on formal requirements and consider that the explanations are simple logical statements, i.e., decision rules. The authors test logistic regression and SVMs as black box .

8 Solving the Black Box Inspection Problem

In this section we review the methods for opening the black box facing the black box inspection problem (see Section 4). Given a black box solving a classification problem, the inspection problem consists in providing a representation for understanding either how the black box model works or why the black box returns certain predictions more likely than others. In [seifert2017visualizations], Seifet et al. provide a survey of visualizations of DNNs by defining a classification scheme describing visualization goals and methods. They found that most papers use pixel displays to show neuron activations. As in the previous sections, in the following we propose a classification based on the type of technique used to provide the visual explanation of how the black box works. Most papers in this section try to inspect NNs and DNNs.

8.1 Inspection via Sensitivity Analysis

In this section we review the works solving the black box inspection problem by implementing function using Sensitivity Analysis (SA). Sensitivity analysis studies the correlation between the uncertainty in the output of a predictor and that one in its inputs [saltelli2002sensitivity]. All the following methods work on tabular datasets.

Sensitivity analysis for “illuminating” the black box was first proposed by Olden in [olden2002illuminating] where a visual method for understanding the mechanism of NN is described. In particular, they propose to assess the importance of axon connections and the contribution of input variables by means of sensitivity analysis and Neural Interpretation Diagram (NID) to remove not significant connections and improve the network interpretability.

In [baehrens2010explain] the authors propose a procedure based on Gaussian Process Classification (GDP) which allows explaining the decisions of any classification method through an explanation vector. That is, the procedure is black box agnostic. The explanation vectors are visualized to highlight the features that were most influential for the decision of a particular instance. Thus, we are dealing with an inspection for outcome explanation .

In [datta2016algorithmic], Datta et al. introduce a set of Quantitative Input Influence (QII) measures capturing how much inputs influence the outputs of black box predictors. These measures provide a foundation for transparency reports of black box predictors. In practice, the output consists in the feature importance for outcome predictions.

[sundararajan2017axiomatic] studies the problem of attributing the prediction of a DNN (the black box ) to its input features. Two fundamental axioms are identified: sensitivity and implementation invariance. These axioms guide the design of an attribution method , called Integrated Gradients (IG), that requires no modification to the original network. Differently from the previous work, this approach is tested on different types of data.

Finally, Cortez in [cortez2011opening, cortez2013using] uses sensitivity analysis based and visualization techniques to explain black boxes

. The sensitivity measures are variables calculated as the range, gradient, variance of the prediction. Then, the visualizations realized are barplots for the features importance, and

Variable Effect Characteristic curve (VEC) [cortez2009using] plotting the input values (x-axis) versus the (average) outcome responses (see Figure 12 - (left)).

Figure 12: (Left). From [cortez2011opening] VEC curve and histogram for the pH input feature (x-axis) and the respective high quality wine probability outcome (left of y-axis) and frequency (right of y-axis). (Right). From [krause2016interacting] Age at enrollment shown as line plot (top) and partial dependence bar (middle). Color denotes the predicted risk of the outcome.

8.2 Inspection via Partial Dependence

In this section we report a set of approaches solving the black box inspection problem by implementing a function which returns a Partial Dependence Plot (PDP)

. Partial dependence plot is a tool for visualizing the relationship between the response variable and predictor variables in a reduced feature space. All the approaches presented in this section are black box agnostic and are tested on tabular datasets.

In [hooker2004discovering], the authors present an approach aimed at evaluating the importance of non-additive interactions between any set of features. The implementation uses the Variable Interaction Network (VIN) visualization generated from the use of ANOVA statistical methodology (a technique to calculate partial dependence plots). VIN allows to visualize the importance of the features together with their interdependences. Goldstein et al. provide in [goldstein2015peeking] a technique which extends classical PDP named Individual Conditional Expectation (ICE) to visualize the model approximated by a black box that help in visualizing the average partial relationship between the outcome and some features. ICE plots improves PDP by highlighting the variation in the fitted values across the range of covariates. In [krause2016interacting], Krause et al. introduce random perturbations on the black box input values to understand to which extent every feature impact the prediction through a visual inspection using the PDPs . The main idea of Prospector is to observe how the output varies by varying the input changing one variable at a time. It provides an effective way to understand which are the most important features for a certain prediction so that it can help in providing a valuable interpretation (see Figure 12 - (right)). In [adler2016auditing] the authors propose a method for auditing (i.e., inspecting) black box predictors , studying to which extent existing models benefit of specific features in the data. This method does not assume any knowledge on the models behavior. In particular, the method focuses on indirect influence and visualizes the global inspection through an obscurity vs. accuracy plot (the features are obscured one after the other). Yet, the dependence of a black box on its input features is relatively quantified by the procedure proposed in [adebayo2016iterative], where the authors present an iterative procedure based on Orthogonal Projection of Input Attributes (OPIA), for enabling the interpretability of black box predictors.

8.3 Inspection via Other Approaches

In the following, we present solutions for the black box inspection problem that adopt an approach which can be categorized as none of the previous ones. They all refer to DNNs as black box and are not generalizable.

[yosinski2015understanding] proposes two tools for visualizing and interpreting DNNs and for understanding what computations DNNs perform at intermediate layers and which neurons activate. These tools visualize the activations of each layer of a trained CNN during the process of images or videos. Moreover, they visualize the features of the different layers by regularized optimization in image space. Yosinski et al. found that by analyzing the live activations, changing in correspondence of different inputs, helps to generate an explanation on the DNNs behave. [thiagarajan2016treeview] shows the extraction of a visual interpretation of a DNN using a decision tree. The method TreeView works as follows. Given the black box as a DNN, it first decomposes the feature space into K (user defined) potentially overlapping factors. Then, it builds a meta feature for each of the K clusters and a random forest that predicts the cluster labels. Finally, it generates and shows a surrogate decision tree from the forest as an approximation of the black box.

Shwartz-Ziv et al. in [shwartz2017opening] showed the effectiveness of the Information Plane

visualization of DNNs highlighting that the empirical error minimization of each stochastic gradient descent phase epoch is always followed by a slow representation compression.

Finally, it is worth mentioning that [radford2017learning]

presents the discovery that a single neuron unit of a DNN can perform alone a sentiment analysis task after the training of the network reaching the same level of performance of strong baselines. Also in

[zeiler2014visualizing], Zeiler et al. backtrack the network computations to identify which image patches are responsible for certain neural activations.

9 Solving the Transparent Box Design Problem

In this section we review the approaches designed to solve the classification problem using a transparent method which is locally or globally interpretable on its own, i.e., solving the transparent box design problem (see Section 4).

9.1 Explanation via Rule Extraction

In this section we present the most relevant state of the art works solving the transparent box design problem by means of comprehensible predictors based on rules. In these cases, is a comprehensible global predictor providing the whole set of rules leading to any possible decision: a global explanator is made available by . All the methods presented in this section work on tabular data.

In [yin2003cpar] the authors propose the approach named CPAR (Classification based on Predictive Association Rules) combining the positive aspects of both associative classification and traditional rule-based classification. Indeed, following the basic idea of FOIL [quinlan1993foil], CPAR does not generate a large set of candidates as in associative classification, and applies a greedy approach for generating rules directly from training data.

Wang and Rudin, in [wang2015falling] propose a method to extract falling rule lists (see Section 3.3) instead of classical rules. The falling rule lists extraction method relies on a Bayesian framework.

In [letham2015interpretable], the authors tackle the problem to build a system for medical scoring which is interpretable and characterized by high accuracy. To this end, they propose Bayesian Rule Lists (BRL) to extract the comprehensible global predictor as a decision list. A decision list consists of a series of if-then statements discretizing the whole feature space into a set of simple and directly interpretable decision statements.

A Bayesian approach is followed also in [su2015interpretable]. The authors propose algorithms for learning Two-Level Boolean Rules (TLBR) in Conjunctive Normal Form or Disjunctive Normal Form . Two formulations are proposed. The first one is an integer program whose objective function combines the total number of errors and the total number of features used in the rule. The second formulation replaces the 0-1 classification error with the Hamming distance from the current two-level rule to the closest rule that correctly classifies a sample. In [lakkaraju2017interpretable] the authors propose a method exploiting a two-level boolean rule predictor to solve the black box model explanation, i.e., the transparent approach is used in the reverse engineering approach to explain the black box.

Yet another type of rule is exploited in [lakkaraju2016interpretable]. Here, Lakkaraju et al. propose a framework for generating prediction models, which are both interpretable and accuratem, by extracting Interpretable Decision Sets (IDS) , i.e., independent if-then rules. Since each rule is independently applicable, decision sets are simple, succinct, and easily to be interpreted. In particular, this approach can learn accurate, short, and non-overlapping rules covering the whole feature space.

Rule Sets are adopted in [wang2016bayesian] as comprehensible global predictor . The authors present a Bayesian framework for learning Rule Sets. A set of parameters is provided to the user to encourage the model to have a desired size and shape in order to conform with a domain-specific definition of interpretability. A Rule Set consists of a small number of short rules where an instance is classified as positive if it satisfies at least one of the rules. The rule set provides reasons for predictions, and also descriptions of a particular class.

Finally, in [malioutov2017learning] an approach is designed to learn both sparse conjunctive and disjunctive

clause rules from training data through a linear programming solution. The optimization formulation leads the resulting rule-based global predictor

(1Rule) to automatically balance accuracy and interpretability.

9.2 Explanation via Prototype Selection

In this section we present the design of a set of approaches for solving the transparent box design problem returning a comprehensible predictor equipped with a human understandable global explanator function . A prototype, also referred to with the name artifact or archetype, is an object that is representative of a set of similar instances. A prototype can be an instance part of the training set , or can lie anywhere in the space of the dataset

. Having only prototypes among the observed points is desirable for interpretability, but it can also improve the classification error. As an example of a prototype we can consider the record minimizing the sum of the distances with all the other points of a set (like in K-Medoids) or the record generated averaging the value of the features of a set of points (like in K-Means)

[tan2006introduction]. Different definitions and requirements to find a prototype are specified in each work using the prototypes to explain the black box.

In [bien2011prototype], Bien et al. design the transparent Prototype Selection (PS) approach that first seeks for the best prototype (two strategies are proposed), and then assigns the points in to the label corresponding to the prototype. In particular, they face the problem of recognizing hand written digits. In this approach, every instance can be described by more than one prototype, and more than a prototype can refer to the same label (e.g., there can be more than one prototype for digit zero, more than one for digit one, etc.). The comprehensible predictor provides a global explanation in which every instance must have a prototype corresponding to its label in its neighborhood; no instances should have a prototype with a different label in its neighborhood, and there should be as few prototypes as possible.

Kim et al. in [kim2014bayesian, kim2015mind] design the Bayesian Case Model (BCM) comprehensible predictor able to learn prototypes by clustering the data and to learn subspaces. Each prototype is the representative sample of a given cluster, while the subspaces are set of features which are important in identifying the cluster prototype. That is, the global explanator

returns a set of prototypes together with their fundamental features. Possible drawbacks of this approach are the high number of parameters (e.g., number of clusters) and various types of probability distributions which are assumed to be correct for each type of data.

[kim2015ibcm] proposes an extension of BCM which exploits humans interaction to improve the prototypes. Finally, in [kim2016examples] the approach is further expanded to include criticisms, where a criticism is an instance that does not fit the model very well, i.e., a counter-example part of the cluster of a prototype.

With respect to prototypes and DNN, [mahendran2015understanding] proposes a method to change the image representations in order to use only information from the original image representation and from a generic natural image prior. This task is mainly related to image reconstruction rather than black box explanation, but it is realized with the aim of understanding the example to which the DNN is related to producing a certain prediction by realizing a sort of artificial image prototype. There is a significant amount of work in understanding the representation of DNN by means of artifact images, [kato2014image, vondrick2013hoggles, weinzaepfel2011reconstructing, zeiler2014visualizing].

We conclude this section presenting how [fong2017interpretable] deals with artifacts in DNNs. Finding a single representative prototype by perturbation, deletion, preservation, and similar approaches has the risk of triggering artifacts of the black box. As discussed in Section 8.3, NN and DNN are known to be affected by surprising artifacts. For example, [kurakin2016adversarial] shows that a nearly-invisible image perturbation can lead a NN to classify an object for another; [nguyen2015deep] constructs abstract synthetic images that are classified arbitrarily; [mahendran2015understanding] finds deconstructed versions of an image which are indistinguishable from the viewpoint of the DNN from the original image, and also with respect to texts [liang2017deep] inserts typos and random sentences in real texts that are classified arbitrarily. These examples demonstrate that it is possible to find particular inputs that can drive the DNN to generate nonsensical or unexpected outputs. While not all artifacts look “unnatural”, nevertheless they form a subset of images that are sampled with negligible probability when the network is normally operated. In our opinion, two guidelines should be followed to avoid such artifacts in generating explanations for DNNs, and for every black box in general. The first one is that powerful explanations should, just like any predictor, generalize as much as possible. Second, the artifacts should not be representative of natural perturbations.

9.3 Explanation via Other Approaches

In the following we present solutions for the transparent box design problem adopting approaches that can not be categorized as the previous ones. [kononenko2010efficient] describes a method

based on Naive Bayes aimed to explain individual predictions

of black boxes . The proposed approach exploits notions from

coalitional game theory

, and explains predictions utilizing the contribution of the value of different individual features (see Figure 2). The method is agnostic with respect to the black box used and is tested only on tabular data. Finally, in [wang2015trading] Wang et al. propose a method named OT-SpAMs based on oblique tree sparse additive models for obtaining a global interpretable predictor as a decision tree. OT-SpAMs divides the feature space into regions using a sparse oblique tree splitting and assigns local sparse additive experts (leaf of the tree) to individual regions. Basically, OT-SpAMs passes from complicated trees/linear models to an explainable tree .

10 Conclusion

In this paper we have presented a comprehensive overview of methods proposed in the literature for explaining decision systems based on opaque and obscure machine learning models. First, we have identified the different components of the family of the explanation problems. In particular, we have provided a formal definition of each problem belonging to that family capturing for each one the proper peculiarity. We have named these problems: black box model explanation problem, black box outcome explanation problem, black box inspection problem and transparent box design problem. Then, we have proposed a classification of methods studied in the literature which take into account the following dimensions: the specific explanation problem addressed, the type of explanator adopted, the black box model opened, and the type of data used as input by the black box model.

As shown in this paper, a considerable amount of work has already been done in different scientific communities and especially in the machine learning and data mining communities. The first one is mostly focused on describing how the black boxes work, while the second one is more interested into explaining the decisions even without understanding the details on how the opaque decision systems work in general.

The analysis of the literature conducted in this paper has led to the conclusion that despite many approaches have been proposed to explain black boxes, some important scientific questions still remain unanswered. One of the most important open problems is that, until now, there is no agreement on what an explanation is. Indeed, some works provide as explanation a set of rules, others a decision tree, others a prototype (especially in the context of images). It is evident that the research activity in this field completely ignored the importance of studying a general and common formalism for defining an explanation, identifying which are the properties that an explanation should guarantee, e.g., soundness, completeness, compactness and comprehensibility. Concerning this last property, there is no work that seriously addresses the problem of quantifying the grade of comprehensibility of an explanation for humans, although it is of fundamental importance. The study of measures able to capture this aspect is challenging because it also consider also aspects like the expertise of the user or the amount of time available to understand the explanation. The definition of a (mathematical) formalism for explanations and of tools for measuring how much an explanation is comprehensible for humans would improve the practical applicability of most of the approaches presented in this paper.

Moreover, there are other open research questions related to black boxes and explanations that are starting to be treated by the scientific community and that deserve attention and more investigation.

A common assumption of all categories of works presented in this paper is that the features used by the black box decision system are completely known. However, a black box might use additional information besides that explicitly asked to the user. For example, it might link the user’s information with different data sources for augmenting the data to be exploited for the prediction. Therefore, an important aspect to be investigated is to understand how an explanation might also be derived in cases where black box systems make decisions in presence of latent features. An interesting starting point for this research direction is the framework proposed in [lakkaraju2017selective] by Lakkaraju et al. for the evaluation of the prediction models performances on labeled data where the decision of decision-makers (either humans or black-boxes) is taken in the presence of unobserved features.

Another open research question is related to providing explanations in the field of recommender systems. When a suggestion is provided to a customer, it should come together with the reasons for this recommendation. In [mcsherry2005explanation] the authors define a case-based reasoning approach to generate recommendations with the opportunity of obtaining both the explanation of the recommendation process and of the produced recommendations.

Lastly, a further interesting point is the fact that explanations are important on their own and predictors might be learned directly from explanations. A starting study of this aspect is [krening2017learning] that presents a software agent learned to simulate the Mario Bros game only utilizing explanations rather than the logs of previous plays.

Appendix 0.A Supplementary Materials

Problem References
Model Explanation [craven1996extracting], [krishnan1999extracting], [boz2002extracting], [johansson2009evolving], [chipman1998making], [gibbons2013cad], [zhou2016interpreting], [schetinin2007confident], [hara2016making], [tan2016tree], [andrews1995survey], [craven1994using], [johansson2003rule], [zhou2003extracting], [augasta2012reverse], [nunez2002rule], [fung2005rule], [martens2007comprehensible], [deng2014interpreting], [lou2013accurate], [henelius2014peek], [krishnan2017palm], [tolomei2017interpretable]
Outcome Explanation [xu2015show], [fong2017interpretable], [zhou2016learning], [selvaraju2016grad], [lei2016rationalizing], [ribeiro2016nothing], [turner2016model]
Black Box Inspection [olden2002illuminating], [baehrens2010explain], [sundararajan2017axiomatic], [cortez2011opening], [hooker2004discovering], [goldstein2015peeking], [krause2016interacting], [adler2016auditing], [adebayo2016iterative], [yosinski2015understanding], [thiagarajan2016treeview], [shwartz2017opening], [radford2017learning]
Transparent Design [yin2003cpar], [wang2015falling], [letham2015interpretable], [su2015interpretable], [lakkaraju2016interpretable], [wang2016bayesian], [malioutov2017learning], [bien2011prototype], [kim2014bayesian], [mahendran2015understanding], [kononenko2010efficient], [wang2015trading]
Table 3: Summary of methods for opening and explaining black boxes with respect to the problem faced.
Explanator References
Decition Tree (DT) [craven1996extracting], [krishnan1999extracting], [boz2002extracting], [johansson2009evolving], [chipman1998making], [domingos1998knowledge], [gibbons2013cad], [zhou2016interpreting], [schetinin2007confident], [hara2016making], [tan2016tree], [krishnan2017palm], [thiagarajan2016treeview], [wang2015trading]
Decision Rules (DR) [andrews1995survey], [craven1994using], [johansson2003rule], [zhou2003extracting], [augasta2012reverse], [nunez2002rule], [fung2005rule], [martens2007comprehensible], [deng2014interpreting], [turner2016model], [yin2003cpar], [wang2015falling], [letham2015interpretable], [su2015interpretable], [lakkaraju2016interpretable], [wang2016bayesian], [malioutov2017learning]
Features Importance (FI) [lou2013accurate], [henelius2014peek], [tolomei2017interpretable], [ribeiro2016nothing]
Saliency Mask (SM) [xu2015show], [fong2017interpretable], [zhou2016learning], [selvaraju2016grad], [lei2016rationalizing]
Sensitivity Analysis (SA) [olden2002illuminating], [baehrens2010explain], [sundararajan2017axiomatic], [cortez2011opening]
Partial Dependence Plot (PDP) [hooker2004discovering], [goldstein2015peeking], [krause2016interacting], [adler2016auditing], [adebayo2016iterative]
Neurons Activation (NA) [yosinski2015understanding], [shwartz2017opening], [radford2017learning]
Prototype Selection (PS) [bien2011prototype], [kim2014bayesian], [mahendran2015understanding]
Table 4: Summary of methods for opening and explaining black boxes with respect to the explanator adopted.
Black Box References
Neural Network (NN) [craven1996extracting], [krishnan1999extracting], [boz2002extracting], [johansson2009evolving], [domingos1998knowledge], [andrews1995survey], [craven1994using], [johansson2003rule], [zhou2003extracting], [augasta2012reverse], [olden2002illuminating]
Tree Ensemble (TE) [chipman1998making], [domingos1998knowledge], [gibbons2013cad], [zhou2016interpreting], [schetinin2007confident], [hara2016making], [tan2016tree], [deng2014interpreting], [tolomei2017interpretable]
Support Vector Machines (SVM) [nunez2002rule], [fung2005rule], [martens2007comprehensible]
Deep Neural Network (DNN) [xu2015show], [fong2017interpretable], [zhou2016learning], [selvaraju2016grad], [lei2016rationalizing], [sundararajan2017axiomatic], [yosinski2015understanding], [thiagarajan2016treeview], [shwartz2017opening], [radford2017learning]
Agnostic black box (AGN) [lou2013accurate], [henelius2014peek], [krishnan2017palm], [ribeiro2016nothing], [turner2016model], [baehrens2010explain], [cortez2011opening], [hooker2004discovering], [goldstein2015peeking], [krause2016interacting], [adler2016auditing], [adebayo2016iterative]
Table 5: Summary of methods for opening and explaining black boxes with respect to the black box explained.


This work is partially supported by the European Community’s H2020 Program under the funding scheme “INFRAIA-1-2014-2015: Research Infrastructures”, grant agreement 654024, SoBigData, http://www.sobigdata.eu.