Explainability in Human-Agent Systems

by Avi Rosenfeld, et al.
Lev Academic Center (JCT)

This paper presents a taxonomy of explainability in Human-Agent Systems. We consider fundamental questions about the Why, Who, What, When and How of explainability. First, we define explainability, and its relationship to the related terms of interpretability, transparency, explicitness, and faithfulness. These definitions allow us to answer why explainability is needed in the system, whom it is geared to and what explanations can be generated to meet this need. We then consider when the user should be presented with this information. Last, we consider how objective and subjective measures can be used to evaluate the entire system. This last question is the most encompassing as it will need to evaluate all other issues regarding explainability.




1 Introduction

As the field of Artificial Intelligence matures and becomes ubiquitous, systems in which people and agents work together are increasingly common. These systems, often called Human-Agent Systems or Human-Agent Cooperatives, have moved from theory to reality in many forms, including digital personal assistants, recommendation systems, training and tutoring systems, service robots, chat bots, planning systems and self-driving cars [6, 9, 13, 16, 41, 65, 72, 80, 103, 104, 106, 107, 113, 120, 124, 132, 134, 141]. One key question surrounding these systems is the type and quality of the information that must be shared between the agents and the human users during their interactions.

This paper focuses on one aspect of this human-agent interaction: the internal level of explainability that agents using machine learning must have regarding the decisions they make. The overall goal of this paper is to provide an extensive study of this issue in Human-Agent Systems. Towards this goal, our first step is to formally and clearly define explainability in Section 2, as well as the concepts of interpretability, transparency, explicitness, and faithfulness that make a system explainable. Using these definitions, we provide a clear taxonomy of the Why, Who, What, When, and How of explainability and stress the relationship of interpretability, transparency, explicitness, and faithfulness to each of these issues.

Overall, we believe that the solutions presented to all of these issues need to be considered in tandem as they are intertwined. The type of explainability needed directly depends on the motivation for the type of human-agent system being implemented and thus directly stems from the first question about the overall reason, or reasons, for why the system must be explainable. Assuming that the system is human-centric, as is the case in recommendation [72, 141], training [132], and tutoring systems [6, 134], then the information will likely need to persuade the person to choose a certain action, for example through arguments about the agent’s decision [105], its policy [104] or presentation [10]. If the system is agent-centric, such as in knowledge discovery or self-driving cars, the agent might need to provide information about its decision to help convince the human participant of the correctness of its solution, aiding in the adoption of these agent-based technologies [102]. In both cases, the information the agent provides should build trust to ensure its decisions are accepted [48, 53, 65, 83]. Furthermore, these explanations might be necessary for legal considerations [37, 138]. In all cases we need to consider and then evaluate how these explanations were generated and presented, and whether their level of detail correctly matches the system’s need, something we address in Section 7.

This paper is structured as follows. First, in Section 2, we provide definitions for the terms of explainability, interpretability, transparency, fairness, explicitness and faithfulness and discuss the relationship between these terms. Based on these definitions, in Section 3 we present a taxonomy of three possibilities for why explainability might be needed, ranging from not helpful, to beneficial, to critical. In Section 4, we suggest three possible targets for who the explanation is geared for: “regular users”, “expert users”, or entities external to the users of the system. In Section 5, we address what mechanism is used to create explanations. We consider six possibilities: directly from the machine learning algorithm, using feature selection and analysis, through a tool separate from the learning algorithm to model all definitions, a tool to explain a specific outcome, visualization tools and prototype analysis. In Section 6 we address when the generated explanations should be presented: before, after and/or during the task execution. In Section 7 we introduce a general framework to evaluate explanations. Section 8 includes a discussion about the taxonomy presented in the paper, including a table summarizing previous works and how they relate. Section 9 concludes.

2 Definitions of Explainable Systems and Related Terms

Several works have focused on the definitions of a system’s explainability and also the related definitions of interpretability, transparency, fairness, explicitness and faithfulness. As we demonstrate in this section, of all of these terms, we believe that the objective of making a system explainable is the most central and important for three reasons. First, chronologically, this term was introduced first and thus has the longest research history. Second, and possibly due to the first factor, this is the most general term. As we explain in this section, a system’s level of explainability is created through the interpretations that the agent provides. These interpretable elements can be transparent, fair, explicit, and/or faithful. Last, and most importantly, this term connotes the key objective for the system: facilitating the human user’s understanding of the agent’s logic.

2.1 Theoretical Foundations for Explainability

A thorough study of the term explanation would need to start with Aristotle, as explanations and causal reasoning have been regarded as intrinsically intertwined since his time [61]. Specific to computer systems, as early as 1982, expert systems such as MYCIN and NEOMYCIN were developed for encoding the logical process within complex systems [24, 25]. The objective of these systems, as is still the case, was to provide a set of clear explanations for a complex process. However, no clear definition of what constituted an explanation was provided.

Work by Gregor and Benbasat in 1999 defined the nature of explainability within “intelligent” or “knowledge-based” systems as a “declaration of the meaning of words spoken, actions, motives, etc., with a view to adjusting a misunderstanding or reconciling differences” [51]. As they point out in their paper, this definition assumes that the explanation is provided by the provider of the information, in our case the intelligent agent, and that the explanation is geared to resolve some type of misunderstanding or disagreement. This definition is in line with other work that assumed that explanations were needed to help understand a system malfunction, an anomaly or to resolve conflict between the system and the user [47, 99, 117]. Given this definition, it is not surprising that the first agent explanations were basic reasoning traces that assume the user will understand the technical information provided, without taking a user other than the system designer into account. As these explanations are not typically processed beyond the raw logic of the system, they are referred to as “naïve explanations” by previous work [126]. In our opinion, explainability of this type is more appropriate for system debugging than for other uses.

Possibly more generally, the Philosophy of Science community also provided several definitions of explainability. Most similar to the previous definition, work by Schank [117] specifies that explanations address anomalies where a person is faced with a situation that does not fit her internalized model of the world. This type of definition can be thought of as goal-based, as the goal of the explanation is to address a specific need (e.g. disharmony within a user’s internalized model) [126]. Thus, explanations focus on an operational goal of addressing why the system isn’t functioning as expected.

A second theory by van Fraassen [133] claims that an explanation is always an answer to an implicit or explicit why-question comparing two or more possibilities. As such, an explanation provides information about why one possibility was chosen and not the other options [126, 133]. This definition suggests a minimum criterion any explanation must fulfill, namely that it facilitates a user choosing a specific option, as well as a framework for understanding explanations as answers to why-questions contrasting two or more states [126]. One limitation of this approach is that the provided explanation has no use beyond helping the user understand why the chosen possibility was preferable relative to other possibilities.

Most generally, a third theory by Achinstein [2] focuses on explanations as a process of communication between people. Here, the goal of an explanation is to provide the knowledge a recipient requests from a designated sender. Accordingly, this theory does not necessarily require a complete explanation if the system’s user does not require it. Consider a previously described example [127] in which a neural network is trained to compare two pictures of a certain type and gives a similarity measure, e.g. from 0 to 1, while most people cannot understand how it came up with this score. Presenting the pictures to the user so she can validate the similarity for herself can itself serve as an explanation. As the very definition of a proper explanation is dependent on the interaction between the sender and the receiver, such an explanation is sufficient. Similarly, explanations can be motivated by many situations and not exclusively van Fraassen’s why-questions. Conversely, a proper definition can and should be limited only to the information needed to address the receiver’s request.

2.2 The Need for Precisely Defining Explainability in Human-Agent Systems

Recently, questions have arisen as to the definition of explainability of machine learning and agent systems. An explosive growth of interest has been registered within various research communities, as is evident from workshops such as: Explanation-aware Computing (ExaCt), Fairness, Accountability, and Transparency (FAT-ML), Workshop on Human Interpretability in Machine Learning (WHI), Interpretable ML for Complex Systems, Workshop on Explainable AI, Human-Centred Machine Learning, and Explainable Smart Systems [1]. However, no consensus exists about the meaning of various terms related to explainability including interpretability, transparency, explicitness, and faithfulness. It has been pointed out that the Oxford English dictionary does not have a definition for the term “explainable” [36]. One suggested definition of an explanation, a “statement or account that makes something clear; a reason or justification given for an action or belief”, is not always true for systems that claim to be explainable [36]. Thus, providing an accepted and unified definition of explainability and other related terms is of great importance.

Part of the confusion likely stems from the fact that the terms “explainability”, “interpretability” and “transparency” are often used synonymously, while other researchers implicitly define these terms differently [36, 37, 48, 53, 87, 115]. Artificial intelligence researchers tend to use the term Explainable AI (XAI) [1, 54], and focus on how explainable an artificial intelligence system is without necessarily directly addressing the machine learning algorithms. For example, work on explainable planning, which its authors term XAIP, takes a system view of planning without considering any machine learning algorithms. They distance themselves from machine learning and deep learning systems, which they claim are still far from being explainable.


In contrast, the machine learning community often focuses on the “interpretability” of a machine learning system by focusing on how a machine learning algorithm makes its decisions and how interpretations can be derived either directly or secondarily from the machine learning component [37, 85, 111, 136, 139]. However, this term is equally poorly defined. In fact, one recent paper goes so far as to write that “at present, interpretability has no formal technical meaning” and that “the term interpretability holds no agreed upon meaning, and yet machine learning conferences frequently publish papers which wield the term in a quasimathematical way” [87]. In these papers, there is no technical difference between interpretable and explainable systems, as both terms refer to aspects of providing information to a human actor about the agent’s decision-making process. Previous work generally defined interpretability as the ability to explain or present the decisions of a machine learning system using understandable terms [37]. More technically, Montavon et al. propose that “an interpretation is the mapping of an abstract concept (e.g. a predicted class) into a domain that the human can make sense of”, which in turn forms explanations [96]. Similarly, Doran et al. define interpretability as “a system where a user cannot only see, but also study and understand how inputs are mathematically mapped to outputs.” To them, the opposite of interpretable systems are “opaque” or “black box” systems which yield no insight about the mapping between a decision and the inputs that yielded that decision [36].

Within the Machine Learning / Agent community, transparency has been informally defined to be the opposite of opacity or “blackbox-ness” [87]. In order to clarify the difference between interpretability and transparency, we build upon the definition of transparency as an explanation about how the system reached its conclusion [127]. More formally, transparency has been defined as a decision model where the decision-making process can be directly understood without any additional information [53]. It is generally accepted that certain decision models are inherently transparent and others are not. For example, decision trees, and especially relatively small decision trees, are transparent, while deep neural networks cannot be understood without the aid of an explanation tool outside that of the decision process [53]. We consider this difference in the next section and again in Section 5.

2.3 Formal Definitions for Explainability, Interpretability and Transparency in Human-Agent Systems

This paper’s first contribution is a clear definition for explainability and for the related terms: interpretability and transparency. In defining these terms we also define how explicitness and faithfulness are used within the context of Human-Agent Systems. A summary of these definitions is found in Table 1.

In defining these terms, we focus on the features and records that are used as training input in the system, the supervised targets that need to be identified, and the machine learning algorithm used by the agent. We define $L$ as the machine learning algorithm that is created from a set of training records, $R$. Each record $r \in R$ contains values for a tuple of ordered features, $F$. Each feature is defined as $f \in F$. Thus, the entire training set consists of $R \times F$. For example: Assume that the features are: $f_1$ = age (years), $f_2$ = height (cm), $f_3$ = weight (kg), so $F = (f_1, f_2, f_3)$. A possible record $r \in R$ might then hold one person's age, height and weight. While this model naturally lends itself to tabular data, it can as easily be applied to other forms of input such as texts, whereby the features $f$ are strings, or images whereby the features $f$ are pixels. The objective of $L$ is to properly fit $R \times F$ with regard to the labeled targets $t \in T$.
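The ingredients named above (ordered features, records, labeled targets, and a learning algorithm fitted to them) can be sketched in a few lines of Python. This is a minimal illustration only: the record values, the "low"/"high" targets, and the one-feature threshold learner are all hypothetical stand-ins and do not come from the paper.

```python
from typing import Callable, Sequence

# Hypothetical ordered features, named after the units used in the text.
F = ("age_years", "height_cm", "weight_kg")

# Hypothetical training records: one tuple of feature values per record.
R = [
    (35, 160, 70),
    (52, 175, 95),
    (23, 180, 72),
    (61, 168, 88),
]

# Hypothetical labeled targets, one per record (illustrative only).
T = ["low", "high", "low", "high"]

Record = Sequence[float]
Model = Callable[[Record], str]


def fit_threshold_learner(records, targets, feature_index) -> Model:
    """Fit a one-feature threshold rule; a toy stand-in for the learning algorithm."""
    best_threshold, best_correct = None, -1
    # Candidate thresholds are the observed values of the chosen feature.
    for candidate in sorted({r[feature_index] for r in records}):
        guesses = ["high" if r[feature_index] >= candidate else "low" for r in records]
        correct = sum(g == t for g, t in zip(guesses, targets))
        if correct > best_correct:
            best_threshold, best_correct = candidate, correct
    return lambda record: "high" if record[feature_index] >= best_threshold else "low"


# The fitted model maps any record over F to a predicted target.
L = fit_threshold_learner(R, T, feature_index=F.index("weight_kg"))
predictions = [L(r) for r in R]
```

Here the fitted rule reproduces every training target, which is what "properly fit" means for this toy data; real learners will rarely fit perfectly.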

| Term | Notation | Short Description |
| Feature | $F$ | One field within the input. |
| Record | $R$ | A collection of one item of information (e.g. picture, row in datasheet). |
| Target | $T$ | The labelled category to be learned. Can be categorical or numeric. |
| Algorithm | $L$ | The algorithm used to predict the value of $T$ from the collection of data (all features and records). |
| Interpretation | $\Im$ | A function that takes as its input $L$, $R \times F$ and $T$ and returns a representation of $L$’s logic. |
| Explanation | $E$ | The human-centric objective for the user to understand $L$ using $\Im$. |
| Explicitness | | The extent to which $\Im$ is understandable to the intended user. |
| Fairness | | The lack of bias in $L$ for a field of importance (e.g. gender, age, ethnicity). |
| Faithfulness | | The extent to which the logic within $\Im$ is similar to that of $L$. |
| Justification | | Why the user should accept $L$’s decision. Not necessarily faithful as no connection assumed between $\Im$ and $L$. |
| Transparency | | The connection between $\Im$ and $L$ is both explicit and faithful. |

Table 1: Notation and short definition of key concepts of explainability, interpretability, transparency, fairness, and explicitness in this paper. Concepts of features, records, targets and machine learning algorithms and explanations are also included as they define the key concepts.

We define explainability as the ability for the human user to understand the agent’s logic. This definition is consistent with several papers that considered the difference between explainability and interpretability within Human-Agent Systems. For example, Doran et al. define explainable systems as those that explain the decision-making process of a model using reasoning about the most human-understandable features of the input data [36]. Following their logic, interpretability and transparency can help form explanations, but are only part of the process. Guidotti et al. state that “an interpretable model is required to provide an explanation” [53], thus an explanation is obtained by the means of an interpretable model. Similarly, Montavon et al. define explanations as “a collection of features of the interpretable domain, that have contributed for a given example to produce a decision” [96].

Thus, the objective of any system is explainability, meaning it has an explanation $E$, which is the human-centric aim to understand the algorithm $L$. An explanation is derived based on the human user’s understanding about the connection between the targets $T$ and the input $R \times F$. The user will create $E$ based on her understanding of an interpretation function, $\Im$, that takes as its inputs $L$, $R \times F$ and $T$ and returns a representation of the logic within $L$ that can be understood. Consequently, in this paper we refer to explainability of systems as the understanding the human user has achieved from the explanation and do not use this term interchangeably with “interpretability” and “transparency”. We reserve use of terms “interpretability” and “transparency” as descriptions of the agent’s logic. Specifically, we define as:


We claim that the connection between the interpretation $\Im$ and $L$, $R \times F$, and $T$ will also determine the type of explanation that is generated. A globally explainable model provides an explanation for all outcomes within $T$ taking into consideration $R \times F$, thus using all information in: $L$, $R \times F$ and $T$. A locally explainable model provides explanations for a specific outcome, $t \in T$ (and by extension for specific records $r \in R$), using $L$, $r$ and $F$ as input.
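The global/local distinction can be made concrete with a small sketch. It assumes a hypothetical linear scoring model (the weights, bias and record below are invented for illustration): the global view describes the model as a whole, while the local view accounts for the score of one specific record.

```python
# Hypothetical linear model standing in for the learning algorithm:
# score = BIAS + sum(weight_i * value_i). All numbers are illustrative.
FEATURES = ("age_years", "height_cm", "weight_kg")
WEIGHTS = (0.5, -0.25, 1.0)
BIAS = -60.0


def predict_score(record):
    """Return the model's raw score for one record."""
    return BIAS + sum(w * x for w, x in zip(WEIGHTS, record))


def global_explanation():
    """Global view: rank features by absolute weight, independent of any record."""
    return sorted(zip(FEATURES, WEIGHTS), key=lambda fw: abs(fw[1]), reverse=True)


def local_explanation(record):
    """Local view: each feature's contribution to the score of this one record."""
    return {name: w * x for name, w, x in zip(FEATURES, WEIGHTS, record)}


record = (40, 160, 80)
contributions = local_explanation(record)
ranking = global_explanation()
```

Ranking raw weights is only a fair global summary when features share a scale, which is not the case here; the sketch is meant to show which inputs each kind of explanation consumes, not to recommend this ranking.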

We use three additional terms: explicitness, faithfulness and justification to quantify the relationship of the interpretation $\Im$ to the explanation $E$ and the algorithm $L$ respectively. Following recent work [5], we refer to explicitness as the level to which the output of $\Im$ is immediate and understandable. As we further explore in the next section, the level of explicitness depends on who the target of the explanation is and what is the level of her expertise at understanding $\Im$. It is likely that two users will obtain different values for $E$ even given the same value for $\Im$, making quantifying $\Im$’s explicitness difficult due to this level of subjectivity. We define faithfulness, also previously defined as fidelity [66, 102], as the degree to which the logic within $\Im$ is similar to that of $L$. Especially within less faithful models, a concept of completeness was recently suggested to refer to the ability of $\Im$ to provide an accurate description for all possible actions of $L$ [48]. Given the similarity of these terms, we only use the term faithful due to its general connotation. Justification was previously defined as an explanation about why a decision is correct without any information about the logic about how it was made [16]. According to this definition, justifications can be generated even within non-interpretable systems. Consequently, justification requires no connection between $\Im$ and $L$ and no faithfulness. Instead, justification methods are likely to provide implicit or explicit arguments about the correctness of the agent’s decision, such as through persuasive argumentation [144].
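One common way to put a number on faithfulness is to measure how often a simple interpretation reproduces the output of the model it describes. The sketch below takes that agreement-rate reading as an assumption; the text defines faithfulness only qualitatively. Both the opaque model and the surrogate rule are hypothetical.

```python
def black_box(record):
    """Hypothetical opaque model: a nonlinear rule the user cannot inspect."""
    age, height_cm, weight_kg = record
    bmi = weight_kg / (height_cm / 100) ** 2
    return "high" if bmi >= 30 or (age >= 60 and bmi >= 27) else "low"


def surrogate(record):
    """Hypothetical simple interpretation of the model: one weight threshold."""
    return "high" if record[2] >= 90 else "low"


def fidelity(model, interpretation, records):
    """Fraction of records on which the interpretation matches the model's output."""
    agreements = sum(model(r) == interpretation(r) for r in records)
    return agreements / len(records)


# Illustrative records as (age_years, height_cm, weight_kg).
records = [
    (35, 160, 70),
    (52, 175, 95),
    (23, 180, 72),
    (61, 168, 88),
]
score = fidelity(black_box, surrogate, records)
```

On these four records the surrogate agrees with the model three times out of four. A model used as its own interpretation scores 1.0 by construction, which corresponds to the fully faithful case.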

In order for a model to be transparent, two elements are needed: the decision-making model must be readily understood by the user, and that explanation must map directly to how the decision is made. More precisely, a transparent explanation is one where the connection between the interpretation $\Im$ and the algorithm $L$ is explicit and faithful as the logic within $\Im$ is readily understandable and identical to $L$, e.g. $\Im = L$. When a tool or model is used to provide information about the decision-making process secondary to $L$, the system contains elements of interpretability, but not transparency.
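A small decision tree illustrates what it means for the interpretation to be identical to the decision logic: the path the model follows is itself the explanation. The tree below is hand-written, and its thresholds are hypothetical values chosen only for illustration.

```python
def transparent_classifier(record):
    """Tiny hand-written decision tree; returns the decision and the path taken.

    The thresholds are hypothetical and chosen only for illustration.
    """
    age, height_cm, weight_kg = record
    path = []
    if weight_kg >= 90:
        path.append("weight_kg >= 90")
        return "high", path
    path.append("weight_kg < 90")
    if age >= 60:
        path.append("age_years >= 60")
        return "high", path
    path.append("age_years < 60")
    return "low", path


# The recorded path is exactly the logic that produced the decision.
decision, path = transparent_classifier((61, 168, 88))
explanation = " and ".join(path) + " -> " + decision
```

Because the path is produced by the same code that makes the decision, no separate tool is needed and the explanation cannot drift from the model's logic.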

Section 5 discusses the different types of interpretations that can be generated, including transparent ones. Non-transparent interpretations will lack faithfulness, explicitness, or both. Examples include tools to create model and outcome interpretations, feature analysis, visualization methods and prototype analysis. Each of these methods will focus on different parameters within the input, $R \times F$, and their relationship to the targets $T$. Model and outcome interpretation tools create the interpretation $\Im$ without a direct connection to the logic in the algorithm $L$. Feature Analysis is a method of providing interpretations via analyzing a subset of features $f \in F$. Prototype selection is a method of providing interpretations via analyzing a subset of records $r \in R$. Visualization tools are used to understand the connection between the input $R \times F$ and the targets $T$, and thus $\Im$ takes this interpretable form.
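Two of the non-transparent methods listed above can be sketched briefly. The first function is a simple feature-ablation form of feature analysis, and the second is a nearest-record form of prototype selection. Both are illustrative choices made here, not the specific techniques surveyed in Section 5, and the model and data are hypothetical.

```python
import math

FEATURES = ("age_years", "height_cm", "weight_kg")
RECORDS = [(35, 160, 70), (52, 175, 95), (23, 180, 72), (61, 168, 88)]
TARGETS = ["low", "high", "low", "high"]


def model(record):
    """Hypothetical stand-in for the learning algorithm: a fixed weight threshold."""
    return "high" if record[2] >= 85 else "low"


def accuracy(predict, records, targets):
    """Share of records whose prediction equals the labeled target."""
    return sum(predict(r) == t for r, t in zip(records, targets)) / len(records)


def feature_ablation(records, targets):
    """Feature analysis: accuracy drop when a feature is replaced by its mean."""
    baseline = accuracy(model, records, targets)
    drops = {}
    for i, name in enumerate(FEATURES):
        mean_value = sum(r[i] for r in records) / len(records)
        ablated = [r[:i] + (mean_value,) + r[i + 1:] for r in records]
        drops[name] = baseline - accuracy(model, ablated, targets)
    return drops


def nearest_prototype(query, records, targets):
    """Prototype selection: the training record (and target) closest to the query."""
    index = min(range(len(records)), key=lambda i: math.dist(query, records[i]))
    return records[index], targets[index]


drops = feature_ablation(RECORDS, TARGETS)
prototype, prototype_target = nearest_prototype((50, 174, 93), RECORDS, TARGETS)
```

The ablation works over features and reports that only weight matters to this model, while the prototype works over records and answers a new case by pointing at a similar known one.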

To help visualize the relationship between explainability, interpretability and transparency, please note Figure 1. Note that interpretability includes six methods, including transparent models, and also the non-transparent possibilities of model and outcome tools, feature analysis, visualization methods, and prototype analysis. In the figure, interpretability points to the objective of explainability to signify that interpretability is a means for providing explainability, as per these terms’ definitions in Table 1. Note the overlaps within the figure. Feature analysis can serve as a basis for creating transparent models, on its own as a method of interpretability, or as an interpretable component within model, outcome and visualization tools. Similarly, visualization tools can help explain the entire model as a global solution or as a localized interpretable element for specific outcomes of $T$. Prototype analysis uses the records $R$ as the basis for interpretability, and not the features $F$, and can be used for visualization and/or outcome analysis of specific records $r \in R$. We explore these points further in Section 5.

Figure 1: A Venn Diagram of the relationship between Explainability, Interpretability and Transparency. Notice the centrality of Feature Analysis to 4 of the 5 interpretable elements.

The level of interpretability and transparency needed within an explanation will be connected to either hard or soft constraints defined by the user’s requirements. At times, there may be a hard constraint based on a legal requirement for transparency, or a soft constraint that transparency exist in cases where one suspects that the agent made a mistake or does not understand why the agent chose one possibility over others [51, 53, 117, 133, 138]. Explainability can be important for other reasons, including building trust between the user and system even when mistakes were not made [21], something we now explore.

3 Why Should a Human-Agent System be Explainable?

We believe that the single most important question one must ask about this topic is Why we need an explanation, and how important it is for the user to understand the agent’s logic. In answering this question one must establish whether a system truly needs to be explainable. We posit that one can generalize the need for explainability with a taxonomy of three levels:

  1. Not helpful

  2. Beneficial

  3. Critical

Adjustable autonomy is a well-established concept within human-agent and human-robot groups that refers to the amount of control an agent/robot has compared to the human user [50, 116, 143]. Under this approach, the need for explainability can be viewed as a function of the degree of cooperation between the agent and the human user. Assuming the agent is fully controlled by the human operator (e.g. teleoperated), then no explainability is needed as the agent is fully an extension of the human participant. Conversely, if the robot is given full control, particularly if the reason for the decision is obvious (e.g. a recommendation agent gives advice based on a well-established collaborative filtering algorithm), it again stands to reason that no explainability is needed. Additionally, Doshi-Velez and Kim pointed out that an explanation at times is not needed if there are no significant consequences for unacceptable results or the agent’s decision is universally accepted and trusted [37].

At the other extreme, many Human-Agent Systems are built whereby the agent’s role is to support a human’s task. In many of these cases, we argue that the agent’s explanation is a critical element within the system. The need for an agent to be transparent or to explicitly and faithfully explain its actions is tied directly to task execution. For example, Intelligent Tutoring Systems (ITS) typically use step-based granularities of interaction whereby the agent confirms one skill has been learned or uses hints to guide the human participant [134]. The system must provide concrete explanations for its guidance (called hints in ITS terminology) to better guide the user. Similarly, explanations form a critical component of many negotiation, training, and argumentation systems [101, 105, 109, 124, 132]. For example, effective explanations might be critical to aid a person in making the final life-or-death decision within Human-Agent Systems [124]. Rosenfeld et al.’s NegoChat-A negotiation agent uses arguments to present the logic behind its position [109]. Traum et al. explained the justification within choices of their training agent to better convince the trainee, as well as to teach the factors to look at in making decisions [132]. Rosenfeld and Kraus created agents that use argumentation to better persuade people to engage in positive behaviors, such as choosing healthier foods to eat [105]. Azaria et al. demonstrate how an agent that learns the best presentation method for proposals given to a user improves their acceptance rate [10]. Many of these systems can be generally described as Decision Support Systems (DSS). A DSS is typically defined as helping people make semi-structured decisions requiring some human judgment and at the same time with some agreement on the solution method [3]. An agent’s effective explanation is critical within a DSS as the system’s goal is providing the information to help facilitate improved user decisions.

A middle category in our taxonomy exists when an explanation is beneficial, but not critical. The Merriam-Webster dictionary defines beneficial as something that “produces good or helpful results” (https://www.merriam-webster.com/dictionary/benefit). In general, the defining characteristic of explanations within this category is that they are not needed in order for the system to behave optimally or with peak efficiency.

To date, many reasons have been suggested for making systems explainable [1, 36, 37, 51, 53, 87, 127]:

  1. To justify its decisions so the human participant can decide to accept them (provide control)

  2. To explain the agent’s choices to guarantee safety concerns are met

  3. To build trust in the agent’s choices, especially if a mistake is suspected or the human operator does not have experience with the system

  4. To explain the agent’s choices to ensure fair, ethical, and/or legal decisions are made

  5. Knowledge / scientific discovery

  6. To explain the agent’s choices to better evaluate or debug the system in previously unconsidered situations

The importance of these types of explanations will likely vary greatly across systems. If the user will not accept the system without this explanation, then a critical need for explainability exists. This can particularly be the case in Human-Agent Systems where the agent supports a life-or-death task, such as search and rescue or medical diagnostic systems, where ultimately the person is tasked with the final decisions [65]. In these types of tasks the explanation is critical to facilitate a person’s decision whether to accept the agent’s suggestion and/or to allow that person to decide if safety concerns are met, such as a patient’s health or that of a person at-risk in a rescue situation. In other situations, explanations are beneficial for the overall function of the human-agent system, but are not critical.

One key and common example where explanations can range in significance from critical to beneficial is when explanations help instill trust. Previous work on trust between people in a work situation identified two types of trust that develop over time, “knowledge-based” and “identification-based” [86]. Of the two types of trust, they claim that knowledge-based trust requires less time, interactions and information to develop as it is grounded primarily in the other party’s predictability. Identification-based trust requires a mutual understanding about the other’s desires and intention and requires more information, interactions and time to develop.

We posit that previous work has focused on elements of this trust model in identifying what types of explanations are necessary to foster this type of trust within Human-Agent Systems. Following our previous definitions of interpretability and transparency, it seems that the former type of interpretable elements may be sufficient for knowledge-based definitions of trust, while transparent elements are required for identification-based models. When a person has not yet developed enough positive experience with the agent she interacts with, both knowledge-based and identification-based trust are missing. As it has been previously noted that people are slow to adopt systems that they do not understand and trust and “if the users do not trust a model or a prediction they will not use it” [102], even providing a non-transparent interpretable explanation will likely help instill confidence about the system’s predictability, thus facilitating the user’s knowledge-based trust in the system. Ribeiro et al. demonstrate how interpretability of this type is important for identifying models that have high accuracy for the wrong reasons [102]. For example, they show that text classifiers are often wrongly based on the heading rather than the content. In contrast, image classifiers that capture the main part of the image in a similar manner to the human eye instill a feeling that the model is functioning correctly even if accuracy is not particularly high.

However, it has been claimed that when the person suspects the agent has made a mistake and/or is unreliable, the agent should act with transparency, and not merely be interpretable, as explanations generated from transparent methods will help the user trust the agent in the future [41]. In extreme cases, if the user completely disregards the agent, then the human-agent system breaks down, making transparent explanations critical to help restore trust. Furthermore, explanations based on L’s transparency may be needed to help facilitate the higher level of identification-based trust. Only transparent interpretations link directly to L’s logic, thus providing full information about the agent’s intention. We suggest that systems requiring this higher level of trust, such as health-care [32], recommender systems [73], planning [41] and human-robot rescue systems [104, 113], should be designed to be transparent, and not merely interpretable.

Other types of explanations are geared towards people beyond the immediate users of the system. Examples of these types of explanations include those designed for legal and policy experts to confirm that the decisions / actions of the agent fulfill legal requirements such as being fair and ethical [36, 37, 39, 45, 53]. Both the EU and UK governments have adopted guidelines requiring agent designers to provide users with information about agents’ decisions. In the words of the EU’s “General Data Protection Regulation” (GDPR), users are legally entitled to obtain “meaningful explanation of the logic involved” of these decisions. Additional legislation exists to ensure that agents are not biased against any ethnic or gender groups [37, 53] such that they demonstrate fairness [39]. Similarly, the ACM has published guidelines for algorithmic accountability and transparency [45]. Here, the system’s explanation is not critical for effective performance of the agent, but is instead needed to confirm that a secondary legal requirement is being met.

Explanations geared beyond the immediate user can also be those geared for researchers to help facilitate scientific knowledge discovery [37, 53] or for system designers to evaluate or test a system [37, 126, 127]. For example, a medical diagnostic system may work with peak efficiency exclusively as a black box, and users may be willing to rely on this black box as the agent is trusted due to an exemplary historical record. Nonetheless, explanations can still be helpful for knowledge discovery by helping researchers gain an understanding of various medical phenomena. Explainability has also been suggested as being necessary for properly evaluating a system or for the agent’s designer to confirm that the system is properly functioning, even within situations that were not considered when the agent was built. For example, Doshi-Velez and Kim claimed that due to the inherent inability to quantify all possible situations, it is impossible for a system designer to evaluate an agent in all possible situations [37]. Explanations can be useful in these situations to help make evident any possible gaps between an agent’s formulation and implementation and its performance. In all cases, the explanation is not geared to the end-user of the system, but rather to an expert user who requires the explanation for a reason beyond the day-to-day operation of the system.

As we have shown in this section, the question about explainability can be divided into questions about its necessity, e.g. not necessary, beneficial or critical, which is directly connected to the objective of that explanation. From a user-perspective, the primary objective of the explanation is related to factors that help her use the system, and particularly elements that help foster trust. In these cases, a system may need to be transparent, even if this level of explanation entails a sacrifice of the system’s performance. We further explore this possibility and relationship in Section 5. At times, explanations are needed or beneficial for entities beyond the typical end-user such as for the designer, researcher or legal expert. As the objective of explanations of this type is different, it stands to reason that the type of explanation may be fundamentally different based on whom the target is for this information, something we address in the next section. This in turn may impact the type of interpretation the agent must present, something we explore in Section 5.

4 Who is the Target of the Explanation?

The type of interpretable element needed to form the basis of the explanation is highly dependent on the question of who the explanation is for. We suggest three possibilities:

  1. Regular user

  2. Expert user

  3. External entity

The level of explanation detail needed depends on why the Human-Agent System needs the user to understand the agent’s logic (Section 3) and how the explanation has been generated (Section 5). If the need for explanation is for legal purposes, then it follows that legal experts need the explanation, and not the regular user. Similarly, it stands to reason that the type of explanation that is given should be directed specifically to this population. If the purpose of the explanation is to support experts’ knowledge discovery, then it stands to reason that the explanation should be directed towards researchers with knowledge of a specific problem. In these cases, the system might not even need to present its explanations to the regular users and may thus only focus on presenting information to these experts. Most systems will still likely benefit from directing explanations to the regular users to help them better understand the system’s decisions, thus aiding in their acceptance and/or trust. In these cases, the system should focus on providing justifications in addition to providing the logic behind its decisions, through arguments [105, 144] and/or through Case Based Reasoning [27, 70, 79] that help reassure the user about the correctness of the agent’s decision.

The same explanation might be considered extremely helpful by a system developer, but useless by a regular user. Thus, the expertise level of the target will play a large part in defining an explanation and how explicit it is. Deciding on how to generate and present explanations will be covered in later sections.

Similarly, what level of detail constitutes an adequate explanation likely depends on precisely how long the user will study the explanation. If the goal is knowledge discovery and/or complying with legal requirements, then an expert will likely need to spend large amounts of time meticulously studying the inner workings of the decision-making process. In these cases, it seems likely that great amounts of detail regarding the justification of the explanations will be necessary. If a regular user is assumed, and the goal is to build user trust and understanding, then shorter, very directed explanations are likely more beneficial. This touches upon the larger issue of the danger that additional information may overload a given user [122].

At times, the recipient of the explanation is not the user directly interacting with the system. This is true in cases where explanations are mandated by an external regulatory entity, such as is proposed by the EU’s GDPR regulation. In this case, the system must follow explanation guidelines provided by the external entity. In contrast, developers providing explanations to users will typically follow different guidelines, such as usability studies. As these two types of explanations are not exclusive, it is possible that the agent will generate multiple types of explanations for the different targets (e.g. the user and the regulatory entity). In certain types of systems, such as security systems, multiple potential targets of the explanation also exist. Vigano and Magazzeni explain that security systems have many possible targets, such as the designer, the attacker, and the analyst [137]. Obviously an explanation provided for the designer can be very dangerous in the hands of an attacker. Thus, aside from the question of how “helpful” an explanation is for a certain type of user, one must consider what the implications of providing an unsuitable explanation are. In these cases, the explanation must be provided for a given user while also considering the implications for the system’s security goals.

5 What Interpretation can be Generated?

Once we have established the why and who about explanations, a key related question one must address is what interpretation can be generated as the basis for the required explanation. Different users will need different types of explanations, and the interpretations required for effective explanations will differ accordingly [137]. We posit that six basic approaches exist as to how interpretations can be generated:

  1. Directly from a transparent machine learning algorithm

  2. Feature selection and/or analysis of the inputs

  3. Using an algorithm to create a post-hoc model tool

  4. Using an algorithm to create a post-hoc outcome tool

  5. Using an interpretation algorithm to create a post-hoc visualization of the agent’s logic

  6. Using an interpretation algorithm to provide post-hoc support for the agent’s logic via prototypes

In Figure 2 we describe how these various methods for generating interpretations have different degrees of faithfulness and explicitness. Each of these methods contains some level of trade-off between explicitness and faithfulness. For example, as described in Section 2.3, transparent models are inherently more explicit and faithful than other possibilities. Nonetheless, we present this figure only as a guideline, as many implementations and possibilities exist within each of these six basic approaches. These differences will impact the levels of both faithfulness and explicitness, something we indicate via the arrows pointing to higher levels of both faithfulness and explicitness for a specific implementation.

Figure 2: Faithfulness versus explicitness within the six basic approaches for generating interpretations

5.1 Generating Transparent Interpretations Directly from Machine Learning Algorithms

The first approach, and the most explicit and faithful method, is to generate the interpretation directly from the output of the machine learning algorithm, L. These types of interpretations can be considered ante-hoc, or “before this” (e.g. before an explanation is needed), as this type of connection between the interpretation and L facilitates providing interpretations at any point, including as the task is being performed [4, 62]. These transparent algorithms, often called white box algorithms, include decision trees, rule-based methods, k-nn (k-nearest neighbor), Bayesian models and logistic regression [38]. As per our definitions in Section 2, these algorithms have not been designed for generating interpretations, but interpretations can be readily derived from the understandable logic inherent in the algorithms. As we explain in this section, all of these algorithms are faithful, and are explicit to varying degrees. A clear downside to these approaches is that one is then limited to these machine learning algorithms, and/or a specific algorithmic implementation. It has been previously noted that an inverse relationship often exists between machine learning algorithms’ accuracy and their explainability [53, 54]. Black box algorithms, especially deep neural networks but also other less explainable algorithms such as ensemble methods and support vector machines, are often used due to their exceptional accuracy on some problems. However, it is difficult to glean explicit interpretations from these types of algorithms, and they are typically not transparent [38]. Figure 3 is based on previous work [33, 54] and quantifies the general relationship between algorithms’ explicitness and accuracy. This figure describes the relationships as they stand at the time of writing, and may change as algorithmic solutions develop and evolve. Additionally, this figure may be somewhat over-simplified, as we now describe.

Figure 3: Typical trade-off between prediction accuracy versus explicitness

Decision trees are often cited as the most understandable (i.e. explicit) models [33, 37, 42, 54, 100]. The hierarchical structure inherent in decision trees lends itself to understanding which attributes are most important, of second-most importance, etc. [42]. Furthermore, assuming the size of the tree is relatively small due to Occam’s Razor [98], the if-then rules that can be derived directly from decision trees are both particularly explicit and faithful [42, 108].

However, in practice not all decision trees are easily understood. Large decision trees with hundreds of nodes and leaves are often more accurate than smaller ones, despite the assumption inherent within Occam’s Razor [98]. Such trees are less explicit, especially if they contain many attributes and/or multiple instances of nodes using the same attribute for different conditions. Assuming the decision tree is too large to fully understand (e.g. thousands of rules) [58] and/or overfitted due to noise in the training data [42], it will lose its explicitness. One approach to address this issue is suggested by Last and Maimon [81] where they reason about the added value of added attributes versus the complexity they add, facilitating more explicit models.
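As an illustrative sketch of how if-then rules follow directly from a tree's structure, the snippet below walks every root-to-leaf path of a small, hand-built tree. The tree, its attribute names and its thresholds are hypothetical and are not taken from any of the cited works.

```python
# Hypothetical toy tree: each internal node tests one attribute against a threshold.
TREE = {
    "attr": "blood_pressure", "thr": 140,
    "le": {"attr": "age", "thr": 60,
           "le": {"label": "low risk"},
           "gt": {"label": "medium risk"}},
    "gt": {"label": "high risk"},
}

def extract_rules(node, conditions=()):
    """Return one if-then rule per leaf by walking every root-to-leaf path."""
    if "label" in node:
        premise = " and ".join(conditions) or "always"
        return [f"if {premise} then {node['label']}"]
    attr, thr = node["attr"], node["thr"]
    return (extract_rules(node["le"], conditions + (f"{attr} <= {thr}",))
            + extract_rules(node["gt"], conditions + (f"{attr} > {thr}",)))

rules = extract_rules(TREE)
for rule in rules:
    print(rule)
```

The number of rules equals the number of leaves, and the length of each rule equals the depth of its leaf, which is one way to see why very large trees lose explicitness.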

Classification rules [26, 92] have also been suggested as a highly explicit machine learning model [33, 42, 54]. As is the case with decision trees, the if-then rules within such models provide faithful interpretations and are potentially explicit. The flat, non-hierarchical structure in such models can be an advantage in allowing the user to focus on individual rules separately which at times has been shown to be advantageous [24, 42]. However, in contrast to decision trees, this structure does not inherently give a person insight as to the relative importance of the rules within the system. Furthermore, conflicts between rules need to be handled, often through an ordered rule-list, adding to the model’s complexity and reducing its level of explicitness.

Nearest neighbor algorithms, such as k-nn, can potentially be transparent machine learning models as they can provide interpretations based on the similarity between an item needing interpretation and other similar items. This is reminiscent of the picture classification example in Section 2, as the person is actually performing an analysis similar to k-nn in understanding why certain pictures are similar. This process is also similar to the logic within certain Case Based Reasoning algorithms, which often use logic akin to k-nn algorithms to provide an interpretation for why two items are similar [127]. However, as has been previously pointed out, these interpretations are typically only explicit if k is kept small, e.g. k=1 or close to 1 [127]. Furthermore, k-nn is a lazy model that classifies each new instance separately. As such, every instance could potentially have a different “interpretation”, making this a local interpretation. In contrast, both decision trees and rule-based systems construct general rules that are to be applied across all instances [42]. In addition, if the number of attributes in the dataset is very large, it might be difficult for a person to appreciate the similarities and differences between different instances, again reducing the explicitness of the model.
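The local nature of a k-nn interpretation can be sketched as follows: the records that determined the vote are returned alongside the prediction and serve as the interpretation. This is a minimal sketch under assumptions; the data, the Euclidean distance and the function name `knn_explain` are all illustrative choices.

```python
import math

# Hypothetical labelled records: (features, label).
TRAIN = [
    ((1.0, 1.0), "A"),
    ((1.2, 0.8), "A"),
    ((4.0, 4.2), "B"),
    ((3.8, 4.0), "B"),
    ((4.1, 3.9), "B"),
]

def knn_explain(query, train, k=1):
    """Classify by majority vote and return the neighbours that produced the vote."""
    ranked = sorted(train, key=lambda rec: math.dist(query, rec[0]))
    neighbours = ranked[:k]
    votes = {}
    for _, label in neighbours:
        votes[label] = votes.get(label, 0) + 1
    prediction = max(votes, key=votes.get)
    return prediction, neighbours

pred, support = knn_explain((1.1, 1.0), TRAIN, k=1)
print(pred, "because the closest stored record is", support[0])
```

With k=1 the interpretation is a single comparable record; as k grows, the user must compare the query against k records at once, which is the loss of explicitness noted above.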

Bayesian network classifiers have also been suggested as another transparent machine learning model. Knowing the probability of a successful outcome is often needed in many applications, something that probabilistic models, including Bayesian models, excel at [14]. Bayesian models have been previously suggested to be the most transparent of these types of models as each attribute can be independently analyzed and the relative strength of that attribute be understood [42]. This approach is favored in many medical applications for this reason [17, 75, 82]. More complex, non-naïve Bayesian models can be constructed [22] although one may then potentially lose both model accuracy and transparency.
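To illustrate why each attribute can be analyzed independently in the naive case, the sketch below splits the log-odds of a two-class naive Bayes prediction into one additive term per attribute. All probabilities are hand-specified, hypothetical values rather than estimates from data.

```python
import math

# Hypothetical, hand-specified probabilities (not learned from data).
PRIOR = {"sick": 0.3, "healthy": 0.7}
LIKELIHOOD = {  # P(attribute is present | class)
    "fever": {"sick": 0.8, "healthy": 0.1},
    "cough": {"sick": 0.6, "healthy": 0.3},
}

def log_odds_contributions(observed):
    """Split the naive Bayes log-odds of 'sick' vs 'healthy' into per-attribute terms."""
    terms = {"prior": math.log(PRIOR["sick"] / PRIOR["healthy"])}
    for attr, present in observed.items():
        p_sick = LIKELIHOOD[attr]["sick"]
        p_healthy = LIKELIHOOD[attr]["healthy"]
        if not present:
            p_sick, p_healthy = 1 - p_sick, 1 - p_healthy
        terms[attr] = math.log(p_sick / p_healthy)
    return terms

terms = log_odds_contributions({"fever": True, "cough": False})
total = sum(terms.values())
posterior_sick = 1 / (1 + math.exp(-total))  # convert log-odds back to a probability
print(terms, round(posterior_sick, 3))
```

A positive term pushes the prediction towards "sick" and a negative term towards "healthy", so the sign and size of each term indicate the relative strength of that attribute.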

Similar to Bayesian models, logistic regression also outputs outcome probabilities by fitting the output of its regression model to values between 0 and 1. The logit function inherent in this model is also constructed from probabilities, here in the form of log-odds / odds-ratios. This makes this model popular for creating medical applications [12, 68]. At times, the interpretations that can be generated by these relationships are explicit [38].
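For reference, the standard relationship between the predicted probability and the log-odds can be written as follows. The symbols $x_j$ (features), $\beta_j$ (coefficients) and $y$ (binary outcome) are notation assumed here for illustration and do not appear in the original text.

```latex
\[
P(y = 1 \mid x) = \frac{1}{1 + e^{-\left(\beta_0 + \sum_{j} \beta_j x_j\right)}},
\qquad
\log \frac{P(y = 1 \mid x)}{1 - P(y = 1 \mid x)} = \beta_0 + \sum_{j} \beta_j x_j .
\]
```

Under this form, increasing $x_j$ by one unit while holding the other features fixed multiplies the odds by $e^{\beta_j}$, which is the odds-ratio reading that makes the coefficients interpretable.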

Support Vector Machines (SVM) are based on finding a hyperplane that separates different instances, and are potentially explicit, particularly if a linear kernel is used [14]. Once again, if many attributes exist in the model, the explicitness of the model might be limited even if a linear kernel is used. An SVM becomes even less explicit if more complex kernels are used, including RBF and polynomial kernels. As is the case with the last three of these algorithms (SVM, k-nn and Bayesian), feature selection / reduction could significantly help the explicitness of the model, something we explore in the next section.

As no one algorithm provides both high accuracy and explicitness, it is important to consider new machine learning algorithms that include explainability as a consideration within the learning algorithm. One example of this approach is work by Kim, Rudin and Shah, who have suggested a Bayesian Case Model for case-based reasoning [70]. Another example is introduced by Lou et al. [89]. Their generalized additive models (GAMs) combine univariate models called shape functions through a linear function. On one hand, the shape functions can be arbitrarily complex, making GAMs more accurate than simple linear models. On the other hand, GAMs do not contain any interactions between features, making them more explicit than black box models. Lou et al. also suggested adding selected terms of interacting pairs of features to standard GAMs [90]. This method increases the accuracy of the models, while maintaining better explicitness than black box methods. Caruana et al. propose an extension of the GAM, GA2M, which considers pairwise interactions between features, and provide a case study showing its success in accurately and transparently explaining a health-care dataset [20].
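The structure described above can be summarized in the usual additive form. The notation is assumed here for illustration: $g$ is a link function, $f_j$ are the shape functions, and $S$ is the set of selected feature pairs.

```latex
\[
g\big(\mathbb{E}[y]\big) = \beta_0 + \sum_{j} f_j(x_j)
\qquad \text{(GAM)}
\]
\[
g\big(\mathbb{E}[y]\big) = \beta_0 + \sum_{j} f_j(x_j) + \sum_{(i,j) \in S} f_{ij}(x_i, x_j)
\qquad \text{(GAM with selected pairwise interaction terms)}
\]
```

Because each $f_j$ depends on a single feature, it can be plotted and inspected on its own, and each pairwise term $f_{ij}$ can be shown as a two-dimensional plot.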

We believe these approaches are worthy of further consideration and provide an important future research area as new combinations of machine learning algorithms that provide both high accuracy and explainability could potentially be developed. Several of these methods use an element of feature analysis as the basis of their transparency [20, 81, 89, 90]. In general, feature selection can be a critical element in creating transparent and non-transparent interpretations, as we now detail.

5.2 Generating Interpretations from Feature Selection / Analysis

A second approach to creating the interpretation is to perform feature selection and/or feature analysis over all features, before or after a model has been built. Theoretically, this approach can be used alone and exclusively to generate interpretations within the non-transparent “black box” algorithms, or in conjunction with the above “white box” algorithms to help further increase their explicitness. Feature selection has long been established as an effective way of building potentially better models which are simpler and thus better overcome the curse of dimensionality [56]. Additionally, models with fewer attributes are potentially more explicit as the true causal relationship between the dependent and independent variables is clearer and thus easier to present to the user [76]. The strong advantage of this approach is that the information presented to the user is generated directly from the mathematical relationship between a small set of features and the target being learned.

Three basic types of feature selection approaches exist: filters, wrappers, and embedded methods. We believe that filter methods are typically best suited for generating explicit interpretations as the analysis is derived directly from the data without any connection to a specific machine learning model [56, 112]. Univariate scores such as information gain or χ² can be used to evaluate each of the attributes independently. Either the top features could then be selected, or only those with a score above a previously defined threshold. The user’s attention could then be focused on relationships between these attributes, facilitating explicitness. Multivariate filters, such as CFS [57], allow us to potentially discover interconnections between attributes. The user’s attention could again then be focused on this small subset of features with the assumption that interrelationships between features have become more explicit. Previous work by Vellido et al. [136] recommends using principal component analysis (PCA) to generate interpretations. Not only does PCA reduce the number of attributes needing to be considered, but the new features generated by PCA are linear combinations of the original ones. As such, the user could understand an explanation based on these interrelationships, especially if both the number and size of these derived features are small.
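The univariate filter idea described above can be sketched with information gain computed directly from the data, independently of any learning algorithm. The toy dataset, the attribute names and the 0.5-bit threshold are hypothetical choices made only for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(column, labels):
    """Entropy of the labels minus their expected entropy after splitting on `column`."""
    n = len(labels)
    remainder = 0.0
    for value in set(column):
        subset = [y for x, y in zip(column, labels) if x == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# Hypothetical toy dataset: attribute name -> column of categorical values.
DATA = {
    "smoker":    ["y", "y", "n", "n", "y", "n"],
    "eye_color": ["b", "g", "b", "g", "b", "g"],
}
LABELS = ["risk", "risk", "ok", "ok", "risk", "ok"]

scores = {name: information_gain(col, LABELS) for name, col in DATA.items()}
# Keep only attributes whose score clears an (assumed) threshold of 0.5 bits.
top_features = [n for n, s in sorted(scores.items(), key=lambda kv: -kv[1]) if s >= 0.5]
print(scores, top_features)
```

Only the attributes that survive the threshold would then be presented to the user, which is what keeps the resulting interpretation small.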

As filter methods are independent of the machine learning algorithm used, it has been suggested that this approach can be used in conjunction with black box algorithms to make them more explicit [135]. One example is previous work that used feature selection to reduce the number of features from nearly 200 to 3 before using a neural network for classification [135]. As neural networks are becoming increasingly popular due to their superior accuracy on many datasets, we believe this is a general approach that is worth considering to help make neural networks more explicit.

5.3 Tools to Generate Model Interpretations Independently from L

The above methods are faithful in that the transparent algorithms and feature analysis are done in conjunction with L. However, other approaches exist that create the interpretation through a process independent of the logic within L. In the best case, the interpretation does faithfully approximate the actual and complete logic within L, albeit found differently, and thus represents a reverse-engineered version of the logic within L [8]. Even when the interpretation is not 100% faithful, the goal is to be as faithful and explicit as possible, making these approaches a type of metacognition process, or reasoning about the reasoning process (e.g. L) [29]. A key difference within the remaining approaches in this section is that the interpretation is created through an analysis after L’s learning has been done, something referred to as postprocessing [129] or post-hoc analysis [87, 96]. Examples of post-hoc approaches that we consider in the remainder of this section include: model and outcome interpretations, visualization, and prototyping similar records.

While disconnecting the interpretation from L can lead to a loss of faithfulness, it also brings other benefits and challenges. Designing tools that focus on the interpretation could potentially lead to very explicit models, something we represent in Figure 2. Additionally, interpretations that are derived directly from the machine learning algorithm or the features are strongly restricted by the nature of the algorithm / features. In contrast, interpretations that are created in addition to the decision-making algorithm can be made to comply with various standards. For example, Miller demonstrates how interpretations are often created by the same people that develop the system. They tend to generate explanations that are understandable to software designers, but are not explicit for the system’s users [93]. He suggests using insights from the social sciences when discussing explainability in AI. Other factors, such as legal and practical considerations, might limit researchers as to what constitutes a sufficient explanation. For example, as these tools disconnect the logic in the interpretation from L, they cannot guarantee the fairness of the agent’s decision, which may be a critical need and may even require transparency (see Section 3).

The first possibility creates a “model interpretation tool” that is used to explain the logic behind L’s predictions of the target across all records. A group of these approaches create simpler, transparent decision trees or rules secondary to L. While these approaches will have the highest level of explicitness, they will generally lack faithfulness. For example, Frosst and Hinton [44] present a specific interpretation model for neural networks in an attempt to resolve the tension between the generalization of neural networks and the explicitness of decision trees. They show how to use a deep neural network to train a decision tree. The new model does not perform as well as a neural network, but is explicit. Many other approaches have been used to provide explanations for neural networks, including decision trees [18, 31, 78], decision rules [7, 8, 30, 67, 150] and a combination of genetic algorithms with decision trees or rules [7, 66, 94]. Similarly, decision trees [23, 35, 149] and decision rules [34, 58, 118, 130] have been suggested to explain tree ensembles.
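The trade-off between explicitness and faithfulness in such secondary models can be made concrete with a deliberately small sketch: a single-threshold rule is fit to the outputs of an opaque model, and its fidelity is measured as the fraction of records on which the two agree. The `black_box` function, the grid of records and the stump learner are hypothetical stand-ins, not the methods of the cited works.

```python
def black_box(x):
    """Stand-in for an opaque model L (hypothetical): an interaction term, thresholded."""
    return int(8 * x[0] + x[0] * x[1] - 20 > 0)

def fit_stump(records, targets):
    """Pick the single 'feature > threshold' rule that best reproduces `targets`."""
    best = None
    for j in range(len(records[0])):
        for thr in sorted({r[j] for r in records}):
            agree = sum(int(r[j] > thr) == t for r, t in zip(records, targets))
            if best is None or agree > best[0]:
                best = (agree, j, thr)
    agree, j, thr = best
    return j, thr, agree / len(records)

# Records on a small grid; the surrogate is trained on L's outputs, not on true labels.
RECORDS = [(a, b) for a in range(6) for b in range(4)]
bb_out = [black_box(r) for r in RECORDS]
feature, threshold, fidelity = fit_stump(RECORDS, bb_out)
print(f"surrogate rule: x[{feature}] > {threshold}  (fidelity {fidelity:.2f})")
```

The resulting rule is as explicit as a model can be, yet it disagrees with the black box on one of the 24 records because it cannot express the interaction between the two features.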

Some explanations secondary to L are generated by using feature analysis and thus are most similar to the approaches in the previous section. One example of these algorithms is SP-LIME, which provides explanations that are independent of the type of machine learning algorithm used [102]. It is noteworthy that SP-LIME includes feature engineering as part of its analysis, showing the potential connection between the second and third approaches. The feature engineering in SP-LIME tweaks examples that are tagged as positive and observes how changing them affects the classification. A similar method has been used to show how Random Forests can be made explainable [131, 140]. The Random Forest can be considered a black box that determines the class of a given feature set. Its interpretability is obtained by determining how the different features contribute to the classification of a feature set [131], or even which features should be changed, and how, in order to obtain a different classification [140]. This type of interpretation is extremely valuable. For example, consider a set of medical features, such as weight, blood pressure, age etc. and a model to determine heart attack risk. Assume that for a specific feature set the model classifies the patient as high risk. The model’s interpretation facilitates knowing what parameters need to change in order to change the prediction to low risk.
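The heart attack example can be sketched as a search over single-feature changes that flip an opaque model's prediction. This is a simplified illustration only: `risk_model`, its scoring rule and the candidate values are hypothetical, and the search is a brute-force stand-in rather than the algorithm of [140].

```python
def risk_model(patient):
    """Stand-in for an opaque classifier (hypothetical scoring rule)."""
    score = (2 * (patient["weight"] > 90)
             + 2 * (patient["blood_pressure"] > 140)
             + (patient["age"] > 60))
    return "high" if score >= 3 else "low"

# Hypothetical candidate values per modifiable feature, ordered from the smallest
# change to the largest; age is deliberately excluded because it cannot be changed.
CANDIDATES = {
    "weight": [100, 95, 90, 85],
    "blood_pressure": [150, 145, 140, 135],
}

def single_feature_counterfactuals(patient, model, candidates, wanted="low"):
    """List (feature, value) changes that, applied alone, flip the prediction to `wanted`."""
    flips = []
    for feature, values in candidates.items():
        for value in values:
            changed = dict(patient, **{feature: value})
            if value != patient[feature] and model(changed) == wanted:
                flips.append((feature, value))
                break  # keep only the first flipping value listed for this feature
    return flips

patient = {"weight": 100, "blood_pressure": 150, "age": 50}
flips = single_feature_counterfactuals(patient, risk_model, CANDIDATES)
print(risk_model(patient), flips)
```

Each returned pair reads as an actionable statement of the form "had this feature been at this value, the prediction would have been low risk".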

5.4 Tools to Generate Outcome Interpretations Independently from L

The second possibility for creating interpretations independently from L creates an “outcome explanation” that is localized and explains the prediction for a given instance. It has been claimed that feature selection approaches are useful for obtaining a general, global understanding of the model, but not for the specific classification of an instance; consequently, the authors advocate using local interpretations [11]. One example is an approach that uses vectors which are constructed independently of the learning algorithm for generating localized interpretations [11]. Another example advocates using coalition game theory to evaluate the effect of combinations of features on a prediction [129]. Work by Lundberg and Lee presents a unified framework for interpreting predictions using Shapley game theoretic functions [91]. Certain algorithms have both localized and global versions. One example is the local algorithm LIME and its global variant, SP-LIME [102].
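The game-theoretic idea can be illustrated with an exact Shapley computation for a single instance. This sketch makes a simplifying assumption that differs from the cited frameworks: a feature that is "absent" from a coalition is set to a fixed baseline value. The `model` function and the instance are hypothetical.

```python
from itertools import permutations

def model(x):
    """Stand-in for an opaque predictor (hypothetical); note the a*c interaction."""
    return 2 * x["a"] + 3 * x["b"] + 4 * x["a"] * x["c"]

def shapley_values(instance, baseline, predict):
    """Exact Shapley values: average each feature's marginal contribution over all
    orderings. Features not yet added keep their baseline value (an assumption)."""
    names = list(instance)
    totals = dict.fromkeys(names, 0.0)
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)
        previous = predict(current)
        for name in order:
            current[name] = instance[name]
            value = predict(current)
            totals[name] += value - previous
            previous = value
    return {n: t / len(orderings) for n, t in totals.items()}

instance = {"a": 1, "b": 1, "c": 1}
baseline = {"a": 0, "b": 0, "c": 0}
phi = shapley_values(instance, baseline, model)
print(phi)
```

The contributions sum to the difference between the prediction for the instance and the prediction for the baseline, and the interaction term is shared equally between the two features involved. Enumerating all orderings is only feasible for a handful of features, which is why practical tools rely on approximations.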

5.5 Algorithms to Visualize the Algorithm’s Decision

While the explanations in the previous sections focused on ways a person could better understand the logic within L, visualization techniques typically focus on explaining how a subset of the features is connected to the prediction. However, the level of explicitness within visualization is lower than that of feature selection and model and outcome interpretations. This is because feature selection and model and outcome interpretations all aim to understand the logic within L, thus giving them a relatively higher level of faithfulness and explicitness. As visualization tools do not focus on understanding the logic within L, they are less faithful than feature analysis methods that do, and at times the level of understanding they provide is not high, especially for regular users.

Overall, many of these approaches seem to have the primary goal of justifying a specific outcome of L and are not focused on even localized interpretations of L’s logic. As justification is more concerned with persuading a user that a decision is correct than with providing information about L’s logic [16], it seems that justification methods likely have the least amount of faithfulness, as there is no need to make any direct connection between the interpretation and L. Consistent with this aim, work by Lei, Barzilay and Jaakkola generated rationales, which they defined as justifications for an agent’s local decision, by creating a visualization tool that highlighted which sections of text were responsible for a specific classification [84].

Consider explanations that can potentially be generated within image classification, a task many visualization tools address [40, 55, 125, 142, 148]. A visualization tool will typically identify the portion of the picture (a subset of the features) that was most responsible for yielding a prediction. However, typical visualizations, such as those generated by saliency masks, class activation mapping, sensitivity analysis and partial dependency plots, only focus on highlighting important portions of the input without explaining the logic within the model, and their output is often hard for regular users to understand. Nonetheless, these approaches are useful in explaining high-accuracy, low-explicitness machine learning algorithms, particularly neural networks, often within image classification tasks.

Saliency maps are visualizations that identify important (i.e. salient) objects, which are groups of features [142]. In general, saliency can be defined as identifying the region of an image that is responsible for identifying a given target [40]. For example, a picture may include several items, such as a person, a house and a car. The salient region can then be the part of the image depicting the car, which is used to properly identify it. Somewhat similar to the previous types of explanations, these salient features could then be used to generate a textual explanation of an image. For example, Xu et al. focused on identifying objects within a deep neural network (CNN) for picture identification to automatically create text descriptions [142] for a given picture (outcome description). Kim et al. created textual explanations for the neural networks of self-driving cars [71]. More generally, saliency masks can be used to identify the areas that represent the targets that were identified in the picture [40, 55, 125, 142, 148]. They generally use the gradient of the output corresponding to each of the targets with respect to the input features [87]. While earlier works constrained the neural network to provide this level of explicitness [148], recent works provide visual explanations without altering the structure of the neural network [40, 64, 119]. Still, serious concerns exist that many of these visualizations are too complex for regular users and are thus reserved for experts, as some of these explanations are only appropriate for people researching the inner workings of the algorithm to diagnose and understand mistakes [119].
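The gradient-based idea behind saliency masks can be sketched without a neural network library by approximating the gradient of a class score with respect to each input using finite differences. The `class_score` function, its weights and the tiny "image" are hypothetical; real saliency methods obtain the gradient by backpropagation through the trained network.

```python
def class_score(pixels):
    """Stand-in for a network's score for one target class (hypothetical weights)."""
    weights = [0.0, 0.5, 3.0, 2.0, 0.0, 0.1]
    return sum(w * p for w, p in zip(weights, pixels))

def saliency(pixels, score_fn, eps=1e-4):
    """Approximate |d score / d pixel| for each input with central finite differences."""
    grads = []
    for i in range(len(pixels)):
        up = list(pixels)
        up[i] += eps
        down = list(pixels)
        down[i] -= eps
        grads.append(abs(score_fn(up) - score_fn(down)) / (2 * eps))
    return grads

image = [0.2, 0.9, 0.4, 0.7, 0.1, 0.5]  # a flattened 2x3 "image"
sal = saliency(image, class_score)
# Keep the pixels with at least half of the peak saliency (threshold is an assumption).
mask = [s >= 0.5 * max(sal) for s in sal]
print(mask)
```

The mask highlights which inputs the score is most sensitive to, but says nothing about how the model combines them, which matches the limitation of these visualizations discussed above.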

Neural activation is a visualization for the inspection of neural networks that helps a person focus on which neurons are activated with respect to particular input records. As opposed to the previous visualizations, which focus on the input features and the predicted target, this visualization helps provide an understanding of neural networks’ decisions, making them less of a black box. Consequently, these approaches provide interpretation and not justification and are more faithful. For example, work by Yosinski et al. [145] proposes two tools for visualizing and understanding what computations and neuron activations occur in the intermediate layers of deep neural networks (DNN). Work by Shwartz-Ziv and Tishby suggests using an Information Plane visualization, which captures the mutual information that each layer preserves regarding the input and output variables of DNNs [123].

Other visualizations exist for other machine learning algorithms and learning tasks. Similar to saliency maps, sensitivity analysis provides a visualization that connects the inputs and outputs of the model [114]. Moreover, sensitivity analysis maps have been applied to tasks beyond image classification and to other black box machine learning algorithms such as ensemble trees [28]. For example, Cortez and Embrechts present five sensitivity analysis methods appropriate for both classification and regression tasks [28]. Zhang and Wallace present a sensitivity analysis for convolutional neural networks used in text classification.


Partial Dependency Plots (PDP) help visualize the average partial relationship between the predicted response of the model and one or more of its input features [43, 49]. PDPs use feature analysis as a critical part of their interpretation, and are much more faithful and explicit than many of the other visualization approaches in this section. However, as the primary output and interpretation tool is visual [43], we have categorized it in this section. Examples include work by Hooker that uses ANOVA decomposition to help create a Variable Interaction Network (VIN) visualization [63] and work by Goldstein et al. that extends the more classic PDP model by graphing the functional relationship between the predicted response and the feature for individual observations, thus making this a localized visualization [49]. Similarly, Krause et al. provide a localized visualization to create partial dependence bars, a color bar representation of a PDP [77].
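The difference between the averaged PDP and the per-observation curves described above can be sketched as follows. This is a minimal illustration under stated assumptions: `toy_model` is a hypothetical stand-in for any fitted predictor, and the function names are chosen here for illustration only.

```python
import numpy as np

def individual_curves(model, X, feature, grid):
    """One curve per record: the prediction as `feature` is swept over `grid`."""
    curves = np.empty((X.shape[0], len(grid)))
    for j, value in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, feature] = value  # overwrite the feature for every record
        curves[:, j] = model(X_mod)
    return curves

def partial_dependence(model, X, feature, grid):
    """Classic PDP: the average of the individual curves."""
    return individual_curves(model, X, feature, grid).mean(axis=0)

def toy_model(X):
    # Hypothetical predictor with an interaction between features 0 and 1.
    return 3.0 * X[:, 0] + X[:, 0] * X[:, 1]

X = np.array([[0.0, 1.0], [1.0, -1.0], [2.0, 0.0]])
grid = np.array([0.0, 1.0, 2.0])

per_record = individual_curves(toy_model, X, 0, grid)
pdp = partial_dependence(toy_model, X, 0, grid)
```

Here the averaged curve has slope 3, while the individual curves have slopes 4, 2 and 3. The average hides the interaction with the second feature, which is the motivation for the localized variants.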

5.6 Generating Explanations from Prototyping the Dataset’s Input as Examples

Similar to visualization tools, prototype selection also seeks to clarify the link between the model’s input and output. However, while visualization tools focus on the input features, prototyping focuses on the records, seeking a subset of records similar to the record being classified. This subset is meant to serve as an implicit explanation as to the correctness of the model, as prototyping aims to find the minimal subset of input records that can serve as a distillation or condensed view of the dataset [15].

Prototypes have been shown to help people better understand the model’s decisions. For example, work by Hendricks et al. focuses on providing visual explanations for images that include class-discriminative information about other images that share common characteristics with the image being classified [60]. The assumption here is that the information about similar pictures in the same class helps people better understand the decision of the algorithm. Bien and Tibshirani propose two methods for generating prototypes: an LP relaxation with randomized rounding and a greedy approach [15]. Work by Kim et al. suggested using maximum mean discrepancy to generate prototypes [69]. In other work, Kim et al. suggest using a Bayesian Case Model (BCM) to generate prototypes [70].
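To make the notion of a small covering subset concrete, the following is a simplified greedy sketch. It is not the formulation of any of the cited works: it assumes a plain set-cover objective in which a prototype "covers" same-class records within a fixed Euclidean radius, and `greedy_prototypes`, the radius and the data are hypothetical.

```python
import numpy as np

def greedy_prototypes(X, y, radius):
    """Greedily pick records until every record lies within `radius`
    of some selected prototype of the same class."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # covers[i, j] is True when record i could serve as a prototype for record j.
    covers = (dist <= radius) & (y[:, None] == y[None, :])
    uncovered = np.ones(len(X), dtype=bool)
    prototypes = []
    while uncovered.any():
        gains = (covers & uncovered[None, :]).sum(axis=1)
        best = int(np.argmax(gains))  # record covering the most uncovered records
        prototypes.append(best)
        uncovered &= ~covers[best]
    return prototypes

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0]])
y = np.array([0, 0, 0, 1, 1])
selected = greedy_prototypes(X, y, radius=0.5)
```

Because each record always covers itself, the loop terminates. In this example two prototypes, one per cluster, summarize all five records.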

5.7 Comparing the Six Basic Approaches for Generating Interpretations

Referring back to Figure 2, each of these approaches will differ along the axes of explicitness and faithfulness. It has been previously noted that many of the visualization approaches produce interpretations that are not easily understood by people without an expert-level understanding of the problem being solved [96], making them not very explicit. As they often provide justification and no direct interpretation of the logic in the learning algorithm, they are also not very faithful. As prototypes provide examples of similar classifications, they are often more explicit than visualizations, as regular users can more easily understand their meaning. However, as they also do not attempt to directly explain the learning algorithm’s logic, they are not more faithful. Other approaches, such as transparent ones, have high levels of both explicitness and faithfulness, but are typically limited to white box methods that facilitate these types of interpretability. Model and outcome tool approaches can potentially be geared to any user, making them very explicit, but are less faithful as the logic generated in the interpretation is not necessarily the same as that in the learning algorithm. When taken in combination with a white box algorithm, feature analysis methods can be very explicit and faithful. At times, they are used independently of the learning algorithm, potentially making them less faithful.

Referring back to Figure 1, each of the approaches described in this section falls under the interpretation component of the explainability model described in Section 2.3. However, note the overlaps within the Venn diagram, as overlaps do exist between some of the approaches described in this section. While transparent approaches do provide a direct link between the model’s logic and what is shown to the user, sometimes this link is strengthened and/or described through an analysis of the features, as commonly seen in Feature Analysis approaches. For example, the GAM and GA2M approaches [20, 89] use univariate and pairwise feature analysis methods respectively in their transparent models. While model tools such as SP-LIME pride themselves on being agnostic, i.e. no direct connection is assumed between the interpretation and the underlying learning algorithm, they do use elements of feature analysis and visualization in creating their global interpretation of the model [102]. Similarly, the outcome explanation model, LIME, also uses feature analysis and visualization in creating its local interpretations of the model [102] for an individual instance. Saliency maps are a visualization that is based on identifying the features used for classifying a given picture [40], showing the potential overlap between visualization methods and feature analysis. However, at times, the identified salient features are used to create an outcome interpretation, as is the case in other work [142]. Similarly, work by Lei, Barzilay and Jaakkola generated visualizations of outcomes through analyzing which features were most useful to the model, again showing the intersection of these three approaches. Last, some prototype analysis tools, such as work by Hendricks et al., use visual methods [60]. Thus, we stress that the different types of interpretation approaches are often complementary and not mutually exclusive.

Given these differences in the explicitness and faithfulness of each of these approaches, it seems logical that the type of interface used for disseminating the system’s interpretation will likely depend upon the level of the user’s expertise and the type of interpretation that was generated. The idea of adaptable interfaces based on people’s expertise was previously noted [52, 121, 128]. In these systems, the type of information presented in the interface depends on the user’s level of expertise. Accordingly, an interface might consider different types of interpretation or interpretation algorithms based on who the end-user will be. Even among experts, it is reasonable to assume that different users will need different types of information. The different backgrounds of legal experts, scientists, safety engineers, and researchers may necessitate different types of interfaces [37].

6 When Should Information be Presented?

Explanations can be categorized based on when the interpretation is presented:

  1. Before the task

  2. Continuing explanations throughout the task

  3. After the task

Some agents may present their interpretation before the task is executed as either justification [16], conceptualization or proof of fairness of an agent’s intended action [39]. Other agents may present their explanation during task execution, especially if this information is important to explain when the agent fails so it will be trusted to correct the error [65, 48, 53]. Other agents provide explanations after actions are carried out [80], to be used for retrospective reports [87].

It is important to note that not all approaches for what can be generated, as per Section 5, support all of these possibilities. While all methods can be used for analysis after the task, many of these methods use post-hoc analysis that separates the interpretation from the learning algorithm. Thus, if fairness needs to be checked before task execution, the lack of connection between the interpretation and the learning algorithm in model and outcome explanations, visualizations, and prototypes makes this difficult to check accurately. Transparent methods could fulfill this requirement due to their inherent faithfulness. Feature analysis methods including, but not limited to, GAM, GA2M, and PDP [20, 43, 49, 90] can check the connection between inputs and outputs, thus confirming that fairness or other legal requirements are met even before task execution.

The choice of when to present the explanation is not exclusive. Agents might supply various explanations at various times: before, during and after the task is carried out. Building on the taxonomy in Section 3, if explainability is critical for the system to begin functioning, then it stands to reason that this knowledge must be presented at the beginning of the task, thus enabling the user to determine whether to accept the agent’s recommendation [120]. However, if it is beneficial to build trust / user acceptance, then it might be provided during the task, especially if the agent erred. If the purpose of the explanation is to justify the agent’s choice from a legal perspective, then we may need to certify that decision before the agent acts (preventative) or after the act (accusatory). But, if the goal is conceptualization, especially in the form of knowledge discovery and/or to support future decisions, then the need for explanation after task execution is equally critical. These possibilities are not inherently mutually exclusive. For example, work by Vigano and Magazzeni [137] claims that explanations should be provided throughout all stages of the system’s lifecycle within security systems. They describe how explanations should begin as the system is designed and implemented, continue through use, analysis and change, and maybe even when it is replaced. One may argue whether this is crucial for all systems or only for the security systems that are discussed in their work, but it is surely a point to consider.

7 How can Explanations be Evaluated?

It was previously noted that little agreement currently exists about how to define explainability and interpretability which may be adding to the difficulty in properly evaluating it [37]. In order to address this point, we first clearly defined these terms in Section 2.3, and then proceeded to consider questions of why, what, when and how based on these definitions.

As we discuss in this section, creating a general evaluation framework is still an open challenge as these issues are often intrinsically connected. For example, the detail of an explanation is often dependent on why that explanation is needed. An expert will likely differ from a regular user regarding why an explanation is needed, will often need these explanations at different times, e.g. before or after the task (when), and may require different types of explanations and interfaces (what and how). At other times multiple facets of explanation exist even within one category. A DSS system is built to support a user’s decision, thus making explainability a critical issue. However, these systems will still likely benefit from better explanations, so that the user trusts those explanations. Similarly, a scientist pursuing knowledge discovery may need to analyze and interact with information presented before, during and after a task’s completion (when). Thus, multiple goals must often be considered and evaluated.

To date, there is little consensus about how to quantify these interconnections. Many works evaluated explainability as a binary value: either it works or it doesn’t. Within these papers, an explanation is inspected to ensure that it provides the necessary information in the context of a specific application [84, 102]. If it does so, it is judged as a success, even if other approaches may have been more successful. To help quantify evaluations, Doshi-Velez and Kim suggested a taxonomy of three types of evaluation tasks: application-grounded, human-grounded, and functionally-grounded [37]. In their model, application-grounded tasks are meant for experts attempting to execute a task, and the evaluation focuses on how well the task was completed. Human-grounded tasks are simplified and can be performed with regular users. They conceded that it is not clear what the evaluation goal of this task need be but recommended simplified evaluation metrics, such as reporting which explanation the users preferred. Work by Mohseni and Ragan suggested creating canonical datasets of this type that could quantify differences between interpretable algorithms [95]. They proposed annotating regions in images and words in texts that provide an explanation. The output of any new interpretation algorithm could be compared to the user annotations, which provide a ground truth. This approach is still goal-oriented and thus they classify their task as a human-grounded task (e.g. having the interpretation match the human explanation). Doshi-Velez and Kim’s last type of evaluation task is functionally-grounded, where some objective criterion for evaluation is predefined (e.g. the ratio of decision tree size to model accuracy). The main advantage of this type of evaluation is that the interpretation can be quantified without any need for user studies.
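Comparing an interpretation against human annotations requires some overlap measure. As one possible sketch, the snippet below scores a highlighted image region against a human-annotated region using intersection-over-union. The choice of this measure, the function name `explanation_overlap` and the masks are assumptions made here for illustration and are not taken from the cited work.

```python
import numpy as np

def explanation_overlap(predicted_mask, human_mask):
    """Intersection-over-union between the region an interpretation
    highlights and the region a human annotator marked."""
    predicted = np.asarray(predicted_mask, dtype=bool)
    human = np.asarray(human_mask, dtype=bool)
    union = np.logical_or(predicted, human).sum()
    if union == 0:
        return 1.0  # both masks are empty, so they trivially agree
    return float(np.logical_and(predicted, human).sum() / union)

# Hypothetical 4x4 image: the annotator marks a 2x2 block,
# the interpretation highlights a slightly wider 2x3 block.
human = np.zeros((4, 4), dtype=bool)
human[1:3, 1:3] = True
predicted = np.zeros((4, 4), dtype=bool)
predicted[1:3, 1:4] = True

score = explanation_overlap(predicted, human)
```

A score of this kind lies between 0 and 1, which would make it usable as a grade for one goal without a user study once the annotated dataset exists.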

This taxonomy provides three important types of tasks that can be used to evaluate explainability and interpretability, but these researchers do not propose how to quantify the effectiveness of the key components within the interpretation and the explanation. This paper’s main point is that questions about why, what, when and how the system needs explainability must be addressed. These elements can and should be quantified, while also considering trade-offs between these categories as well as between elements within the interpretation. Issues about algorithms’ explicitness, faithfulness and transparency must be explicitly evaluated while balancing the agent’s and user’s performance requirements, including the agent’s fairness and prediction accuracy, and the user’s performance and acceptance of the agent’s predictions.

Given this, we suggest explicitly evaluating the following three elements in Human-Agent Systems: the quantifiable performance of the agent’s learning, its level of interpretation, and the human’s understanding. For example, in a movie recommendation system the three scores would be described as follows. The score for the agent’s learning is based on standard metrics for evaluating recommendation predictions (e.g. accuracy, precision and/or recall). A score can also be given to the interpretation that reflects how much explicitness, faithfulness and transparency it exhibits according to objective criteria described below. The score for the human’s understanding should be quantified based on the user’s performance. As the goal of the system is to yield predictions that the user understands so they will be accepted, we should quantify the impact of the interpretation on the user’s behavior. Thus, we suggest an evaluation score that quantifies:

  1. A score for the agent’s learning: the performance of the agent’s prediction

  2. A score for the interpretation given to the user

  3. A score for the human’s understanding: the user’s acceptance of the agent’s predictions

As described in previous sections, a complex interplay exists between these three elements. There is often a trade-off between the performance of the learning algorithm and the explicitness of the interpretation that can be produced from it (see Figure 3). White-box algorithms are more explicit and can even be transparent, but typically have lower performance. Higher accuracy algorithms, such as neural networks, are typically less explicit and faithful (see Figure 2). Thus, agents with lower performance scores for their learning will likely have higher scores for their interpretation, especially if explicitness and faithfulness are important and quantifiable within the system. Furthermore, different user types and interfaces will be affected by the type of agent design, and a total measure is needed that takes into account all of the parameters a system requires. For example, an agent that was designed to support an expert user is different from one provided to a regular user.

Another equally important element of the system is how well the person executed her system task(s) given the interpretation. In theory, multiple goals for the explanation may exist for the human user, such as immediate performance vs. long-term knowledge acquisition. These may be complementary or in conflict. For example, assume the explanation goal of a system is to support a person’s ability to purchase items in a time-constrained environment (e.g. online stock purchasing). Greater detail within the agent’s explanations on the one hand instills improved confidence in the user, but on the other hand takes more time to read and process, which may prevent the user from capitalizing on certain quickly passing market fluctuations. Thus, some measure should likely be introduced to reason about different goals for the explanation and the relative strengths of various explanations, their interfaces, and the algorithms that generate those explanations.

To capture these properties, we propose an overall utility to quantify the complementary and contradictory goals for the agent’s learning, the interpretation and the user’s understanding as the weighted product:


The product runs over all of the goals in the system. The agent’s learning, the interpretation and the user’s understanding each have an overall objective, and the system meets these objectives through all of these goals. The objective for the agent’s learning is to provide predictions for the target from the input data. The ability of the system to meet this objective is measured through machine learning performance metrics that quantify goals such as high accuracy, recall, precision, F-measure and mean average precision, and low mean squared error. A goal for the agent’s learning can also be that it exhibits fairness, which is often a hard constraint due to legal considerations. The objective for the interpretation is to provide a representation of the learning algorithm’s logic that is understandable to the user. The success of this objective can be measured by goals for the interpretation to have the highest levels of explicitness, faithfulness, and transparency. Other papers have suggested additional goals for the interpretation, including justification [74] and completeness [48], that we argue are included in the goals of explicitness and faithfulness respectively. The objective for the user’s understanding is that the person will understand the agent’s logic using the interpretation. This can be measured through the goal that the user’s performance be improved given the interpretation. Additional goals include those specified in Section 3, including guaranteeing safety concerns, trust, and knowledge / scientific discovery. Goals relating to the timing of when interpretations are presented (e.g. "presenting during task as required") are likely hard constraints (i.e. either it was done at the correct time or not). Each goal is assigned an importance weight and a score (grade), with constraints on the values each may take.
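Written out, a weighted product of this kind could take the following form. This is a sketch consistent with the description above, not a quotation: the symbols $N$ (number of goals), $w_n$ (importance weight of goal $n$) and $g_n$ (grade of goal $n$), as well as the stated bounds, are assumptions chosen for illustration.

```latex
\[
  \mathrm{Utility} \;=\; \prod_{n=1}^{N} w_n \, g_n ,
  \qquad 0 < w_n \le 1, \quad 0 \le g_n \le 1 .
\]
```

Under this form a single goal with a grade of zero drives the whole product to zero, which is the behavior one would want from a hard constraint such as fairness.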

While this model helps quantify the interplay of multiple explanation goals, either inherently complementary or contradictory, a fundamental question lingers about how to set the number of goals, their importance weights and their grades. Of these values, we argue that the number of goals is the easiest to define and should be clearly defined in advance, by addressing the key elements of explainability as defined in Section 2.3 (e.g. fairness, transparency, explicitness, etc.) as well as the questions about why, what, when and how discussed in subsequent sections. Many of these goals are hard constraints that must be fulfilled. For example, assuming a system must exhibit fairness but fails to do so, the grade for this goal will be zero. As the overall utility is the product of all goals and their grades, the net utility of the system will be zero as the hard constraint was not met.
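The effect of a failed hard constraint on a product-based utility can be checked with a short sketch. It assumes that each goal contributes the product of an importance weight in (0, 1] and a grade in [0, 1]. These bounds, the function name `overall_utility` and the example numbers are illustrative assumptions, not values from the paper.

```python
from math import prod

def overall_utility(goals):
    """Multiply importance * grade over all goals.

    `goals` is an iterable of (importance, grade) pairs. The assumed ranges
    are 0 < importance <= 1 and 0 <= grade <= 1.
    """
    goals = list(goals)
    for importance, grade in goals:
        if not 0.0 < importance <= 1.0:
            raise ValueError("importance must lie in (0, 1]")
        if not 0.0 <= grade <= 1.0:
            raise ValueError("grade must lie in [0, 1]")
    return prod(importance * grade for importance, grade in goals)

# Hypothetical system with three goals: accuracy, explicitness, fairness.
fair_system = [(1.0, 0.9), (0.5, 0.8), (1.0, 1.0)]
unfair_system = [(1.0, 0.9), (0.5, 0.8), (1.0, 0.0)]  # fairness constraint failed

utility_fair = overall_utility(fair_system)
utility_unfair = overall_utility(unfair_system)
```

The first system scores 0.9 x 0.4 x 1.0 = 0.36, while the second scores zero regardless of how well it does on the other goals.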

However, the importance weights and grades are much more difficult to quantify in many real-world applications, the first category within the taxonomy of Doshi-Velez and Kim [37]. We assume that either users themselves, the system’s designer, or outside third party organizations (e.g. governments) can quantify these values, including the trade-offs between them. For example, a system designer may determine that a transparent system is necessary or desirable by setting the importance weight for this goal accordingly. (Assuming transparency is not needed at all, it should be removed so as to avoid a zero grade making the net utility zero.) If a high level of transparency can only be obtained at the cost of a lower accuracy, two conflicting goals exist and the system designer will need to decide the relative importance of both goals, i.e. the importance weight for each goal, so the optimal trade-off can be found.

Further complicating the evaluation calculation, to date no quantifiable measurements exist for the goals of the interpretation. In contrast, very quantifiable metrics exist for the agent’s learning and, to a lesser degree, for the goals of the user’s understanding.

The agent’s learning is relatively easily quantifiable through accepted measures such as accuracy, precision, and recall. We suggest that goals within the user’s understanding be measured through known tools for quantifying the user experience, as is typically done in HCI studies. In addition to grading the user’s performance, accepted user performance metrics such as the NASA-TLX (Task Load Index) [59] and the System Usability Scale [19] can measure these goals by focusing on the user’s mental workload [110]. Specific to measuring explanations, it has been suggested that simplicity and satisfaction be considered as potential metrics [46, 88]. However, to date, no accepted measures exist for quantifying the elements of the interpretation described in this paper. For example, what should the scale for grading an algorithm’s explicitness or faithfulness be? Should they be normalized between 0 and 1? If so, on what basis? This is an open challenge that we believe needs to be addressed in the future.

We agree with previous work that simplified tasks, functional metrics and binary grades should all be used to help tractably evaluate explainable systems [37, 84, 102]. Examples of simplified tasks include the creation of canonical datasets where an objective truth exists about correct interpretations and explanations [95]. A second approach is to quantify the relative value of different approaches and only give a non-zero score to the one that the evaluators feel is best [84, 102]. Thus, the scoring function could be made boolean (e.g. the interpretation is either explicit or it isn’t), greatly simplifying the calculation of the system’s total value. In both cases, whether through canonical datasets or through binary evaluations, questions about how to set the grades can be resolved. A third set of approaches suggests quantifying elements of the interpretation through functional metrics [37]. The assumption behind this approach is that as the size of a machine learning model grows, it becomes less explicit. Thus, models with large numbers of nodes / hidden layers (e.g. in deep neural networks), parameter values (for regression and SVM models), rules (rule-based models), or greater depth (in decision trees) are less preferable to those with smaller values [33]. Following Section 5, we could also create an objective metric based on the number of features generated from feature selection that are used to create the model. The advantage of both of these approaches is that the value of explanations can be set independently of the system’s specific task. A fourth approach is to create simplified accepted tasks or case studies, potentially where simulations of human behavior could be used repeatedly for evaluation across different algorithms, interfaces, and approaches [36, 37]. Similarly, this could create a standardization for the number of goals, their importance weights and their grades, again greatly aiding the evaluation process. Currently, no such canonical tasks have been universally accepted, leaving this issue as an open challenge.
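Functional metrics of the kind listed above (depth, node count, number of features used) are simple to compute once a model’s structure is accessible. The sketch below assumes a hypothetical decision tree stored as nested dictionaries, where internal nodes carry `feature`, `left` and `right` keys and leaves carry a `label`. This layout is an illustration only and does not correspond to any particular library.

```python
def tree_depth(node):
    """Number of decision levels below (and including) this node."""
    if "feature" not in node:  # leaf
        return 0
    return 1 + max(tree_depth(node["left"]), tree_depth(node["right"]))

def tree_size(node):
    """Total number of nodes, counting both decisions and leaves."""
    if "feature" not in node:
        return 1
    return 1 + tree_size(node["left"]) + tree_size(node["right"])

def features_used(node):
    """Set of distinct features the tree actually tests."""
    if "feature" not in node:
        return set()
    return {node["feature"]} | features_used(node["left"]) | features_used(node["right"])

# Hypothetical tree: split on age, then on income for the right branch.
tree = {
    "feature": "age",
    "left": {"label": "reject"},
    "right": {
        "feature": "income",
        "left": {"label": "reject"},
        "right": {"label": "accept"},
    },
}

metrics = {
    "depth": tree_depth(tree),
    "size": tree_size(tree),
    "num_features": len(features_used(tree)),
}
```

How such counts should be mapped onto a grade (for instance, normalized against a maximum acceptable size) remains a design decision that the metric itself does not settle.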

8 Discussion

Explainability of Human-Agent Systems is a complex matter, involving multiple, and sometimes contradictory, aspects. This paper is unique in the integrated approach we take to addressing these questions. In Section 2 we define key terms based on previous work [5, 39, 53, 66] combined with contributions of our own. We then use these terms to analyze extensively why these explanations are needed, who the recipient is and whether they have a specific skill level, what the mechanism for generating these explanations is, what interpretations can be generated, when they should be presented, and how the entire system can be evaluated.

To help focus the reader on previous contributions and how they relate to this work, Table 2 presents a mapping of papers that we encountered to the main component of each of the fundamental questions regarding explainability we addressed. For each paper that we included in the mapping in Table 2 we provide the citation number, and indicate the questions about explainability they address. Please note that there are no studies that touch on all aspects of explainability presented in this paper. However, we do agree that certain papers touch upon more than one of these topics. In this case, we categorized the paper as per which issue we felt was the paper’s focus. As this is a dynamic field, new papers do exist and we do not claim that this list is exhaustive. However, we believe that all papers, both current and future, can be categorized as per the divisions in this table.

It is interesting to contrast the number of papers aimed at clearly defining these terms (line 2) and those that address the questions of why, who, when and how (lines 3, 4, 11, and 12) with the number of papers that address the question of what (lines 5–10). Most papers are unfortunately geared to addressing only this last issue without focusing on other key points about explainability. This paper’s key point is that these other questions are actually extremely important to ask, as they heavily affect the question of what explanation to generate.

Section | Papers
Definitions (Sec. 2) | [2], [21], [37], [51], [53], [117], [127], [133]
Why (Sec. 3) | [37], [41], [53], [102], [137]
Who (Sec. 4) | [37], [60], [125], [137]
What: Transparent (Sec. 5.1) | [12], [14], [17], [20], [22], [26], [33], [38], [42], [54], [58], [68], [70], [75], [81], [82], [89], [90], [92], [97], [100], [127]
What: Feature analysis (Sec. 5.2) | [76], [135], [136]
What: Model tool (Sec. 5.3) | [7], [8], [18], [23], [30], [31], [34], [35], [44], [58], [66], [67], [78], [94], [102], [118], [130], [131], [140], [149], [150]
What: Outcome tool (Sec. 5.4) | [11], [91], [93]
What: Visualization (Sec. 5.5) | [28], [40], [49], [55], [63], [64], [71], [77], [84], [114], [119], [123], [125], [142], [145], [147], [148]
What: Prototyping (Sec. 5.6) | [15], [60], [69], [70]
When (Sec. 6) | [80], [87], [120], [137]
How (Sec. 7) | [19], [33], [37], [59], [95]
Table 2: Each section in our paper discusses an aspect of explainability. We list the papers discussed for each of these aspects.

To date, several excellent surveys exist on this topic, each of which addresses aspects of the taxonomy and evaluation framework about the Why, Who, What, When and How for this issue [1, 37, 38, 53, 96, 146]. In this paper, we claim all issues regarding explainability are interwoven. This leads to a different analysis of explainability. For example, two recent papers [53, 96] provide excellent coverage of the topic of interpretability of black box models. Another survey is even more specific, focusing on visualization tools for deep neural networks [146]. Conversely, a different survey [38] focuses on white box algorithms. Another survey [1] is written from an HCI perspective with the aim of using topic modeling to uncover trends within the HCI perspective of explainable systems. A different one focuses on evaluation possibilities [37]. We encourage the reader to study these papers in conjunction with our work as they address specific points within this paper.

We believe it is a mistake to exclusively focus on either white box or black box algorithms as the source of interpretability within explainable systems. While black box models often perform better, this metric is only one aspect of the evaluation, namely the agent’s performance. If the motivation for requiring explainability arises from legal issues, only transparency may fulfill this need, as it allows for levels of explicitness and faithfulness that do not exist within black box methods. Assuming the goal for explainability is to guarantee safety concerns as per Section 3, then this may be a hard constraint which precludes other methods for generating explanations. If the user is not an expert (as per Section 4), then black box visualization tools are likely not useful as they are typically only readily understood by experts [96]. However, if performance is a larger concern, and the goal of explainability is to build trust, then explanations built upon prototyping as per Section 5 may be sufficient for a regular user, despite their not being explicit or faithful. If the target of explanation is an external legal entity concerned with the algorithm’s faithfulness, then feature analysis is likely sufficient to assuage concerns that a specific set of features is being abused. In all cases, the evaluation of the algorithm will likely change, as the type of user, timing, and type of explanation being generated will likely need to be fundamentally different given the large variations between the described elements within the explanation.

Looking forward, we identify several open issues that result from our analysis of this issue:

  1. What measurements for explicitness and faithfulness can be created for a given algorithm?

  2. What canonical tasks can be developed for measuring interpretation and explanation?

  3. What tasks can be identified where one method for what type of interpretation (Section 5) is clearly better?

  4. Can the level of interpretability within black box models equal that of white box ones?

  5. Is justification ever advantageous over interpretable models with higher levels of explicitness and faithfulness?

Previously, Doshi-Velez and Kim [37] posed an open challenge to identify what important factors should be considered in evaluating interpretation and evaluation quality. We believe this paper has clearly identified explicitness and faithfulness as these key elements within the interpretation. Nonetheless, further work is necessary to quantify these elements, especially as different algorithms described in Section 5 of this paper differ in this regard, even within the six categories we presented. Similarly, even without quantifying these elements, canonical tasks are needed where all six types of these algorithms can be implemented, to facilitate comparing the relative performance of these algorithms on the goals of the agent’s learning, the interpretation, and the user’s understanding. We hope that certain tasks could be identified where one type of algorithm is clearly better in addressing one aspect of the issues we raise, such as trust. While many works exclusively focus on black box interpretation due to the high performance of these agents, it is not clear if these agents are suited for all situations and if the level of interpretability of these algorithms will ever reach that of white box agents. Until this happens, their suitability must be questioned for certain situations such as fulfilling legal requirements for explainability or safety. Last, we have intentionally limited our discussion about justification as there is no need for these models to be part of an explainable system. Nonetheless, we do believe that justification is important for many of the goals of explainable systems. One open question is to further study the relationship between justification and explainability such that tasks could be identified when one approach is advantageous over the other, or if hybrid methods that incorporate elements of both approaches should be used.

9 Conclusion

We presented a framework designed to enable comparison and evaluation of explainability in Human-Agent Systems. As Human-Agent Systems are diverse and complex, there is no “one explanation type fits all”. Each agent must have its requirements and goals mapped out, and the appropriate explanation chosen. We focused on agents that use machine learning and attempted to define this new field.

Our first contribution is a clear and consistent set of definitions for the key terms of explainability, interpretability, transparency, fairness, explicitness and faithfulness in learning algorithms that interact with people. Using these definitions, we systematically addressed five questions about explainability: Why, Who, What, When and How. These questions define the various aspects of the explanation for the system. In designing an agent, one must first establish why the system requires explanation, as this will affect the answers to the other questions. Next, one must determine who is the target of this explanation, what type of explanation is needed and when it must be presented. Finally, the question of how to evaluate the explanation must be addressed. For each of the questions we presented possible approaches, and discussed when each possibility is likely most appropriate. Various factors affect the answers to these questions. We discussed how the degree of control of the user over the agent affects the need for explainability. We investigated how different user types might affect the explanation. We also discussed how the type of learning that agents perform will affect the explanation that is provided. We then discussed parameters for when to present the information. Finally, we presented an evaluation measure, composed of three elements, for comparing systems. Our proposed utility is capable of combining all the aspects of the system (the machine learning algorithm, user performance and the explanation) into a single measure. We discussed the strengths and limitations of our proposed measure. While the measure provides a means for comparison, its main limitation is that determining the values of its parameters can involve potentially subjective elements.

We hope that the definitions presented in this paper will serve as a basis for future studies of the five questions about explainability that we present, particularly in the proper evaluation of explainability in Human-Agent Systems. Furthermore, we hope additional researchers will use this framework for further analysis of new algorithms, including suggesting extensions, in this emerging field. Towards this goal, we identified several open issues based on the analysis presented in this paper.

References

  • [1] Ashraf Abdul, Jo Vermeulen, Danding Wang, Brian Y. Lim, and Mohan Kankanhalli. Trends and trajectories for explainable, accountable and intelligible systems: An hci research agenda. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pages 582:1–582:18, 2018.
  • [2] Peter Achinstein. The nature of explanation. Oxford University Press, 1983.
  • [3] Frédéric Adam. Encyclopedia of decision making and decision support technologies, volume 2. IGI Global, 2008.
  • [4] M. A. Ahmad, A. Teredesai, and C. Eckert. Interpretable machine learning in healthcare. In 2018 IEEE International Conference on Healthcare Informatics (ICHI), pages 447–447, 2018.
  • [5] David Alvarez-Melis and Tommi S. Jaakkola. Towards robust interpretability with self-explaining neural networks. CoRR, abs/1806.07538, 2018.
  • [6] Ofra Amir and Kobi Gal. Plan recognition and visualization in exploratory learning environments. ACM Transactions on Interactive Intelligent Systems (TiiS), 3(3):16, 2013.
  • [7] A Duygu Arbatli and H Levent Akin. Rule extraction from trained neural networks using genetic algorithms. Nonlinear Analysis: Theory, Methods & Applications, 30(3):1639–1648, 1997.
  • [8] M. Gethsiyal Augasta and T. Kathirvalavakumar. Reverse engineering the neural networks for rule extraction in classification problems. Neural Process. Lett., 35(2):131–150, April 2012.
  • [9] Amos Azaria, Zinovi Rabinovich, Claudia V Goldman, and Sarit Kraus. Strategic information disclosure to people with multiple alternatives. ACM Transactions on Intelligent Systems and Technology (TIST), 5(4):64, 2015.
  • [10] Amos Azaria, Ariella Richardson, and Sarit Kraus. An agent for the prospect presentation problem. In Proceedings of the 2014 international conference on Autonomous agents and multi-agent systems, pages 989–996. International Foundation for Autonomous Agents and Multiagent Systems, 2014.
  • [11] David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, and Klaus-Robert Müller. How to explain individual classification decisions. J. Mach. Learn. Res., 11:1803–1831, August 2010.
  • [12] Steven C Bagley, Halbert White, and Beatrice A Golomb. Logistic regression in the medical literature: Standards for use and reporting, with particular attention to one medical domain. Journal of clinical epidemiology, 54(10):979–985, 2001.
  • [13] Samuel Barrett, Avi Rosenfeld, Sarit Kraus, and Peter Stone. Making friends on the fly: Cooperating with new teammates. Artificial Intelligence, 242:132–171, 2017.
  • [14] Riccardo Bellazzi and Blaz Zupan. Predictive data mining in clinical medicine: current issues and guidelines. International journal of medical informatics, 77(2):81–97, 2008.
  • [15] Jacob Bien and Robert Tibshirani. Prototype selection for interpretable classification. The Annals of Applied Statistics, pages 2403–2424, 2011.
  • [16] Or Biran and Courtenay Cotton. Explanation and justification in machine learning: A survey. In IJCAI-17 Workshop on Explainable AI (XAI), 2017.
  • [17] Blaz Zupan, Janez Demsar, Michael W Kattan, J Robert Beck, and Ivan Bratko. Machine learning for survival analysis: a case study on recurrence of prostate cancer. Artificial intelligence in medicine, 20(1):59–75, 2000.
  • [18] Olcay Boz. Extracting decision trees from trained neural networks. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 456–461, 2002.
  • [19] John Brooke et al. SUS: a quick and dirty usability scale. Usability evaluation in industry, 189(194):4–7, 1996.
  • [20] Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1721–1730, 2015.
  • [21] Jessie Y Chen, Katelyn Procci, Michael Boyce, Julia Wright, Andre Garcia, and Michael Barnes. Situation awareness-based agent transparency. Technical report, Army Research Lab Aberdeen Proving Ground MD Human Research and Engineering Directorate, 2014.
  • [22] Jie Cheng and Russell Greiner. Comparing bayesian network classifiers. In Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence, pages 101–108, 1999.
  • [23] H. A. Chipman, E. I. George, and R. E. Mcculloch. Making sense of a forest of trees. In Proceedings of the 30th Symposium on the Interface, pages 84–92, 1998.
  • [24] William J Clancey. The epistemology of a rule-based expert system—a framework for explanation. Artificial intelligence, 20(3):215–251, 1983.
  • [25] William J Clancey and Reed Letsinger. NEOMYCIN: Reconfiguring a rule-based expert system for application to teaching. Department of Computer Science, Stanford University, 1982.
  • [26] Peter Clark and Tim Niblett. The cn2 induction algorithm. Machine learning, 3(4):261–283, 1989.
  • [27] Juan M Corchado and Rosalía Laza. Constructing deliberative agents with case-based reasoning technology. International Journal of Intelligent Systems, 18(12):1227–1241, 2003.
  • [28] Paulo Cortez and Mark J Embrechts. Using sensitivity analysis and visualization techniques to open black box data mining models. Information Sciences, 225:1–17, 2013.
  • [29] Michael T Cox and Anita Raja. Metareasoning: Thinking about thinking. MIT Press, 2011.
  • [30] Mark W Craven and Jude W Shavlik. Using sampling and queries to extract rules from trained neural networks. In Machine Learning Proceedings 1994, pages 37–45. 1994.
  • [31] Mark W. Craven and Jude W. Shavlik. Extracting tree-structured representations of trained networks. In Proceedings of the 8th International Conference on Neural Information Processing Systems, NIPS’95, pages 24–30, Cambridge, MA, USA, 1995. MIT Press.
  • [32] David Crockett and Brian Eliason. What is data mining in healthcare?, 2016.
  • [33] Hoa Khanh Dam, Truyen Tran, and Aditya Ghose. Explainable software analytics. CoRR, abs/1802.00603, 2018.
  • [34] Houtao Deng. Interpreting tree ensembles with intrees. arXiv preprint arXiv:1408.5456, 2014.
  • [35] Pedro Domingos. Knowledge discovery via multiple models. Intell. Data Anal., 2(3):187–202, May 1998.
  • [36] Derek Doran, Sarah Schulz, and Tarek R. Besold. What does explainable AI really mean? A new conceptualization of perspectives. In Proceedings of the First International Workshop on Comprehensibility and Explanation in AI and ML, 2017.
  • [37] Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017.
  • [38] Stephan Dreiseitl and Lucila Ohno-Machado. Logistic regression and artificial neural network classification models: a methodology review. Journal of biomedical informatics, 35(5-6):352–359, 2002.
  • [39] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd innovations in theoretical computer science conference, pages 214–226, 2012.
  • [40] Ruth C Fong and Andrea Vedaldi. Interpretable explanations of black boxes by meaningful perturbation. In 2017 IEEE international conference on computer vision (ICCV), pages 3449–3457, 2017.
  • [41] Maria Fox, Derek Long, and Daniele Magazzeni. Explainable planning. CoRR, abs/1709.10256, 2017.
  • [42] Alex A. Freitas. Comprehensible classification models: A position paper. SIGKDD Explor. Newsl., 15(1):1–10, March 2014.
  • [43] Jerome H Friedman. Greedy function approximation: a gradient boosting machine. Annals of statistics, pages 1189–1232, 2001.
  • [44] Nicholas Frosst and Geoffrey Hinton. Distilling a neural network into a soft decision tree. arXiv preprint arXiv:1711.09784, 2017.
  • [45] Simson Garfinkel, Jeanna Matthews, Stuart S. Shapiro, and Jonathan M. Smith. Toward algorithmic transparency and accountability. Commun. ACM, 60(9):5–5, August 2017.
  • [46] Maarten Gelderman. The relation between user satisfaction, usage of information systems and performance. Information & management, 34(1):11–18, 1998.
  • [47] Nigel Gilbert. Explanation and dialogue. The Knowledge Engineering Review, 4(3):235–247, 1989.
  • [48] Leilani H. Gilpin, David Bau, Ben Z. Yuan, Ayesha Bajwa, Michael Specter, and Lalana Kagal. Explaining explanations: An approach to evaluating interpretability of machine learning. CoRR, abs/1806.00069, 2018.
  • [49] Alex Goldstein, Adam Kapelner, Justin Bleich, and Emil Pitkin. Peeking inside the black box: Visualizing statistical learning with plots of individual conditional expectation. Journal of Computational and Graphical Statistics, 24(1):44–65, 2015.
  • [50] Michael A Goodrich, Dan R Olsen, Jacob W Crandall, and Thomas J Palmer. Experiments in adjustable autonomy. In Proceedings of IJCAI Workshop on Autonomy, Delegation and Control: Interacting with Intelligent Agents, pages 1624–1629. Seattle, WA: American Association for Artificial Intelligence Press, 2001.
  • [51] Shirley Gregor and Izak Benbasat. Explanations from intelligent systems: Theoretical foundations and implications for practice. MIS quarterly, pages 497–530, 1999.
  • [52] Jonathan Grudin. The case against user interface consistency. Communications of the ACM, 32(10):1164–1173, 1989.
  • [53] Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri, Franco Turini, Fosca Giannotti, and Dino Pedreschi. A survey of methods for explaining black box models. ACM Comput. Surv., 51(5):93:1–93:42, August 2018.
  • [54] David Gunning. Explainable artificial intelligence (xai). Defense Advanced Research Projects Agency (DARPA), 2017.
  • [55] Chenlei Guo and Liming Zhang. A novel multiresolution spatiotemporal saliency detection model and its applications in image and video compression. IEEE Trans. Image Processing, 19(1):185–198, 2010.
  • [56] Isabelle Guyon and André Elisseeff. An introduction to variable and feature selection. Journal of machine learning research, 3:1157–1182, 2003.
  • [57] Mark A. Hall. Correlation-based feature selection for machine learning. Technical report, The University of Waikato, 1999.
  • [58] Satoshi Hara and Kohei Hayashi. Making tree ensembles interpretable. arXiv preprint arXiv:1606.05390, 2016.
  • [59] Sandra G Hart and Lowell E Staveland. Development of nasa-tlx (task load index): Results of empirical and theoretical research. In Advances in psychology, volume 52, pages 139–183. Elsevier, 1988.
  • [60] Lisa Anne Hendricks, Zeynep Akata, Marcus Rohrbach, Jeff Donahue, Bernt Schiele, and Trevor Darrell. Generating visual explanations. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–19, 2016.
  • [61] Robert R Hoffman and Gary Klein. Explaining explanation, part 1: theoretical foundations. IEEE Intelligent Systems, (3):68–73, 2017.
  • [62] Andreas Holzinger, Chris Biemann, Constantinos S Pattichis, and Douglas B Kell. What do we need to build explainable ai systems for the medical domain? arXiv preprint arXiv:1712.09923, 2017.
  • [63] Giles Hooker. Discovering additive structure in black box functions. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 575–580, 2004.
  • [64] Ronghang Hu, Jacob Andreas, Trevor Darrell, and Kate Saenko. Explainable neural computation via stack neural module networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 53–69, 2018.
  • [65] Nicholas R Jennings, Luc Moreau, David Nicholson, Sarvapali Ramchurn, Stephen Roberts, Tom Rodden, and Alex Rogers. Human-agent collectives. Communications of the ACM, 57(12):80–88, 2014.
  • [66] U. Johansson and L. Niklasson. Evolving decision trees using oracle guides. In 2009 IEEE Symposium on Computational Intelligence and Data Mining, pages 238–244, March 2009.
  • [67] Humar Kahramanli and Novruz Allahverdi. Rule extraction from trained adaptive neural networks using artificial immune systems. Expert Syst. Appl., 36(2):1513–1522, March 2009.
  • [68] Ioannis Katafigiotis, Itay M Sabler, Eliyahu M Heifetz, Avi Rosenfeld, Stavros Sfoungaristos, Amitay Lorber, Arie Latke, Vladimir Yutkin, Guy Hidas, Ezekiel H Landau, et al. “Stone-less” or negative ureteroscopy. A reality in the endourologic routine or avoidable source of frustration? Estimating the risk factors for a negative ureteroscopy. Journal of endourology, (To appear), 2018.
  • [69] Been Kim, Rajiv Khanna, and Oluwasanmi O Koyejo. Examples are not enough, learn to criticize! criticism for interpretability. In Advances in Neural Information Processing Systems, pages 2280–2288, 2016.
  • [70] Been Kim, Cynthia Rudin, and Julie A Shah. The bayesian case model: A generative approach for case-based reasoning and prototype classification. In Advances in Neural Information Processing Systems, pages 1952–1960, 2014.
  • [71] Jinkyu Kim, Anna Rohrbach, Trevor Darrell, John Canny, and Zeynep Akata. Textual explanations for self-driving vehicles. In Proceedings of the European Conference on Computer Vision (ECCV), pages 563–578, 2018.
  • [72] Akiva Kleinerman, Ariel Rosenfeld, and Sarit Kraus. Providing explanations for recommendations in reciprocal environments. In Proceedings of the 12th ACM Conference on Recommender Systems, pages 22–30. ACM, 2018.
  • [73] Bart P Knijnenburg, Martijn C Willemsen, Zeno Gantner, Hakan Soncu, and Chris Newell. Explaining the user experience of recommender systems. User Modeling and User-Adapted Interaction, 22(4-5):441–504, 2012.
  • [74] Anders Kofod-Petersen, Jörg Cassens, and Agnar Aamodt. Explanatory capabilities in the creek knowledge-intensive case-based reasoner. FRONTIERS IN ARTIFICIAL INTELLIGENCE AND APPLICATIONS, 173:28, 2008.
  • [75] Igor Kononenko. Inductive and bayesian learning in medical diagnosis. Applied Artificial Intelligence an International Journal, 7(4):317–337, 1993.
  • [76] Igor Kononenko. Explaining classifications for individual instances. In Proceedings of IJCAI’99, pages 722–726, 1999.
  • [77] Josua Krause, Adam Perer, and Kenney Ng. Interacting with predictions: Visual inspection of black-box machine learning models. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pages 5686–5697, 2016.
  • [78] R. Krishnan, G. Sivakumar, and P. Bhattacharya. Extracting decision trees from trained neural networks. Pattern Recognition, 32(12):1999–2009, 1999.
  • [79] Oh Byung Kwon and Norman Sadeh. Applying case-based reasoning and multi-agent intelligent system to context-aware comparative shopping. Decision Support Systems, 37(2):199–213, 2004.
  • [80] Pat Langley, Ben Meadows, Mohan Sridharan, and Dongkyu Choi. Explainable agency for intelligent autonomous systems. In AAAI, pages 4762–4764, 2017.
  • [81] Mark Last and Oded Maimon. A compact and accurate model for classification. IEEE Transactions on Knowledge and Data Engineering, 16(2):203–215, 2004.
  • [82] Nada Lavrač. Selected techniques for data mining in medicine. Artificial intelligence in medicine, 16(1):3–23, 1999.
  • [83] John D Lee and Katrina A See. Trust in automation: Designing for appropriate reliance. Human factors, 46(1):50–80, 2004.
  • [84] Tao Lei, Regina Barzilay, and Tommi Jaakkola. Rationalizing neural predictions. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 107–117, 2016.
  • [85] Benjamin Letham, Cynthia Rudin, Tyler H McCormick, David Madigan, et al. Interpretable classifiers using rules and bayesian analysis: Building a better stroke prediction model. The Annals of Applied Statistics, 9(3):1350–1371, 2015.
  • [86] Roy Lewicki and Barbara Benedict Bunker. Developing and Maintaining Trust in Working Relations, pages 114–139. 1996.
  • [87] Zachary Chase Lipton. The mythos of model interpretability. arXiv preprint arXiv:1606.03490, 2016.
  • [88] Tania Lombrozo. Simplicity and probability in causal explanation. Cognitive psychology, 55(3):232–257, 2007.
  • [89] Yin Lou, Rich Caruana, and Johannes Gehrke. Intelligible models for classification and regression. In The 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 150–158, 2012.
  • [90] Yin Lou, Rich Caruana, Johannes Gehrke, and Giles Hooker. Accurate intelligible models with pairwise interactions. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 623–631, 2013.
  • [91] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pages 4765–4774, 2017.
  • [92] Ryszard S Michalski and Kenneth A Kaufman. Learning patterns in noisy data: the aq approach. In Advanced Course on Artificial Intelligence, pages 22–38. Springer, 1999.
  • [93] Tim Miller. Explanation in artificial intelligence: insights from the social sciences. arXiv preprint arXiv:1706.07269, 2017.
  • [94] Marghny H. Mohamed. Rules extraction from constructively trained neural networks based on genetic algorithms. Neurocomput., 74(17):3180–3192, October 2011.
  • [95] Sina Mohseni and Eric D Ragan. A human-grounded evaluation benchmark for local explanations of machine learning. arXiv preprint arXiv:1801.05075, 2018.
  • [96] Grégoire Montavon, Wojciech Samek, and Klaus-Robert Müller. Methods for interpreting and understanding deep neural networks. Digital Signal Processing: A Review Journal, 73:1–15, 2018.
  • [97] Carina Mood. Logistic regression: Why we cannot do what we think we can do, and what we can do about it. European sociological review, 26(1):67–82, 2010.
  • [98] Patrick M Murphy and Michael J Pazzani. Exploring the decision forest: An empirical investigation of occam’s razor in decision tree induction. Journal of Artificial Intelligence Research, 1:257–275, 1993.
  • [99] Andrew Ortony and Derek Partridge. Surprisingness and expectation failure: what’s the difference? In IJCAI, pages 106–108, 1987.
  • [100] J. Ross Quinlan. Induction of decision trees. Machine learning, 1(1):81–106, 1986.
  • [101] Iyad Rahwan, Liz Sonenberg, and Frank Dignum. Towards interest-based negotiation. In Proceedings of the second international joint conference on Autonomous agents and multiagent systems, pages 773–780, 2003.
  • [102] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Why should I trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144, 2016.
  • [103] Ariella Richardson, Sarit Kraus, Patrice L Weiss, and Sara Rosenblum. Coach-cumulative online algorithm for classification of handwriting deficiencies. In AAAI, pages 1725–1730, 2008.
  • [104] Ariel Rosenfeld, Noa Agmon, Oleg Maksimov, and Sarit Kraus. Intelligent agent supporting human–multi-robot team collaboration. Artificial Intelligence, 252:211–231, 2017.
  • [105] Ariel Rosenfeld and Sarit Kraus. Strategical argumentative agent for human persuasion. In ECAI, volume 16, pages 320–329, 2016.
  • [106] Avi Rosenfeld, Zevi Bareket, Claudia V Goldman, Sarit Kraus, David J LeBlanc, and Omer Tsimhoni. Learning driver’s behavior to improve the acceptance of adaptive cruise control. In IAAI, 2012.
  • [107] Avi Rosenfeld, Zevi Bareket, Claudia V Goldman, David J LeBlanc, and Omer Tsimhoni. Learning drivers’ behavior to improve adaptive cruise control. Journal of Intelligent Transportation Systems, 19(1):18–31, 2015.
  • [108] Avi Rosenfeld, Vinay Sehgal, David G. Graham, Matthew R. Banks, Rehan J. Haidry, and Laurence B. Lovat. Using data mining to help detect dysplasia: Extended abstract. In 2014 IEEE International Conference on Software Science, Technology and Engineering, pages 65–66, 2014.
  • [109] Avi Rosenfeld, Inon Zuckerman, Erel Segal-Halevi, Osnat Drein, and Sarit Kraus. Negochat-a: a chat-based negotiation agent with bounded rationality. Autonomous Agents and Multi-Agent Systems, 30(1):60–81, 2016.
  • [110] Susana Rubio, Eva Díaz, Jesús Martín, and José M Puente. Evaluation of subjective mental workload: A comparison of swat, nasa-tlx, and workload profile methods. Applied Psychology, 53(1):61–86, 2004.
  • [111] Cynthia Rudin. Algorithms for interpretable machine learning. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1519–1519, 2014.
  • [112] Yvan Saeys, Inaki Inza, and Pedro Larrañaga. A review of feature selection techniques in bioinformatics. Bioinformatics, 23(19):2507–2517, 2007.
  • [113] Maha Salem, Gabriella Lakatos, Farshid Amirabdollahian, and Kerstin Dautenhahn. Would you trust a (faulty) robot?: Effects of error, task type and personality on human-robot cooperation and trust. In Proceedings of the Tenth Annual ACM/IEEE International Conference on Human-Robot Interaction, pages 141–148, 2015.
  • [114] Andrea Saltelli. Sensitivity analysis for importance assessment. Risk analysis, 22(3):579–590, 2002.
  • [115] Wojciech Samek, Thomas Wiegand, and Klaus-Robert Müller. Explainable artificial intelligence: Understanding, visualizing and interpreting deep learning models. arXiv preprint arXiv:1708.08296, 2017.
  • [116] Paul Scerri, David Pynadath, and Milind Tambe. Adjustable autonomy in real-world multi-agent environments. In Proceedings of the fifth international conference on Autonomous agents, pages 300–307. ACM, 2001.
  • [117] Roger C Schank. Explanation: A first pass. Experience, memory, and reasoning, pages 139–165, 1986.
  • [118] V. Schetinin, J. E. Fieldsend, D. Partridge, T. J. Coats, W. J. Krzanowski, R. M. Everson, T. C. Bailey, and A. Hernandez. Confident interpretation of bayesian decision tree ensembles for clinical applications. IEEE Transactions on Information Technology in Biomedicine, 11(3):312–319, May 2007.
  • [119] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, Dhruv Batra, et al. Grad-cam: Visual explanations from deep networks via gradient-based localization. In ICCV, pages 618–626, 2017.
  • [120] Raymond Sheh. “Why did you do that?” Explainable intelligent robots. In AAAI Workshop on Human-Aware Artificial Intelligence, 2017.
  • [121] Ben Shneiderman. Promoting universal usability with multi-layer interface design. ACM SIGCAPH Computers and the Physically Handicapped, (73-74):1–8, 2002.
  • [122] Tammar Shrot, Avi Rosenfeld, Jennifer Golbeck, and Sarit Kraus. Crisp: an interruption management algorithm based on collaborative filtering. In Proceedings of the SIGCHI conference on human factors in computing systems, pages 3035–3044, 2014.
  • [123] Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.
  • [124] Maarten Sierhuis, Jeffrey M Bradshaw, Alessandro Acquisti, Ron Van Hoof, Renia Jeffers, and Andrzej Uszok. Human-agent teamwork and adjustable autonomy in practice. In Proceedings of the seventh international symposium on artificial intelligence, robotics and automation in space (I-SAIRAS), 2003.
  • [125] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. CoRR, abs/1312.6034, 2013.
  • [126] Frode Sørmo and Jörg Cassens. Explanation goals in case-based reasoning. In Proceedings of the ECCBR 2004 Workshops, number 142-04, pages 165–174, 2004.
  • [127] Frode Sørmo, Jörg Cassens, and Agnar Aamodt. Explanation in case-based reasoning–perspectives and goals. Artificial Intelligence Review, 24(2):109–143, 2005.
  • [128] Sebastian Stein, Enrico H. Gerding, Adrian Nedea, Avi Rosenfeld, and Nicholas R. Jennings. Market interfaces for electric vehicle charging. J. Artif. Intell. Res., 59:175–227, 2017.
  • [129] Erik Strumbelj and Igor Kononenko. An efficient explanation of individual classifications using game theory. J. Mach. Learn. Res., 11:1–18, March 2010.
  • [130] Hui Fen Tan, Giles Hooker, and Martin T Wells. Tree space prototypes: Another look at making tree ensembles interpretable. arXiv preprint arXiv:1611.07115, 2016.
  • [131] Gabriele Tolomei, Fabrizio Silvestri, Andrew Haines, and Mounia Lalmas. Interpretable predictions of tree-based ensembles via actionable feature tweaking. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 465–474, 2017.
  • [132] David Traum, Jeff Rickel, Jonathan Gratch, and Stacy Marsella. Negotiation over tasks in hybrid human-agent teams for simulation-based training. In Proceedings of the second international joint conference on Autonomous agents and multiagent systems, pages 441–448. ACM, 2003.
  • [133] Bas C Van Fraassen. Empiricism in the philosophy of science. Images of science: Essays on realism and empiricism, with a reply from Bas C. van Fraassen, page 245, 1985.
  • [134] Kurt VanLehn. The relative effectiveness of human tutoring, intelligent tutoring systems, and other tutoring systems. Educational Psychologist, 46(4):197–221, 2011.
  • [135] A Vellido, E Romero, M Julià-Sapé, C Majós, À Moreno-Torres, J Pujol, and C Arús. Robust discrimination of glioblastomas from metastatic brain tumors on the basis of single-voxel 1h mrs. NMR in Biomedicine, 25(6):819–828, 2012.
  • [136] Alfredo Vellido, José David Martín-Guerrero, and Paulo JG Lisboa. Making machine learning models interpretable. In ESANN, volume 12, pages 163–172, 2012.
  • [137] Luca Viganò and Daniele Magazzeni. Explainable security. arXiv preprint arXiv:1807.04178, 2018.
  • [138] Charlotte S Vlek, Henry Prakken, Silja Renooij, and Bart Verheij. A method for explaining bayesian networks for legal evidence with scenarios. Artificial Intelligence and Law, 24(3):285–324, 2016.
  • [139] Tong Wang, Cynthia Rudin, Finale Doshi-Velez, Yimin Liu, Erica Klampfl, and Perry MacNeille. A bayesian framework for learning rule sets for interpretable classification. The Journal of Machine Learning Research, 18(1):2357–2393, 2017.
  • [140] Leanne S Whitmore, Anthe George, and Corey M Hudson. Explicating feature contribution using random forest proximity distances. arXiv preprint arXiv:1807.06572, 2018.
  • [141] Bo Xiao and Izak Benbasat. E-commerce product recommendation agents: use, characteristics, and impact. MIS quarterly, 31(1):137–209, 2007.
  • [142] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning, pages 2048–2057, 2015.
  • [143] Holly A Yanco and Jill Drury. Classifying human-robot interaction: an updated taxonomy. In Systems, Man and Cybernetics, 2004 IEEE International Conference on, volume 3, pages 2841–2846. IEEE, 2004.
  • [144] Fahri Yetim. A framework for organizing justifications for strategic use in adaptive interaction contexts. In ECIS, pages 815–825, 2008.
  • [145] Jason Yosinski, Jeff Clune, Anh Nguyen, Thomas Fuchs, and Hod Lipson. Understanding neural networks through deep visualization. arXiv preprint arXiv:1506.06579, 2015.
  • [146] Quan-shi Zhang and Song-Chun Zhu. Visual interpretability for deep learning: a survey. Frontiers of Information Technology & Electronic Engineering, 19(1):27–39, 2018.
  • [147] Ye Zhang and Byron Wallace. A sensitivity analysis of (and practitioners’ guide to) convolutional neural networks for sentence classification. arXiv preprint arXiv:1510.03820, 2015.
  • [148] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2921–2929, 2016.
  • [149] Yichen Zhou and Giles Hooker. Interpreting models via single tree approximation. arXiv preprint arXiv:1610.09036, 2016.
  • [150] Zhi-Hua Zhou, Yuan Jiang, and Shi-Fu Chen. Extracting symbolic rules from trained neural network ensembles. Ai Communications, 16(1):3–15, 2003.