The rapidly growing adoption of Artificial Intelligence (AI), and Machine Learning (ML) technologies using opaque deep neural networks in particular, has spurred great academic and public interest in explainability to make AI algorithms understandable by people. This issue appears in popular press, industry practices [h2o, arya2019one], regulations [gdpr], as well as hundreds of recent papers published in AI and related disciplines. These XAI works often express an algorithm-centric view, relying on “researchers’ intuition of what constitutes a ‘good’ explanation” [miller2018explanation]. This is problematic because AI explanations are often demanded by lay users, who may not have deep technical understanding of AI, but hold preconceptions of what constitutes useful explanations for decisions made in a familiar domain. As an example, one of the most popular approaches to explain a prediction made by an ML classifier, as dozens of XAI algorithms strive to do [guidotti2019survey], is by listing the features with the highest weights contributing to a model’s prediction. For example, a model predicting a patient having the flu may explain by saying “the symptoms of sneeze and headache are contributing to this prediction” [ribeiro2016should]. However, it is questionable whether such an explanation satisfies a doctor’s needs to understand the AI, or adds significant value to a clinical decision-support tool.
To close the gap between XAI algorithms and user needs for effective transparency, the HCI community has called for interdisciplinary collaboration [abdul2018trends] and user-centered approaches to explainability [wang2019designing]. This emerging area of work tends to either build on frameworks of human explanations from social science, or empirically study how explanation features impact user interaction with AI. In this paper, we take a complementary approach by investigating challenges faced by industry practitioners to create explainable AI products, with the goal of identifying gaps between the algorithmic work of XAI and what is needed to address real-world user needs.
Recently, an increasing number of open-source toolkits (e.g., [dalex, h2o, alibi, arya2019one]) are making XAI techniques, which produce various forms of explanation for “black-box” ML models, accessible to practitioners. However, little is known about how to put these techniques from research literature into practice. As we will show, it is challenging work to bridge user needs and technical capabilities to create effective explainability features in AI products. This kind of work often falls to those with a bridging role in product teams–the design and user experience (UX) practitioners, whose job involves identifying user needs, communicating with developers and stakeholders, and creating design solutions based on demands and constraints on both sides. We study, therefore, how AI explainability is approached by design and UX practitioners, explore together with them how XAI techniques can be applied in various products, and identify opportunities to better support their work and thus the creation of user-centered explainable AI applications.
Given the early status of XAI in industry practices, we anticipate a lack of established means to uncover user needs or a shared technical understanding. Therefore, we develop a novel probe to ground our investigation, namely an XAI algorithm-informed question bank. As an explanation can be seen as an answer to a question [miller2018explanation], we represent user needs for explainability in terms of the questions a user might ask about the AI. Drawing on relevant ML literature and prior work on question-driven explanations in non-ML domains, we create a list of prototypical user questions that can be addressed by current XAI algorithms. These questions thus represent the current availability of algorithmic methods for AI explainability, allowing us to explore how they can be applied in various AI products, and identify their limitations for addressing real-world user needs. Our contributions are threefold:
We provide insights into how user needs for different types of explainability are presented in various AI products. We suggest how these user needs should be understood, prioritized and addressed. We also identify opportunities for future XAI work to better satisfy these user needs.
We summarize current challenges faced by design practitioners to create explainable AI products, including the variability of user needs for explainability, the discrepancies between algorithmic explanations and human explanations and a lack of support for design practices.
We present an extended XAI question bank (Figure 1) by combining algorithm-informed questions and user questions identified in the study. We discuss how it can be used as guidance and tool to support the needs specification work to create user-centered XAI applications.
2.1 Explainable artificial intelligence (XAI)
Although XAI first appeared in expert systems almost four decades ago [clancey1983epistemology, swartout1983xplain], it is gaining widespread visibility as a field focusing on ML interpretability [carvalho2019machine]. The term explainability is used by the research community with varying scope. In much of the ML literature, XAI aims to make the reasons behind an ML model’s decisions comprehensible to humans [guidotti2019survey, lipton2016mythos, ribeiro2016should]. In a broader view, explainability encompasses everything that makes ML models transparent and understandable, including information about the training data, performance, etc. [arya2019one, hohman2019gamut]. Our view aligns with the latter.
Recent papers surveyed this rapidly growing field and identified its key research threads [adadi2018peeking, carvalho2019machine, gilpin2018explaining, guidotti2019survey, mohseni2018survey, ras2018explanation]. We will discuss taxonomies of XAI techniques in the next section. Another core thread is the evaluation of explanations, which answers whether an explanation is good enough, and how to compare different explanations. These questions are not only critical for choosing appropriate XAI techniques, but also underlie the development of intelligent systems that optimize the choice of explanation, such as interactive or personalized explanations [abdul2018trends, schneider2019personalized, weld2018challenge]. Toward this goal, many sought to define the desiderata of XAI [carvalho2019machine, guidotti2019survey, hohman2019gamut, robnik2018perturbation], including fidelity, completeness, stability, etc. Despite the conceptual discussions, there are few established means of quantifying explainability. Partly, the reason is that the effectiveness of an explanation is relative to the recipient, and on a philosophical ground, the question asked [bromberger1992we]. So the same explanation may be seen as more or less comprehensible to different users, or even to the same user engaged in a different understanding. Many therefore advocate that the evaluation of XAI needs to involve real users within the targeted application [doshi2017towards, hoffman2018metrics].
Given its recipient-dependent nature, it is clear that work on XAI must take a human-centered approach. By conducting a literature review in social science on how humans give and receive explanations, Miller identified a list of human-friendly characteristics of explanation that are not given sufficient attention in the algorithmic work of XAI, including contrastiveness (to a counterfactual case), selectivity, social process, focusing on the abnormal, seeking consistency with prior beliefs, etc. Wang et al. [wang2019designing] proposed a conceptual framework to connect XAI techniques and cognitive patterns in human decision-making to guide the design of XAI systems.
With a fundamental interest in creating user-centered technologies, the HCI community is seeing burgeoning efforts around designing and studying user interactions with explainable AI [binns2018s, cai2019effects, cheng2019explaining, dodge2019explaining, hohman2019gamut, kocielnik2019will, lai2018human, rader2018explanations]. As a literature analysis performed by Abdul et al. [abdul2018trends] shows, before this wave of work on ML systems, the HCI community has studied explainable systems in various contexts, most notably context-aware applications [bellotti2001intelligibility, lim2009assessing, lim2010toolkit], recommender systems [herlocker2000explaining], debugging tools [kulesza2015principles] and algorithmic transparency [diakopoulos2015algorithmic, sandvig2014auditing]. Specific to XAI, recent studies largely focused on empirically understanding the effect of explanation features on users’ interaction with and perception of ML systems, usually through controlled lab or field experiments. Notably, although explanations were found to improve user understanding of the AI systems, conclusions about their benefits for user trust and acceptance were mixed [cheng2019explaining, kocielnik2019will, lai2018human, poursabzi2018manipulating], suggesting potential gaps between algorithmic explanations and end user needs.
Our work shares the goal of bridging between the algorithm-centric XAI and user-centered explanations. In contrast to prior work centered around end users, we focus on the people that engage in this bridging work, namely UX and design practitioners. By studying their current practices, we explore the largely undefined design space of XAI and identify challenges for creating explainable AI products.
2.2 Supporting AI practitioners
We join a growing group of scholars studying the work of industry practitioners who create AI products [amershi2019guidelines, boukhelifa2017data, hohman2019gamut, holstein2019improving, muller2019data, rule2018exploration]. By better supporting their work, we can ameliorate downstream usability, ethical and societal problems associated with AI. For example, Boukhelifa et al. [boukhelifa2017data] interviewed 12 data scientists to understand their coping strategies around uncertainty in data science work, and proposed a process model for uncertainty-aware data analytics. Holstein et al. interviewed 35 ML practitioners to conduct the first investigation of commercial product teams’ challenges for developing fairer ML systems, and identified the disconnect between their needs and the solutions proposed in the fair ML research literature.
Most studies of AI practitioners focused on data scientists. As creating explainable AI products requires a user-centered approach, design practitioners should also play an indispensable role. Despite a growing body of HCI work on AI technologies, there is a lack of design guidelines for AI systems. One notable exception is a recent paper by Amershi et al. [amershi2019guidelines], which synthesized a set of 18 usability guidelines for AI systems. Several of these guidelines (e.g., make clear what the system can do, how well it can do, why it did what it did) are relevant to explainability, but they do not provide actionable guidance on how to actualize these capabilities. Meanwhile, recent papers explored design methods supporting the creation of explainable AI systems. Wolf [wolf2019explainability] proposed a scenario-based approach to identify user needs for explainability (“what people might need to understand about AI systems”) early on in system development. Eiband et al. [eiband2018bringing] proposed a stage-based participatory design process, which guides product-specific needs specification–what to explain, followed by iterative design of solutions–how to explain.
Our work is motivated by a similar pragmatic goal of supporting design practices of XAI. More specifically, in the face of increasingly available XAI techniques, we are interested in the design work that connects user needs and these technical capabilities. In particular, we probe the challenges to identify the suitability of XAI techniques. A recent stream of guidance in the public domain (e.g., [dalex, h2o, arya2019one]) on how to select among XAI algorithms suggests their suitability can be difficult to determine. More problematically, these guidelines are targeting data scientists, using criteria grounded in the development process (e.g., explaining data or features, by pre- or post-training). They do not directly address end user needs for understanding AI, nor support the navigation of the design space of XAI.
2.3 Question-driven explanation
Outside the ML field, many explored the space of user needs for explanation using a question-driven framework. Fundamentally, an explanation is “an answer to a (why-) question” [miller2018explanation]. These questions are also user and context dependent, described as “triggers” by Hoffman et al. [hoffman2018metrics] representing “tacitly an expression of a need for a certain kind of explanation…to satisfy certain user purposes of user goals.”
In the early generation of AI work, question-driven frameworks were used to generate explanations for knowledge-based systems [chandrasekaran1989explaining, gregor1999explanations, swartout1987making]. One notable work is AQUA [ram1989question], a reasoning model that uses questions to drive the generation of explanations and identify knowledge gaps for learning. AQUA was built upon a taxonomy of questions for explanations, including anomaly detection questions, hypothesis verification questions, etc. Silveira et al. provided a taxonomy of user questions about software to drive the design of help systems [silveira2001semiotic]. Building on it, Glass et al. [glass2008toward] investigated users’ explanation requirements in using an adaptive agent and showed that user needs for different types of explanation vary. In context-aware computing, Lim and Dey [lim2009assessing] developed a taxonomy of user needs for intelligibility by crowdsourcing user questions in multiple scenarios of context-aware applications. These questions were coded into intelligibility types, including input, output, conceptual model (why, how, why not, what else, what if) and non-functional types (certainty, control). This taxonomy enabled a toolkit for intelligibility [lim2010toolkit] that supports the generation of explanations for context-aware applications.
Inspired by the prior work, we use an XAI question bank, containing prototypical questions that users may ask for understanding AI systems, as a study probe representing user needs for AI explainability. Instead of using question taxonomies that are not specific to ML, we start by performing a literature review to arrive at a taxonomy of existing XAI techniques, and use it to guide the creation of user questions. Thereby we constrain the probe to reflect the current availability of XAI techniques, to understand how user needs for such explainability are presented in real-world AI products.
3 XAI question bank
We now describe how we developed the XAI question bank by first identifying a list of explanation methods supported by current XAI algorithms, for which we focus on those generating post-hoc explanations for opaque ML models [arrieta2019explainable, guidotti2019survey]. For the scope of this paper, we will leave out the technical details of the algorithms but provide references for interested readers. There have been many efforts to create taxonomies of XAI methods [adadi2018peeking, arrieta2019explainable, carvalho2019machine, gilpin2018explaining, guidotti2019survey, mohseni2018survey, molnar2018interpretable, ras2018explanation, samek2019towards]. Commonly, they differentiate between an explanation method–a pattern or mechanism to explain ML models–and specific XAI algorithms. One type of explanation method can be generated by multiple algorithms, which may vary in performance or applicability to specific ML models. Common dimensions to categorize explanation methods include: 1) The scope of the explanation, i.e. whether to support understanding the entire model (global) versus a single prediction (local); 2) The complexity of the model to be explained; 3) The dependency on the model used, i.e., whether the technique applies to any type of ML model or to only one type [adadi2018peeking]; and 4) The stage of model development to apply the explanation [carvalho2019machine].
Except for the first one, these dimensions are data scientist centric as they are concerned with the characteristics of the underlying model. For our purpose of mapping user questions, we seek a taxonomy that lists the forms of explanation as presented to users. For example, to explain how a model works, we disregard the complexity of the model, but instead differentiate between methods that describe the model logic as rules, decision trees or feature importance. Also, to identify user questions an explanation addresses, we believe it is sufficient to stay at the general mechanism, and ignore the specificity of presentation, such as whether feature importance is presented as texts or visualization [ribeiro2016should]. Guided by these principles, we found the taxonomy of explanators in Guidotti et al. [guidotti2019survey] closest to our purpose. Using it as a starting point, we consulted other survey papers and iteratively consolidated a taxonomy of explanation methods. In addition to the three categories in [guidotti2019survey]–methods that explain the entire model (global), an individual outcome (local), and inspect how the output changes with instance changes (inspect counterfactual), we added example based explanations [hohman2019gamut, molnar2018interpretable], since they represent a distinct mechanism to explain. Finally, we arrived at the taxonomy presented in the second column of Table 1.
|Category of Methods||Explanation Method||Definition||Algorithm Examples||Question Type|
|Explain the model (Global)||Global feature importance||Describe the weights of the features used by the model (including visualization that shows the weights of high-level features)||[henelius2014peek, lou2013accurate, nguyen2016multifaceted, tolomei2017interpretable]||How|
|Explain the model (Global)||Decision tree approximation||Approximate the model to an interpretable decision-tree||[gibbons2013cad, johansson2009evolving, krishnan1999extracting]||How, Why, Why not, What if|
|Explain the model (Global)||Rule extraction||Approximate the model to a set of rules, e.g., if-then rules||[augasta2012reverse, doshi2017towards, zhou2003extracting]||How, Why, Why not, What if|
|Explain a prediction (Local)||Local feature importance and saliency method||Show how the features of the instance contribute to the model’s prediction (including causes in parts of an image or text)||[lundberg2017unified, simonyan2013deep, ribeiro2016should, vstrumbelj2014explaining, zhou2016learning]||Why|
|Explain a prediction (Local)||Local rules or trees||Describe the rules or a decision-tree path that the instance fits to guarantee the prediction||[guidotti2018local, ribeiro2018anchors, singh2016programs]||Why, How to still be this|
|Inspect counterfactual||Feature influence or relevance method||Show how prediction changes corresponding to changes of a feature (often in a visualization format)||[apley2016visualizing, goldstein2015peeking, friedman2001greedy, krause2016interacting]||What if, How to be that, How to still be this|
|Inspect counterfactual||Contrastive or sensitive features||Describe features that will change the prediction if perturbed or absent||[dhurandhar2018explanations, zhang2018interpreting]||Why, Why not, How to be that|
|Example based||Prototypical or representative examples||Provide example(s) similar to the instance and with the same record as the prediction||[bien2011prototype, kim2014bayesian, koh2017understanding]||Why, How to still be this|
|Example based||Counterfactual example||Provide example(s) with small differences from the instance but with a different record from the prediction||[laugel2017inverse, mothilal2019explaining, wachter2017counterfactual]||Why, Why not, How to be that|
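The mapping between explanation methods and question types in Table 1 can also be read in the reverse direction, from a user question to the candidate methods. The following is a minimal sketch of that lookup in Python. The method names and question types are transcribed from Table 1, while the dictionary layout and the function name `methods_for` are our own illustrative assumptions rather than part of any XAI toolkit.

```python
# "Explanation Method" -> "Question Type" columns, transcribed from Table 1.
# The data structure and function name are illustrative assumptions.
METHOD_TO_QUESTIONS = {
    "Global feature importance": ["How"],
    "Decision tree approximation": ["How", "Why", "Why not", "What if"],
    "Rule extraction": ["How", "Why", "Why not", "What if"],
    "Local feature importance and saliency method": ["Why"],
    "Local rules or trees": ["Why", "How to still be this"],
    "Feature influence or relevance method": [
        "What if", "How to be that", "How to still be this"],
    "Contrastive or sensitive features": ["Why", "Why not", "How to be that"],
    "Prototypical or representative examples": ["Why", "How to still be this"],
    "Counterfactual example": ["Why", "Why not", "How to be that"],
}


def methods_for(question_type):
    """Return the explanation methods that Table 1 lists for a question type."""
    return [method for method, questions in METHOD_TO_QUESTIONS.items()
            if question_type in questions]


if __name__ == "__main__":
    for question in ["How", "Why", "Why not", "What if",
                     "How to be that", "How to still be this"]:
        print(question, "->", methods_for(question))
```

Read this way, a single question type such as Why can be served by several methods, so the table narrows the choice of method but does not determine it.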
To map the explanation methods to user questions they can answer, we consulted prior work that provided taxonomies of questions for explanations [lim2009assessing, lim2010toolkit, ram1989question, silveira2001semiotic]. The closest to our purpose is the intelligibility types by Lim et al. [lim2009assessing, lim2010toolkit], developed by eliciting user questions in scenarios of context-aware computing. In particular, the intelligibility types of How (system logic), Why (a prediction), Why not, What if are directly applicable to ML systems. By mapping these questions to explanation methods listed in Table 1, we identified two additional types of question that can be addressed by existing XAI techniques: 1) How to be that: what are the changes required, often implying minimum changes, for an instance to get a different targeted prediction; 2) How to still be this: what are the permitted changes, often implying maximum changes, for an instance to still get the same prediction. We note that the questions of What if, How to be… are considered counterfactual questions and best answered by inspection or example based explanations, which allow users to understand the decision boundaries of an ML model. Table 1 was reviewed by 4 additional experts working in the field of XAI.
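To make the two counterfactual question types concrete, the sketch below uses a toy linear classifier. Everything in it is a hypothetical assumption for illustration: the feature names, weights, bias and instance are invented, and the single-feature analysis is far simpler than the XAI algorithms cited in Table 1. For such a model, the change in one feature that moves the score onto the decision boundary answers both questions at once: it is the minimum change needed to get a different prediction (How to be that), and any smaller change in the same direction keeps the current prediction (How to still be this).

```python
# Toy linear classifier. Weights, bias and feature names are hypothetical.
WEIGHTS = {"income": 0.5, "debt": -1.0}
BIAS = -1.0


def score(instance):
    """Linear score; the decision boundary is at score == 0."""
    return BIAS + sum(WEIGHTS[name] * value for name, value in instance.items())


def predict(instance):
    return "approve" if score(instance) >= 0 else "reject"


def change_to_reach_boundary(instance, feature):
    """Signed change in one feature (others fixed) that moves the score to 0."""
    return -score(instance) / WEIGHTS[feature]


if __name__ == "__main__":
    applicant = {"income": 4.0, "debt": 2.0}
    print("Prediction:", predict(applicant))
    for feature in WEIGHTS:
        delta = change_to_reach_boundary(applicant, feature)
        # "How to be that": changing the feature by delta flips the prediction.
        # "How to still be this": smaller changes keep the prediction.
        print(f"{feature}: change by {delta:+.1f} to reach the boundary")
```

For the hypothetical applicant, the prediction flips either when income rises by 2.0 or when debt falls by 1.0, which is the kind of answer a How to be that question calls for.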
Taking a broad view on explainability, we also consider descriptive information that could make an ML model more transparent. We added three more types based on [hohman2019gamut, lim2009assessing, lim2010toolkit]–questions regarding model Input (training data), Output, and Performance. In the rest of the paper, we refer to these 9 types of questions as 9 explainability needs categories as they represent categories of prototypical questions users may ask to understand AI. For each category, we created a leading question (e.g., “Why is this instance given this prediction” for the Why category), and supplemented it with 2-3 additional example questions, inquiring about features and examples whenever applicable. (We instructed that ‘prediction’ refers to the AI output for an instance; in the context of a product, it can mean a score, recommendation, classification, answer, etc.) The list of questions developed in this step is shown in Figure 1 without an asterisk. We do not claim the exhaustiveness of this list, but deem it to be sufficient as a study probe.
4 Study design
We conducted semi-structured interviews with 20 UX and design practitioners recruited from multiple product lines at IBM. All but two (I-6 and I-20) informants worked on different products without shared AI models. Three informants were design team leads overseeing multiple products. The AI products included mature commercial platforms, systems in the testing phase, and internal platforms used by IBM employees. 50% of informants were female. All but two were based in the United States, in 7 different locations. Table 2 summarizes the primary areas of the products and informants’ job titles. Our samples focused on AI systems that support utility tasks such as analytics and decision-support, as explainability is critical for high-stakes tasks where people would want to understand the AI’s decisions [carvalho2019machine, doshi2017towards]. Informants were recruited from internal chat groups relevant to design and UX of AI products. The recruiting criteria indicated that one should have worked on the design of an AI product and had a good understanding of its users, and mentioned that the interview would focus on user needs around understanding the AI.
|Technology area||Job title||Informant IDs|
|Business decision support||HCI researcher, Designer, Designer, Data scientist, User researcher||I-4, I-5, I-12, I-17, I-19|
|Medical analytic or decision support||Product lead, Design researcher, Design researcher, Designer, Design researcher||I-1, I-6, I-7, I-11, I-20|
|AI model training or customization tools||Designer, Project manager, Designer||I-10, I-14, I-15|
|Human resource support||Designer||I-3|
|Enterprise social||User researcher||I-9|
|Natural resource analytic||UX researcher||I-2|
|Customer care chatbot||UX researcher||I-13|
|Multiple areas||Design team leads||I-8, I-16, I-18|
We noticed that the current status of explainability in commercial AI products varies–about two thirds of the products (68.8%) have descriptive explanations about the data or algorithm, only a subset (37.5%) provide explanations for individual decisions, and certain products (e.g., chatbot) have neither. To explore the design space of XAI, we were interested in user needs for explainability uncovered by the design practitioners instead of the current system capabilities. The XAI question bank was able to scaffold the discussions, both to enumerate the explainability needs categories, and to ground the discussion on user questions instead of venturing into the technical details.
Using MURAL–a visual collaboration tool, we created a card for each question category listed in Figure 1, with the leading and example questions (without an asterisk). Informants went through each card and discussed whether they encountered these questions from users; if not, we asked whether they thought the questions would apply and in what situations. After pilot testing, for efficiency, we combined the Why and Why not into one card to represent user needs to understand a prediction; and What if, How to be that, How to still be this into one card to represent user needs to understand counterfactual cases. Thus there were 6 cards plus a blank card if one wanted to create an additional category. If time permitted, we asked informants to sort the cards according to their priority to address, and elicited the reasons for the ordering.
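The card structure described above can be written down as a small sketch, which also checks that the 9 explainability needs categories collapse into 6 cards. The category names come from the text; the variable names and the list-of-lists layout are illustrative assumptions, and the blank card is left out because it carries no predefined category.

```python
# The nine explainability needs categories named in the text.
NEEDS_CATEGORIES = [
    "Input", "Output", "Performance", "How",
    "Why", "Why not",
    "What if", "How to be that", "How to still be this",
]

# Grouping into interview cards after pilot testing (identifier names are
# illustrative assumptions; the optional blank card is not modeled).
CARDS = [
    ["Input"],
    ["Output"],
    ["Performance"],
    ["How"],
    ["Why", "Why not"],                                     # understand a prediction
    ["What if", "How to be that", "How to still be this"],  # counterfactual cases
]


def card_for(category):
    """Return the index of the card that contains a needs category."""
    for index, card in enumerate(CARDS):
        if category in card:
            return index
    raise KeyError(category)
```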
Interviews lasted 45-60 minutes and were conducted remotely using a video conferencing system and MURAL. We started by asking informants to pick an AI product they worked on and had good knowledge of the users, in which they saw user needs for understanding the AI. We asked them to describe the system and the AI components. They could either use screen sharing or send screenshots to show us the system. We then asked whether the users had needs to understand the AI, and probed on why, when and where they had such needs (or lack thereof), and how the needs could be addressed, currently or speculatively. We then asked informants to reflect on what questions users would ask about the AI and to list as many as they could. User questions were also added to MURAL by the researchers if they appeared in other parts of the discussion. Thereby, we gathered user questions in a bottom-up fashion that allowed us to identify gaps in the algorithm-informed XAI question bank. It also prepared informants to move to discussions around the question cards. We closed the interview by asking informants to reflect on common challenges to build explainability features in AI products, and what kind of support they wished to have. For the three informants in lead roles, we focused on discussing the general status of explainability in AI products.
Around 1000 minutes of interviews were recorded and transcribed, from which we extracted 607 passages broadly relevant to explainability. We performed open coding and axial coding on these passages as informed by Grounded Theory research [corbin2015basics]. The iterative coding was conducted by one researcher, with frequent discussions with the other researchers. We returned to the passages, interview video and the AI products repeatedly as necessary. The iterative coding process resulted in a set of 24 axial codes. We combined them into selective codes to be discussed as the main themes in the results section, where the axial codes are presented in bold.
Two additional sets of codes were applied: 1) We identified 170 user questions that appeared in the question-listing activity and the rest of the interviews. 2) We coded these questions and other passages, wherever applicable, with the explainability needs category. The intersection of the two sets of codes yielded 124 covered questions, i.e., those covered by the categories of the question bank, and the remaining 46 uncovered questions.
To perform gap analysis on the XAI question bank, we followed two steps. For the covered questions in each needs category, we identified new forms of questions that were not covered by the original example questions, as shown with asterisks in Figure 1. By forms, we mean that we grouped together questions with the same intent but phrased differently. For example, “how was the data created”, and “where did the data come from” were both regarding the source of the training data, and covered by an original question in the Input category, while “what is the sample size” would be regarding a different characteristic of the input. In the second step, we examined the 46 uncovered questions. We first excluded 22 questions not generalizable to AI products, such as “what is the summary of the article?”. We then iteratively grouped and coded the intent of the remaining 24 questions and identified 5 additional forms of question in the Others category in Figure 1. Insights from the analysis will be discussed in the results section.
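The question counts reported in the coding and gap analysis are related by simple subtraction, which the short sketch below makes explicit. The numbers are those stated in the text; the function and key names are illustrative assumptions.

```python
def gap_analysis_counts(total_questions, covered, excluded):
    """Derive the uncovered and remaining question counts from the reported totals."""
    uncovered = total_questions - covered   # questions outside the question bank
    remaining = uncovered - excluded        # questions left for grouping and coding
    return {"uncovered": uncovered, "remaining": remaining}


if __name__ == "__main__":
    # 170 user questions, 124 covered by the question bank, 22 excluded as
    # not generalizable to AI products (all figures taken from the text).
    print(gap_analysis_counts(total_questions=170, covered=124, excluded=22))
```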
The results are divided into two parts. We start by discussing the general themes that emerged in the interviews around the design work to create explainable AI products, which highlight some of the gaps between the algorithmic perspective of XAI and the practices to address user needs to understand AI. We then discuss how each category of user needs for explainability is presented in real-world AI products and based on that, reflect on the opportunities and limitations of XAI work.
5.1 From XAI algorithms to design practices
5.1.1 The diverse motivations for and utility of explainability
The historical context for the surge of XAI can be attributed to a fear of lacking understanding of and control over increasingly complex ML models. Explanation is often embraced as a cure for “black box” models to gain user trust and adoption. So a common pursuit is to produce interpretable, often simplified, descriptions of model logic to make an opaque ML model seen as transparent. This is a necessary effort, but insufficient to deliver a satisfying user experience if we ignore users’ motivation for explanations. As I-8 put it: “Explainability isn’t just telling me how you get there, but also, can you expand on what you just told me…explanation has its own utility”.
We identified several utility goals driving user demands for explanations of AI. In the context of AI-assisted decision-making, explanations are most frequently sought to gain further insights or evidence, as users are not satisfied by merely seeing a recommendation or score given by the AI. There are several ways people use these insights. When seeing disagreeable, unexpected or unfamiliar output, explanations are critical for people to assess the AI’s judgment to make an informed decision. Even when users’ decision aligns with the AI’s, explanations could help enhance decision confidence or generate hypotheses about the causality for follow-up actions, as illustrated by the comment of I-5, who worked on a tool supporting supply chain management: “users need to know why the system is saying this will be late because the reason is going to determine what their next action is…If it’s because of a weather event, so no matter what you do you’re not going to improve this number, versus something small, if you just make a quick call, you can get that number down.” In some cases, users also deem explanations of the AI’s decision as potential mitigation of their own decision biases.
To appropriately evaluate the capability of the AI system is identified as the second theme of motivation, both to determine the overall system adoption (e.g., evaluating data quality, transferability to a new context, whether the model logic aligns with domain knowledge), and at the operational level to beware of the system’s limitations. I-6 commented on why explanations matter for users of a medical imaging system: “There is a calibration of trust, whether people will use it over time. But also saying hey, we know this fails in this way.” We note that appropriating trust should be distinguished from enhancing trust. Though from a product team’s perspective, the concern is often on users’ under-trusting of the AI system and explanations are sought to improve adoption.
The third theme of motivation for explainability is to adapt usage or interaction behaviors to better utilize the AI. I-7 described users’ desire to understand how the AI extracted information from clinic notes so they could adapt their note-taking practices. I-17 mentioned that users of a sales inventory management tool would want to focus on cases where the AI prediction was likely to err. I-13 commented that explanation could suggest to chatbot users what kind of things they could ask. Furthermore, explanations could also convince users to invest in the system (e.g., by granting access to personal information or giving feedback) “if they know how the system will improve” (I-11).
Several informants working on AI systems supporting analysts’ work or model training tools considered explanations an integral part of a “feedback loop” (I-11) to improve AI performance. Such needs are not only seen in debugging tools [hohman2019gamut], but also in cases where the user could manipulate the data or correct the instance: “[Explaining] why it thinks we are where we are and the opportunity to say, ‘no, I need you to just understand that we’re in Phase 2’” (I-5).
Last but not least, informants reflected on their ethical responsibilities to provide explanations: “What are we responsible for as creators of tools… whether it’s out of the kindness of our hearts or whether it’s because there’s a true risk to others or society… We have to provide that level of explainability” (I-8).
While some of these motivations have been discussed conceptually in prior work [adadi2018peeking, arrieta2019explainable, carvalho2019machine, doshi2017towards, guidotti2019survey], our study provides concrete examples from real-world AI products. It is worth noting that the motivation for explainability is grounded in users’ system goals such as improving decision-making, so explanation serves not merely to provide transparency but to support downstream user actions such as adapting interactions or acting on the decisions. Unpacking the motivation and downstream user actions should ultimately guide the selection of explanation methods. For example, if the goal is to gain further insights, example-based explanations could be more useful than feature-based explanations that describe the algorithm’s logic, as I-2 described: “[Users of natural resource analytic tools] already rely a lot on the analogy… [Similar examples] are very good to their study, e.g. give clues on which year this was formed.”
Moreover, unpacking the motivation may help foresee the limitations of algorithmic explanations and fill the gaps in designing user experiences. For example, if the motivation is to mitigate biases, then users may desire to see “both positive and negative evidence” (I-1). If it is to support adaptation of interaction, the system could supplement information on “this is what other people do” (I-13). Some informants criticized designing for the mental model of AI as a decision-maker explaining its rationale, and argued for focusing on what utility explanations could provide to support users’ end goals. As I-1, who worked on a clinical decision-support tool, put it: “[explanations by system rationale] are essentially ‘this is how I do it, take it or leave it’. But doctors don’t like this approach…Thinking that [AI is] giving treatment recommendations is the wrong place to start, because doctors know how to do it. It’s everything that happens around that decision they need help with… more discussions about the output, rather than how you get there.”
5.1.2 In quest for human-like explanations
Explanation is an integral part of human communication, and invites preconceptions of what constitutes effective and natural ways to explain. Informants were constantly dealing with discrepancies between algorithmic and human explanations. Some of the discrepancies are inherent to the mechanisms of AI algorithms, such as using features or learned patterns that are not aligned with how people make decisions. Others lie in forms of explanation that are constrained by technical capabilities and foreign to human interactions. For example: “People have this unspoken norm, I trust you and if you are not sure you would let me know. But nobody goes around saying how confident they are in the thing that they’re saying. It may be implied in the language they’re using. So a system has high precision but 67% confidence…it is a stupid and hard to use metric” (I-5).
Several informants attempted to mimic in their design work how people, especially domain experts, explain. Aligning with how humans explain aligns user perception of the AI with the existing mental model of decision-making in the domain, which prior work suggests is critical to building trust [springer2019progressive]. This is best exemplified by I-1’s work in designing explanations for a clinical decision-support system that performs information extraction from medical literature: “We mirror the way a doctor would do. So if a doctor was asked, how would you go and find the evidence? … You went to PubMed you found a paper, the paper matches my patient… you’re showing me the statements in the paper [on whether] it was a good or bad idea, and putting all that together…So when you manage to reflect with AI literally how the doctor thinks about the problem, the black box kind of disappears.”
We identified several themes on the desirable properties of human explanation that echoed Miller’s arguments on informing XAI with how humans explain [miller2018explanation]. First, explanations are selected, often focusing on one or two causes from a sometimes infinite number of causes [miller2018explanation]. Informants discussed the importance of selectivity as “a balance of providing enough information that is trustworthy and compelling without overwhelming the user” (I-11) and acknowledged that “AI will have a degree of [randomness] and may not be 100% explainable” (I-8). Second, explanations are social, as part of a conversation and presented based on the recipient’s current beliefs [miller2018explanation]. This social aspect is not only seen in tailoring explanations “for people with different backgrounds” (I-4), but also in accommodating the evolving needs for explanation as one builds understanding and trust during the interaction process: “once we trust it, it’s about going deeper into it, the kind of questions goes from broad to ultra-specific” (I-8).
The selective and social nature of explanation has led many to argue that XAI has to be interactive or even conversational [madumal2019grounded, miller2018explanation, weld2018challenge], tailoring explanation to different questions asked by different users, who would also ask follow-up questions to keep closing the gap of understanding, a process known as grounding in human communication [clark1991grounding]. Following prior work [lim2009assessing, lim2010toolkit, ram1989question, weld2018challenge], we postulate that a question-driven framework provides a viable path to interactive explanations.
5.1.3 XAI: challenges and needs in design practices
It is challenging work to create design solutions that bridge user needs for explainability and technical capabilities. The satisfaction of user needs is frequently hampered by the current availability of XAI techniques, which we will discuss in detail in the next section. Informants also had to work with other product goals that are at odds with explainability, such as protecting proprietary data and avoiding legal or marketing concerns from exposing the AI algorithm. Sometimes, explainability presents challenges to other aspects of user experience: “any opportunities we have to give them more explainability comes at the cost of the seamless integration. And [doctors] are just so clear that not breaking their workflow is the most important factor to their happiness” (I-6). Or, it might expose corner cases or rationales that some individuals found “wrong” (I-2), “over-simplified” (I-9), or “outdated” (I-5), resulting in unnecessary user aversion and making the product “victim of trying to be too transparent” (I-2).
In addition, unlike XAI algorithmic work’s focus on one or a class of AI algorithm, creating explainable AI products requires a holistic approach that links multiple algorithms or system components to accommodate users’ goal of better understanding and interacting with the system, as described by I-4: “There is the traditional what we think about XAI, explaining what the model is doing. But there is this huge wrapper or the situation around it that people are really uncertain… what do I need to do with this output, how do I incorporate it into other processes, other tools? So it is thinking about it as part of complex systems rather than one model and one output.”
In short, inherent tension often exists between explainability and other system and business goals. Design practitioners often act as the advocate for explainability, but the realization requires teamwork with data scientists, developers and other stakeholders. Their advocacy is often hindered by skill gaps to engage themselves and the team in “finding the right pairing to put the ideas of what’s right for the user together with what’s doable given the tools or the algorithms that they’re using” (I-8), and by the cost of time and resources that a product team may be unwilling to invest under a release schedule. These challenges can potentially be overcome by having design support that helps sensitize designers to the design opportunities in algorithmic explanations [yang2018machine] and enables conversations with the rest of the product team, as expressed by many informants. We summarize informants’ comments on desirable support for designing XAI in two areas:
Guidance for explainability needs specification, for which we saw requests for both: 1) general principles of what types of explainability should be provided, as heuristic guidelines that a product can be developed or evaluated with; 2) guidance to identify product, user, and context specific needs to help the product team prioritize the effort.
Guidance for creating explainability solutions to address user needs, paired with example artifacts (e.g., UI elements, design patterns), to support the exploration of tangible solutions and communication with developers and stakeholders.
We note that the two areas correspond to the what to explain and how to explain stages in Eiband et al.’s design process for transparent interfaces [eiband2018bringing]. We argue that the question bank could potentially support needs specification work, as it essentially lays out the space of users’ prototypical questions to understand AI systems. For example, it may guide an AI system to be equipped with answers to common user questions. The above requests suggest the need to further understand the key factors that may lead to the variability of user questions, and how these questions should be appropriately answered. We work towards these goals in the next section.
5.2 Understanding user needs for explainability
We use the explainability needs category codes to guide our analysis on each category. We focus on two questions: 1) The variability of the explainability needs, i.e., what factors make a category of user questions more or less likely to be asked. 2) The potential gaps between algorithmic explanations and user needs, by examining passages coded as design challenge, and the additional questions identified in the gap analysis (Figure 1). To help answer the former, we first discuss key factors that may lead to the variability of explainability needs, which we identified by coding informants’ reasons to include, exclude or prioritize a needs category.
Motivation for explainability: The diverse motivations for demanding explainability discussed in the last section could lead to wanting different kinds of explanation.
Usage point: Informants mentioned common points during the usage of AI systems where certain types of explainability were of particular interest, including on-boarding, reliance or delegation to AI, facing abnormal results, system breakdown, and seeing changes in the system.
Algorithm or data type: Different algorithms invoke different questions. For lay users, it might be more relevant to consider the type of data the AI is used with rather than specific algorithms, e.g., tabular data, text, images or video.
Decision context: We identified codes describing the nature of the decision context that led to prominent needs for certain types of explainability, including outcome criticality, time-sensitivity, and decision complexity.
User type: Codes describing the characteristics of users include AI knowledge, domain knowledge, attitude towards AI, roles or responsibilities.
In prior work, the variability of user needs for explainability has been discussed regarding the roles of the users [arrieta2019explainable, arya2019one, hind2019explaining, samek2019towards, weller2017challenges], e.g., regulators, model developers, managers assessing the tool, decision-makers, consumers. The diverse criteria used by our informants suggest many other factors to consider for the suitability of XAI techniques. This paper does not conclude on how these factors vary user needs. Rather, they should be seen as sensitizing concepts by Bowen’s [bowen2006grounded] and Ribes’s [ribes2017notes] definitions: they “tell where to look but not what to see.” The sheer number of these factors highlights the challenge of pre-defining users’ explainability needs, vindicating the recent effort [eiband2018bringing, wolf2019explainability] to provide structured guidance to support empirically identifying application-specific user needs. Below we present informants’ discussions on each category of explainability needs and highlight how these factors heighten these needs (in italic).
Understanding training data for the AI model was most frequently seen to serve the motivation to appropriately evaluate AI capabilities for use. It was considered a prominent need during the on-boarding stage, and by both the decision-makers and people in quality-control roles. Explanations of data were also important in cases where the users could directly manipulate the data to either adapt the usage to better utilize the AI or to improve the AI performance.
Additional questions identified from the gap analysis indicate a desire to gauge the AI’s limitations by inquiring about the sample size, potential biases, sampling of critical sub-groups and missing data. Additional codes include understanding the system’s compliance with regulations regarding data sampling, and the transferability of the AI model: “Not necessarily source, but more conceptual like…[are we] making the solutions based on what occurred yesterday” (I-4). These patterns imply that users demand comprehensive transparency of training data, especially its limitations.
While understanding the output is often a neglected aspect in algorithmic work of XAI, we saw frequent questions on it, indicating users’ desire to understand the value of the AI system to appropriately evaluate the capability and to better utilize the AI, often in the on-boarding stage or when dealing with complex decisions. Explaining output and explaining input/data were considered as “static explanations” that are more likely to come up in the early stage of system usage than in frequent “day-to-day, or transaction-to-transaction interactions” (I-8).
The most frequently asked questions did not concern descriptive information about the algorithmic output, but inquired at a high level how to best utilize the output. We also identified two additional questions: “the scope of the capability”, and “how the output impacts other system components.” Addressing such user needs requires contextualizing the explanation of the system’s output in downstream system tasks and the users’ overall workflow.
To our surprise, the performance category was repeatedly ranked at the bottom, especially for users without AI background and in decision contexts considered less critical. There was a common hesitation among informants to present ML performance metrics such as accuracy, not only because a numerical value could be hard for lay users to interpret, but also because there may be a discrepancy between performance on the test data and on the actual data, creating a different “experienced accuracy” [yin2019understanding] that might deter users. Some also believed that small differences in these metrics would not change how users interact: “Technically that’s great, but, it’s still not a hundred… there’s always going to be work that the users have to do to verify or double check” (I-4).
As many informants pointed out, and as suggested by the additional questions, the goal of explaining performance should be to help users understand the limitations of the AI, and to make it actionable so as to answer “Is the performance good enough for….” There are constraints of technical capabilities. For example, confidence scores were repeatedly dismissed as not providing enough actionability: “[users] struggle to really understand, does it mean it’s going to do what I want it to do, or, can I trust it?” (I-15). Regarding the additional question “What kind of mistake does it make”, informants mentioned that the precision-recall trade-off is a deliberately decided limitation that should be explained as it might change users’ course of actions [kocielnik2019will]: “It’s use case dependent… for the [doctors] if they miss a tumor, that’s a life changing. So they have a very high tolerance for false positives” (I-7).
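As a minimal illustration of why the precision-recall trade-off is a deliberate design decision, the sketch below computes both metrics for the same set of model scores at two decision thresholds. The scores, labels, thresholds and the function name `precision_recall` are hypothetical and are not taken from any product in the study.

```python
from typing import List, Tuple


def precision_recall(
    scores: List[float], labels: List[int], threshold: float
) -> Tuple[float, float]:
    """Return (precision, recall) when scores >= threshold are predicted positive."""
    tp = fp = fn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1  # correctly flagged case
        elif predicted_positive and label == 0:
            fp += 1  # false alarm
        elif not predicted_positive and label == 1:
            fn += 1  # missed case
    # Convention (an assumption): report 1.0 when a denominator is empty.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall


# Hypothetical model scores and ground-truth labels (1 = condition present).
scores = [0.95, 0.80, 0.65, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0]

for threshold in (0.7, 0.35):
    p, r = precision_recall(scores, labels, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

With the stricter threshold no false positive is raised but one positive case is missed; with the lower threshold every positive case is caught at the cost of one false alarm. The latter corresponds to the high tolerance for false positives that I-7 described for doctors.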
5.2.4 How–global model
Informants recognized the importance of providing global explanations on how the AI made decisions, both to help users appropriately evaluate the system capabilities, and build a mental model to better interact with or improve the system. Such needs were prominent in cases where users were in a quality-control role, or in a position able to adjust the model or the data-collection process: “The company really care about which of these attributes are the most important… then they will forward the manufacturer to include those in the data” (I-17). Informants also agreed that users with AI or analytic background were more likely to seek global explanations.
As Table 1 shows, to answer the How question, XAI algorithms commonly employ ranked features, decision trees or rules. However, some informants were referring to high-level descriptions, such as “I would just say keywords matching, it is intuitive, and it’s been around” (I-3). Some were also concerned about fitting a complete How explanation into the users’ workflow: “I can’t imagine [doctors are] going into their workflow and be like, I’m so busy, let me read more about this AI. But, they would probably want some kind of confirmation about how it makes decisions” (I-11). So the design challenge is to identify the appropriate level of detail to explain the model globally. This challenge is reflected in the question bank as well. While most XAI methods focus on answering “What is the overall logic”, we discovered that many questions were simply asking about the top features or whether a certain feature was used, while a small set of questions by users with AI background concerned the technical details of the model.
5.2.5 Why, Why not–local prediction
Understanding a particular decision was often ranked at the top, and appeared in user questions mentioned for all products. These questions were naturally raised after a surprising or abnormal event: “For everyday interactions, most likely it’s how did the system give me this answer? Not just any answer, but all of a sudden, here’s this thing that I’m [not expecting] seeing” (I-8). This pattern is pointed out by Miller [miller2018explanation] as the contrastive nature of human explanations, which often implicitly ask Why not the expected event. We observed a shared struggle with available technical solutions answering Why but not Why not. Several informants working on text-based ML commented on the inadequacy of the common approach of highlighting keywords that contribute to the prediction: “even though we explained conceptually how it’s working, it wouldn’t be able to explain that error. So it would actually be counter-intuitive why it should make that error” (I-4). I-17 discussed the limitation of a state-of-the-art explanation algorithm, LIME [robnik2018perturbation], which generates feature importance for “black-box” ML models. She found the static explanation to be unsatisfying: “LIME would say ‘it is boot cut which is why [it’s not going to sell]’, but would it be different if it was a skinny cut?”
Many current XAI algorithms focus on the Why question. We note that a challenge for algorithmic explanations is that the contrastive outcome is often not explicitly available to the model. These observations again suggest the benefit of interactive explanations, which allow users to explicitly reference the contrastive outcome and ask follow-up What if questions.
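The follow-up What if question raised by I-17 can be sketched as a simple probe that re-queries a model with a user-specified change to one instance. In the sketch below, `toy_sales_model`, its scoring rules, the feature names and the helper `what_if` are all hypothetical stand-ins; this is not how LIME or any product discussed by informants works.

```python
from typing import Callable, Dict, Tuple

Instance = Dict[str, object]


def toy_sales_model(item: Instance) -> int:
    """Hypothetical stand-in for a trained model.

    Returns a sell-through score between 0 and 100.
    """
    score = 50
    if item["cut"] == "skinny":
        score += 30
    elif item["cut"] == "boot":
        score -= 20
    if item["price"] > 80:
        score -= 10
    return max(0, min(100, score))


def what_if(
    model: Callable[[Instance], int], instance: Instance, changes: Instance
) -> Tuple[int, int]:
    """Return (original prediction, prediction after applying `changes`).

    The original instance is copied, not modified.
    """
    counterfactual = {**instance, **changes}
    return model(instance), model(counterfactual)


jeans = {"cut": "boot", "price": 60}
before, after = what_if(toy_sales_model, jeans, {"cut": "skinny"})
print(f"boot cut: {before}, skinny cut: {after}")
```

The user, rather than the algorithm, supplies the contrast ("skinny cut" instead of "boot cut"), which is the piece of information that a static feature-importance explanation does not have access to.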
5.2.6 What if, How to be–inspecting counterfactual
This category of explainability needs was not ranked high, and informants mentioned only 3 related user questions. Currently, these kinds of explanation are not widely adopted in commercial AI products. As prior work suggested, awareness of new types of explainability could change user demand [lim2009assessing]. In fact, informants recognized its potential utility as system features to test different scenarios for users to gain further insights for the decision, and to understand the boundary of system capabilities to enable adapting interaction behaviors. Informants also noted that such features align with how data scientists currently debug to improve ML models. For example, I-4 was excited to consider how What if explanations might help supply chain managers make decisions: “you can run different scenarios… the system can make an initial recommendation and then they can tweak it to see, the impact on the cost after that.” I-13 speculated that How to be that explanations (how the chatbot would understand differently) could help chatbot users better phrase their queries. I-15, working on a tool for customizing entity recognition models, commented that seeing how instance changes impact the output could help users debug the training data.
As seen in Table 1, there is a growing collection of XAI techniques addressing the counterfactual questions. However, currently the feature influence methods are mostly used in data science tools [hohman2019gamut]. Contrastive feature and example based methods are relatively new areas of XAI work [dhurandhar2018explanations, wachter2017counterfactual, zhang2018interpreting]. Our results suggest their potentials as utility features in a broad spectrum of AI products. Future work should explore these potentials and sensitize practitioners to these possibilities.
5.2.7 Additional explainability needs
We also identified a set of questions that were not covered by the algorithm-informed needs categories. They point to some additional areas of interest that users have for understanding AI. One area is understanding the change and adaptation of AI, both in terms of the possible changes in the system, and how users can change the system. Other areas are follow-up questions that further inquire why certain features or data are used, and terminological questions such as “what do you mean by…”, both of which may naturally emerge in an interactive explanation paradigm. Lastly, some users might be interested in knowing other people’s usage of the system, which suggests a new mechanism for an AI system to provide social explanations with regard to other users’ actions.
With widespread calls for transparent and responsible AI, industry practitioners are eagerly taking up the ideas and solutions from the XAI literature. However, despite recent effort toward a scientific understanding of human-AI interaction [dodge2019explaining, narayanan2018humans, zhu2018explainable], XAI research is still struggling with a lack of understanding of real-world user needs for AI transparency, and has so far given little consideration to what practitioners need to create explainable AI products. By focusing on design practitioners, a role that in particular connects user needs and technical solutions, our study suggests the following directions both for algorithmic work to close the gaps in addressing user needs for explainability, and for design support to reduce technical and practical barriers to creating user-friendly XAI products.
XAI research should direct its attention to techniques that address user needs, and we suggest a question-driven framework to embody these needs. Our results point to a few common questions and their desired answers that future work of XAI should explore, for example, How question answered by multi-level details describing the algorithm, Why not question referencing an expected alternative outcome, and How will the system change. Considering the coverage of user questions, especially common questions and new questions identified, could help the community move toward more human-centric effort. The question bank presented in this paper is just a starting point. Future work could continue building the repository by directly eliciting questions from end users of different types of AI systems [lim2009assessing].
Even with the availability of open-source libraries of XAI algorithms, practitioners struggle with the gaps between algorithmic output and creating consumable, human-like explanations. To close the gap requires inter-disciplinary work that studies how humans explain, and formalizes the patterns in an algorithmic form. Such a practice has already been engaged in interactive ML [amershi2014power, stumpf2007toward] and "socially guided machine learning" [thomaz2006transparency]. Prior work repeatedly pointed out that a prerequisite for explanations to be truly human-like is to be interactive [madumal2019grounded, miller2018explanation, weld2018challenge], because explanation is a grounding process where people incrementally close the belief gaps. Indeed, our study found that some user questions are closely connected with or followed by other questions. Future work could explore interaction protocols, for example through statistical modeling of how humans ask different explanation-seeking questions [madumal2019grounded], to drive the flow of interactive or conversational explanations.
Our study revealed the variability of user questions and its complex mechanisms, highlighting the challenge of identifying product-specific user needs. While prior work attempted top-down descriptions of the needs of users in different roles [arrieta2019explainable, arya2019one, hind2019explaining, samek2019towards, weller2017challenges], these may not be sufficient for design work that has to consider specific actions, usage points, models, etc. Recent HCI work on XAI encourages empirically identifying user needs with structured procedures [eiband2018bringing, wolf2019explainability]. Following this trend, we suggest several ways the XAI question bank can be used for needs specification. First, as heuristic guidance, a product team could go through the question categories to check whether each has been addressed, or which questions should be prioritized. Second, it can be used in user research to scaffold the elicitation of user needs. For example, card-sorting exercises of the questions can be performed with users (some adaptation may be required for specific AI applications). We invite practitioners to use, revise and expand the XAI question bank.
The technical barriers for designers, and practitioners in general, to navigate the space of XAI remain a primary challenge for product teams to optimize XAI user experiences. To support design work for ML, Yang suggested research opportunities to “sensitize designers to the breadth of ML capabilities” [yang2018machine]. Informants also expressed a strong desire for support of technical discussions with data scientists and stakeholders, as mitigating the friction and time cost is critical for the success of their advocacy for explainability. One opportunity for such sensitizing support is to create a concrete mapping between user questions and algorithmic capabilities, serving as a shared cognitive artifact between the designer and data scientist. One example, perhaps over-simplified, is the taxonomy of XAI methods we presented in Table 1. We may envision a question-driven design process: through user research, a design practitioner recognizes that the question How to be that needs to be answered, and from Table 1 identifies the explanation method contrastive feature as the most appropriate choice. Together with a data scientist, the team finds a suitable solution to implement from the list of suggested algorithms. By conceptually suggesting this question-driven design process, we invite the research community to develop more fine-grained frameworks of XAI features (e.g., considering UI formats and patterns) that connect user questions and technical availability.
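The question-driven design process described above can be sketched as a lookup from user questions to candidate explanation methods. The mapping below is illustrative only: it lists the pairings mentioned in the running text of this section and is not a reproduction of Table 1. The names `QUESTION_TO_METHODS` and `suggest_methods` are hypothetical.

```python
from typing import Dict, List

# Illustrative pairings drawn from the running text; not the full taxonomy.
QUESTION_TO_METHODS: Dict[str, List[str]] = {
    "How (global model)": ["ranked features", "decision trees", "rules"],
    "Why (local prediction)": ["feature importance"],
    "What if": ["feature influence"],
    "How to be that": ["contrastive feature", "example-based"],
}


def suggest_methods(user_questions: List[str]) -> Dict[str, List[str]]:
    """Map each elicited user question to candidate explanation methods.

    A question with no entry maps to an empty list, flagging a gap for the
    designer and data scientist to discuss.
    """
    return {q: QUESTION_TO_METHODS.get(q, []) for q in user_questions}


# Questions a designer might bring back from user research (hypothetical).
elicited = ["How to be that", "How will the system change"]
for question, methods in suggest_methods(elicited).items():
    print(question, "->", methods if methods else "no mapped method (gap)")
```

An empty result for a question such as "How will the system change" makes a gap between user needs and available techniques visible, which is the kind of conversation starter informants asked for.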
First of all, the collection of user questions was explored through design practitioners instead of end users, so we cannot claim this is a complete analysis of user needs for explainability. The results only reflect design practitioners’ views. Future work could study other roles involved in AI product development, such as data scientists, to better understand the challenges of creating XAI products. Our product samples focus on ones supporting high-stakes tasks, where needs for explainability might have been greater, and the current status of XAI more advanced, than for AI products used for leisure. We do not claim the completeness of the XAI methods discussed, especially as this is a fast advancing research field. Practitioners’ increasing access to XAI techniques may also change the demands and concerns expressed in the study. Finally, our informants worked for the same organization. Although this is not uncommon for studies of practitioners [amershi2019guidelines, erickson2008assistance, muller2019data] and we recruited informants from diverse product lines and locations, we acknowledge that design practices may be different in other companies or organizations.
Although the research field of XAI is experiencing exponential growth, there are few shared practices for designing user-friendly explainable AI applications. We take the position that the suitability of explanations is question dependent and requires an understanding of user questions for a specific AI application. We develop an XAI question bank to bridge the space of user needs for AI explainability and technical capabilities of XAI work. Using it as a study probe, we explored together with industry design practitioners the opportunities and challenges in putting XAI techniques into practice. We illustrated the great variability of user questions, which may be subject to many motivational, contextual and individual factors. We also identified gaps between current algorithmic solutions of XAI and what is needed to deliver satisfying user experiences, in the types of user questions to address and how they are addressed. We join many others in this field advocating a user-centered approach to XAI [abdul2018trends, doshi2017towards, miller2018explanation, wang2019designing]. Our work suggests opportunities for the HCI and AI communities, as well as industry practitioners and academics, to work together to advance the field of XAI through translational work and a shared knowledge repository that maps between user needs for explainability and XAI technical solutions.
We thank all our anonymous participants. We also thank Zahra Ashktorab, Rachel Bellamy, Amit Dhurandhar, Werner Geyer, Michael Hind, Stephanie Houde, David Millen, Michael Muller, Chenhao Tan, Richard Tomsett, Kush Varshney, Justin Weisz, Yunfeng Zhang, and anonymous CHI 2020 reviewers for their helpful feedback.