
Explaining Machine Learning Models in Natural Conversations: Towards a Conversational XAI Agent

The goal of Explainable AI (XAI) is to design methods to provide insights into the reasoning process of black-box models, such as deep neural networks, in order to explain them to humans. Social science research states that such explanations should be conversational, similar to human-to-human explanations. In this work, we show how to incorporate XAI in a conversational agent, using a standard design for the agent comprising natural language understanding and generation components. We build upon an XAI question bank which we extend by quality-controlled paraphrases to understand the user's information needs. We further systematically survey the literature for suitable explanation methods that provide the information to answer those questions, and present a comprehensive list of suggestions. Our work is the first step towards truly natural conversations about machine learning models with an explanation agent. The comprehensive list of XAI questions and the corresponding explanation methods may support other researchers in providing the necessary information to address users' demands.



I Introduction

The de facto standard in modern machine learning-based artificial intelligence (AI) systems are deep neural networks [31], which have been shown to outperform classical machine learning as well as humans on many tasks (e.g., [34]). However, there is a tradeoff between predictive performance and interpretability: while modern AI systems are highly powerful, they are also black boxes by nature, and their reasoning process is no longer interpretable [15]. This tradeoff is addressed by the young but rapidly growing research field of explainable artificial intelligence (XAI) [16].

While much progress has been made in XAI in terms of algorithmic development in recent years, this progress is not accessible to all stakeholders, in particular not to end users [1]. Miller [17] argues that human explanations are social conversations and that, in order for XAI to be truly interpretable and valuable for all stakeholders, XAI needs to be conversational as well. These conversations are not restricted to natural language; explanations could also be presented in a visual manner. The maxim is to provide explanations in a conversational manner, with interactions based on questions and answers. While the need for conversational XAI has been recognized, fulfilling it remains an open challenge, with existing solutions either limited to small, specific data sets and models [30] or still collecting requirements for conversational XAI [2]. In this work, we develop methods to leverage a standard conversational agent architecture for conversational XAI. Building upon well-established conversational agent techniques allows us to focus on XAI-specific requirements, in order to cover a broad range of questions, types of data, and types of models.

X-Agent: I recorded the profile: [Age: 39, Workclass: State-gov, Education-num: 15, Marital-status: Separated, Occupation: Other-service, Relationship: Not-in-family, Race: White, Sex: Male, Capital-gain: 0, Capital-loss: 0, Hour-per-week: 20.0, Native-country: England].
With this profile, the income will be <=50K.

User: Give me the reason for this prediction!

X-Agent: [graph of feature importances]
The above graph shows important features for this prediction. The red features increase the income, while the blue features decrease it.

User: Why is this profile predicted <=50K instead of >50K?

X-Agent: There are multiple reasons for this result:
Age is too low and Workclass is not suitable.
If Age increases to at least 66.3 and Workclass changes to Self-emp-inc, this person’s predicted income will be more than 50K.

User: That’s hard, how could I change only Occupation to get >50K prediction?

X-Agent: The Occupation should be changed to Tech-support to get an income of more than 50K.

Fig. 1: Conversation example to explain a Random Forest classifier on the Adult dataset. Given a user XAI question, the prototype: i) preprocesses and matches the question to one of the reference questions in the phrase bank, ii) selects the XAI method to answer the reference question, and iii) generates a textual explanation based on the information from the selected XAI method.

Our target group is users without machine learning knowledge who interact with the system. Following the taxonomy by Tomsett et al. [33], they could be executors or operators, e.g., doctors making decisions based on the system’s advice, or lenders using systems to assess an applicant’s profile. Fig. 1 shows an example of a conversation between our prototype agent and a user asking to explain a prediction of a Random Forest model on the Adult dataset.

Our approach (see Fig. 2 for an overview, full details in Section III) is a systematic approach to enable XAI in a standard conversational agent architecture [29]. First, we construct a question phrase bank data set based on the question bank of Liao et al. [1] to strengthen the agent’s ability to understand XAI questions. Second, based on a systematic literature survey of explanation methods, we establish a mapping between user intents, represented as reference questions in the question phrase bank, and XAI methods to answer each reference question in the bank. Finally, a template-based natural language generation component generates the answers using the information from the selected XAI methods.

In detail, Liao et al. [1] collected a bank of 73 questions stakeholders may ask about an AI system and its reasoning process. However, this set contains only the core questions, without any variants or different phrasings of the questions, which limits a conversational agent’s understanding. In order to account for higher diversity in questions, we augment the initial bank of prototypical questions by paraphrases. Furthermore, Liao et al. [1] suggest explanation methods per question group, but the exact suitable method for each specific question in the group remains unclear. Therefore, we conduct a systematic literature survey of explanation methods that provide the necessary information to answer each question and compile it into a comprehensive list of suggestions. This list serves as an internal mapping of XAI questions to corresponding answers in the dialog policy component of the conversational agent. Based on the question phrase bank and the internal mapping, we present a framework and implementation to incorporate XAI in a standard conversational agent that can provide answers to this set of questions and beyond. The agent’s output answer is generated by a natural language generation component, combining and enriching the explanation method’s output with additional information, such as feature and class names derived from user questions in the input. Leveraging a standard conversational agent architecture gives full access to the tool suite of conversational agents. Our components are extensible, allowing them to accommodate further interactions and additional user requirements. Specifically, our contributions are:

  • We incorporate XAI in a standard conversational agent framework and present a prototype that can communicate about the internals of the machine learning model in natural language.

  • We create a publicly available XAI question phrase bank for training the natural language understanding component of the conversational agent based on the XAI question bank of Liao et al. [1].

  • We present a systematic overview of methods to answer the information need implied in these questions and categorise questions by identifying which subsets require an XAI method for answering.

Reviewing related work in the next section, we note that requirements for conversational XAI have been collected, while means of addressing them are rare and incomplete. In Section III, we describe the overall approach to incorporate XAI in each component of a standard conversational agent architecture to address these requirements. In Section IV, we detail our XAI-specific components and describe our augmentation of the XAI question bank [1] to an XAI question phrase bank; Section V then provides a comprehensive list of XAI questions and corresponding explanation methods. We detail how we generate answers, i.e., our implementation of the natural language generation component, in Section VI. Finally, in Section VII, we illustrate the broad coverage of our approach by two conversation scenarios, comprising tabular and image data and a Random Forest alongside a convolutional neural network.

Fig. 2: Overview of our approach to incorporate XAI in conversational agents’ components: 1) Question-Phrase-Generation (QPG) uses a paraphrase generation model on the questions from the XAI question bank [1]. The generated candidates are scored by multiple annotators and ranked, resulting in the XAI question phrase bank. 2) In the Natural Language Understanding (NLU) component, the reference question for a user question is retrieved from the phrase bank. 3) The intent of the reference question defines the XAI method to be applied to the model in the Question-XAI method mapping component (QX). 4) A natural language generation (NLG) component converts the output of the XAI method (e.g., a table, graph, or number) into an answer in natural language. Omitted for a better overview: data sets are loaded and machine learning models are trained dynamically on user requests.

II Related Work

In the following, we discuss related work in explainable AI (XAI), conversational agents, and works that combine both, i.e., conversational agents for explaining machine learning models.

II-A Explainable AI

XAI is a highly active research field from the algorithmic perspective. While the four key XAI surveys [15, 16, 3, 4] focus on different perspectives, a common pattern in their taxonomies is the distinction between intrinsically interpretable models, i.e., the model itself constitutes the explanation, and posthoc explanations, i.e., methods to explain black-box models. Posthoc explanations can be further classified as model-agnostic, i.e., applicable to any machine learning model, or model-specific, i.e., applicable only to certain types of models. Intrinsically interpretable models are model-specific by nature. We focus on model-agnostic explanations, in order to achieve broad coverage. We derive posthoc explanation methods, which are suitable to answer specific XAI questions from the aforementioned surveys, and additionally from the suggested methods for each group of XAI questions in [1, 6].

II-B Conversational Agents

With the success of deep learning, research in conversational agents has gradually shifted in recent years to end-to-end approaches based on fully data-driven systems [35]. However, an abundance of data is a prerequisite of such systems, and one that XAI, as a recent field, fails to meet. Contrary to end-to-end learning, the core elements of the Genial Understander System (GUS) [29] constitute a standard conversational agent architecture composed of multiple, individual components, which require only a tiny fraction of the amount of data needed for end-to-end systems. GUS is quite old, “but has been astonishingly long-lived and underlies most if not all modern commercial digital assistants” [22]. Therefore, our approach focuses on enabling XAI in this framework. Although GUS needs less data, it still cannot do without data. Public data sets for conversational agents are available only for a few domains [7] and, to our knowledge, not for XAI. Hence, we create a question phrase corpus to fill this gap.

II-C Conversational XAI

Research on Conversational XAI is still in its infancy, and agents are strongly limited in scope. Werner [2] presents a work-in-progress prototype of a rule-based XAI chatbot to iteratively elaborate requirements, accompanied by findings from literature and user studies. Their prototype is limited to the classification of tabular data and a small set of pre-defined questions. Kuźba’s [30] goal is to collect the needs of users interacting with a conversational XAI agent, i.e., questions a human would ask. His prototype is limited to a Random Forest, applied to the Titanic Dataset, and explanations addressing a few types of XAI questions, since the primary goal is the collection of interaction data. Contrary to the data-driven approach of Kuźba, Liao et al. [1] construct an XAI question bank from literature review, expert reviews by XAI practitioners, and around 1000 minutes of interviews with 20 user experience and design practitioners working in multiple domains. This question bank contains 73 XAI questions in 10 categories (see Table III, columns 2 and 3) and serves as the basis for our approach. While the previous work collects users’ needs and questions to (conversational) XAI, our goal is to answer these questions.

III System Overview

A standard conversational agent architecture [29] is generally composed of the following components: i) Natural Language Understanding (NLU) to, e.g., determine a user’s intent(s), ii) a dialogue state tracker (DST) to maintain the current state and history of the dialogue, iii) a dialogue policy, deciding the system’s next step, and iv) Natural Language Generation (NLG) to generate the system’s output. We integrate XAI into this architecture and assume that an XAI question contains all relevant information to select a proper explanation method as a response (the extension to incomplete information is subject to future work). Hence, we omit the DST. This section outlines our approach to incorporate XAI into the remaining components.

The general architecture of our XAI answering conversational agent is depicted in Fig. 2. To integrate XAI into the NLU component, we extend the XAI question bank [1] to an XAI question phrase bank. This extended question phrase bank constitutes a training data set for the NLU component in the conversational agent. We construct the extended question phrase bank from an initial seed of XAI questions, paraphrase generation, and scoring. We describe the construction (QPG) in full detail in Section IV-A. Upon a user question, the NLU component retrieves the corresponding reference question from the question phrase bank (see Section IV). The intent of the reference question defines the XAI method to be applied to the model in the Question-XAI method mapping, which is our integration of XAI into the dialogue policy component (QX, see Section V). The natural language generation (NLG) component enriches the output of the XAI method, e.g., SHAP, with natural language to form the final answer, e.g., by explaining SHAP’s output graph (see Section VI).
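The flow through the three remaining components can be sketched as chained functions. This is a minimal illustration under stated assumptions, not the implementation described in this paper: the function names, the toy phrase bank, the stubbed explainer, and the answer template are all hypothetical.

```python
# Hypothetical sketch of the NLU -> QX -> NLG flow; all names and data are illustrative.

# Toy phrase bank: known phrasing -> reference question label (NLU lookup table).
PHRASE_BANK = {
    "why is this instance given this prediction?": "why",
    "give me the reason for this prediction!": "why",
}

def understand(user_question):
    """NLU: retrieve the reference question for a user question (exact match here)."""
    return PHRASE_BANK.get(user_question.strip().lower())

def fake_feature_importance(instance):
    """Stand-in for an attribution method such as SHAP; returns made-up scores."""
    return {name: 0.0 for name in instance}

# QX: reference question -> explanation method.
QX_MAPPING = {"why": fake_feature_importance}

def generate(xai_output):
    """NLG: wrap the raw XAI output in a natural-language template."""
    features = ", ".join(sorted(xai_output))
    return f"The features considered for this prediction are: {features}."

def answer(user_question, instance):
    reference = understand(user_question)
    if reference is None:
        return "Could you rephrase your question?"
    xai_output = QX_MAPPING[reference](instance)
    return generate(xai_output)

print(answer("Give me the reason for this prediction!",
             {"Age": 39, "Occupation": "Other-service"}))
```

Keeping the three steps as separate functions mirrors the component-wise architecture: each one can be replaced (e.g., the exact-match lookup by a trained classifier) without touching the others.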

IV Understanding Questions

To enable the conversational agent’s understanding of a broad set of XAI questions, we first generate a data set of question paraphrases based on the question bank by Liao et al. [1]. Then, we use this generated question phrase bank in the NLU component to match user intents to reference questions. In this section, we first describe the construction of the question phrase bank, then describe the reference question retrieval and finally report experimental results on the performance of two approaches to implement the NLU component.

IV-A Question Phrase Bank (QPG)

Due to the lack of XAI data for the conversational agent, we introduce an XAI question phrase bank as a data set to train and evaluate the NLU component. The phrase bank represents possible user questions to XAI systems. This phrase bank is publicly available.

To generate our phrase bank with XAI questions, we build on the XAI question bank by Liao et al. [1] and extend their 73 original questions (see Table III for examples) by paraphrases. Our goal is to capture more syntactic variance in the phrase bank while retaining semantics. To generate paraphrases for each question, we use a paraphrase generation model, have multiple annotators manually score the paraphrases for similarity, and filter the phrases by these scores (see Fig. 2, QPG component, top right).

IV-A1 Paraphrase Candidate Generation

We use GPT-3 [19] (via the OpenAI API) and few-shot learning to generate paraphrase candidates of XAI questions. To finetune the model, we provide two example questions, each with two paraphrases, to the GPT-3 paraphrase model for each question in the XAI question bank. Then, we prompt the model for paraphrases with a new question (see Fig. 3). During the paraphrase generation process, we ignore generated text that obviously does not constitute a paraphrased question (e.g., answers to questions, incomplete sentences). We generate at least 2 paraphrases per question, 4.2 on average, 20 at maximum, and 310 pairs of (question, potential paraphrase) in total.

Step 1, Finetuning:

Question: Why is this instance given the prediction?

Paraphrase1: Give me the reason for this result.

Paraphrase2: What is the reason for this prediction?

Question: What features does the system consider?

Paraphrase1: What attributes are used?

Paraphrase2: What features does the model use?

Step 2, Prompting:

Question: What is the sample size?

GPT-3 output:

Paraphrase1: How many did they sample?

Paraphrase2: How many items are considered in this result?

Fig. 3: Example GPT-3 finetuning, prompt and output to generate XAI paraphrase candidates
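The few-shot setup in Fig. 3 amounts to concatenating the example questions with their paraphrases and appending the new question. The sketch below only assembles such a prompt string; the exact prompt layout sent to GPT-3 is an assumption modeled on Fig. 3, the helper name `build_prompt` is hypothetical, and no API call is made.

```python
# Illustrative few-shot prompt builder; the layout mirrors Fig. 3, but the exact
# format used with GPT-3 is an assumption.

EXAMPLES = [
    ("Why is this instance given the prediction?",
     ["Give me the reason for this result.",
      "What is the reason for this prediction?"]),
    ("What features does the system consider?",
     ["What attributes are used?",
      "What features does the model use?"]),
]

def build_prompt(examples, new_question):
    """Concatenate example (question, paraphrases) blocks and the new question."""
    lines = []
    for question, paraphrases in examples:
        lines.append(f"Question: {question}")
        for i, paraphrase in enumerate(paraphrases, start=1):
            lines.append(f"Paraphrase{i}: {paraphrase}")
    # The model is expected to continue the text after the last line.
    lines.append(f"Question: {new_question}")
    lines.append("Paraphrase1:")
    return "\n".join(lines)

prompt = build_prompt(EXAMPLES, "What is the sample size?")
print(prompt)
```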

IV-A2 Candidate Filtering

To assess the quality of the generated paraphrases, we annotate all generated pairs manually by human-perceived similarity. We extend our data set of generated paraphrases by 59 negative pairs, i.e., we sample paraphrases from different, non-matching questions at random. Thus, our data set for annotation comprises 369 phrase pairs. Annotators were first introduced to the task, shown a simple example, and then asked to assess the similarity of phrase pairs on a 6-point Likert scale (1: very different, 6: very similar) [27]. We chose a scale with an even number of items to force respondents to select between either similar or different, because a neutral element “halfway similar and halfway different” is neither meaningful nor can it be assessed. Seven participants (5 males, 2 females) took part in the annotation process. All annotators have a background in machine learning with a Master’s degree or a Ph.D. in computer science. We randomly assigned participants to one of 3 groups; one participant was assigned to all groups. Each group annotated approx. 125 phrase pairs, and each pair has at least 2 annotations.

Fig. 4 depicts the average annotator score per phrase pair. Phrase pairs are ranked by their score, separately for the 310 paraphrase pairs and 59 negative pairs. Most of the paraphrase pairs that were generated by GPT-3 have a high score and thus are perceived as being similar, indicating that GPT-3 generates high-quality paraphrases in general. Most negative pairs, which were sampled from different questions, have a low average score, supporting the quality of the human annotations.

Fig. 4: Average human annotation score for all phrase pairs ranked by score. Negative pairs are paraphrases from different questions.

For our final XAI question phrase bank, we select all pairs of paraphrases with an average annotation score of at least 4 (a Likert score of 4 means more similar than different). Each paraphrase is linked to its reference question, and the reference question is a paraphrase of itself. The task of the NLU component is then to identify the reference question for a user question, based on the known paraphrases of the reference question.
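The filtering step can be sketched as averaging the annotator scores per pair and keeping pairs at or above a cutoff. The annotation data below is invented for illustration, and the cutoff of 4 is an assumption based on the remark that a Likert score of 4 means more similar than different.

```python
from statistics import mean

# Hypothetical annotations: (reference question, candidate paraphrase) -> Likert scores (1-6).
annotations = {
    ("What is the sample size?", "How many did they sample?"): [5, 6],
    ("What is the sample size?", "What kind of algorithm is used?"): [1, 2],
    ("What rules does it use?", "Which rules are applied?"): [4, 4, 5],
}

THRESHOLD = 4  # assumed cutoff: 4 means "more similar than different"

def build_phrase_bank(annotations, threshold=THRESHOLD):
    """Map each reference question to itself plus all paraphrases that pass the cutoff."""
    # Every reference question counts as a paraphrase of itself.
    bank = {reference: [reference] for reference, _ in annotations}
    for (reference, paraphrase), scores in annotations.items():
        if mean(scores) >= threshold:
            bank[reference].append(paraphrase)
    return bank

phrase_bank = build_phrase_bank(annotations)
for reference, phrases in phrase_bank.items():
    print(reference, "->", phrases)
```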

IV-B Reference Question Retrieval (NLU)

We preprocess a given user question to a standard format. We use the placeholder feature to substitute all feature names from the data set. Similarly, labels (classes) in user questions are replaced by the placeholder class. For example, on the Adult dataset [13] (see Fig. 1), the question “How could I change only Occupation to get >50K?” is transformed to “How could I change only feature to get class?”, where Occupation is a feature and >50K is a class in the Adult dataset. We assume that feature and class names in user questions match those in the data set (i.e., no typos or synonyms).
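A minimal sketch of this placeholder substitution, assuming exact, case-sensitive name matches as stated above. The function name `normalize_question` and the use of regular expressions are illustrative choices, not taken from the paper's implementation.

```python
import re

def normalize_question(question, feature_names, class_names):
    """Replace data set feature and class names by the placeholders 'feature' and 'class'."""
    def substitute(text, names, placeholder):
        # Longest names first, so a name is not shadowed by one of its substrings.
        for name in sorted(names, key=len, reverse=True):
            # Match the name only when it is not embedded in a longer word.
            pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
            text = re.sub(pattern, placeholder, text)
        return text

    question = substitute(question, class_names, "class")
    return substitute(question, feature_names, "feature")

features = ["Age", "Workclass", "Occupation"]
classes = ["<=50K", ">50K"]
print(normalize_question("How could I change only Occupation to get >50K?", features, classes))
```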

We formulate the matching of a user question to a reference question as a multi-class classification problem with class labels corresponding to reference questions in the XAI question phrase bank (see Section IV-A). First, we generate sentence embeddings of the preprocessed user and reference questions with SimCSE [20] and RoBERTa-large [11]. We then train a feedforward network with one hidden layer and ReLU activation on the sentence embeddings to classify user questions into one of the reference questions in the question phrase bank. From the classifier’s output, we select the reference question with the highest probability; this reference question reflects the intent of the original user question. If the probability is lower than a predefined threshold (no match), we consider the question a (yet) unknown variation (paraphrase) of a reference question, save it for later, and ask the user for an alternative phrasing of the question. We set this threshold to a fixed value in our experiments. We provide an evaluation of this matching approach and a comparison to other approaches in the following.
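The selection rule with a rejection threshold can be sketched as follows. The classifier itself is replaced by hand-written logits, and the threshold value 0.5 in the usage lines is a hypothetical choice for illustration only; the function names are likewise assumptions.

```python
import math

def softmax(logits):
    """Convert raw classifier scores into probabilities."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def select_reference(logits, reference_questions, threshold):
    """Return the most probable reference question, or None if below the threshold."""
    probabilities = softmax(logits)
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    if probabilities[best] < threshold:
        return None  # no match: store the question and ask the user to rephrase
    return reference_questions[best]

references = ["Why is this instance given this prediction?",
              "What is the sample size?",
              "What rules does it use?"]
print(select_reference([5.0, 0.0, 0.0], references, threshold=0.5))  # confident match
print(select_reference([0.1, 0.0, 0.0], references, threshold=0.5))  # rejected
```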

IV-C Evaluation of NLU Component

We compare two approaches – sentence embedding methods and traditional text classification – to implement the NLU component.

IV-C1 Data

We use the XAI question phrase bank as described in Section IV-A, i.e., the data set with reference questions and paraphrases after manual quality control. We use the question ID as a label and assign a common label to sets of questions with identical answers in the same category. For instance, question 2, “How does feature impact its predictions?”, and question 5, “Is feature used or not used for the predictions?”, both ask for the (binary) feature contribution of one specific feature and can be answered with the feature importance information for feature f from an XAI method. After relabeling, the final data set contains 329 instances and 52 different labels (from 73 initial questions). Additionally, we also evaluate our models on the subset of XAI questions only, i.e., all questions whose rows are highlighted in gray in Table III. This XAI-only set contains 111 instances and 14 labels.
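The relabeling can be pictured as a small merge table applied to question IDs. Only the pair (2, 5) is taken from the example above; the table name, the helper, and the choice of which ID represents the merged group are assumptions.

```python
# Hypothetical merge table: question IDs whose answers are identical share one label.
MERGED = {5: 2}  # question 5 is answered like question 2 (feature contribution)

def label_for(question_id):
    """Return the training label for a question ID after merging."""
    return MERGED.get(question_id, question_id)

instances = [
    (2, "How does feature impact its predictions?"),
    (5, "Is feature used or not used for the predictions?"),
    (27, "What is the sample size?"),
]
labels = [label_for(question_id) for question_id, _ in instances]
print(labels)
```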

IV-C2 Preprocessing

We test two different feature representation methods: classical TF-IDF weighting and sentence embeddings. For TF-IDF weighting, we follow a standard preprocessing pipeline: we select tokens of 2 or more alphanumeric characters (punctuation is ignored and always treated as a token separator) and stem the text using the Porter stemmer [28] to obtain our token dictionary. Maximum and minimum DF thresholds are subject to hyperparameter optimization (see the full list of hyperparameters in Table I). We embed sentences (i.e., question instances) using SimCSE [20] to obtain an alternative feature representation to TF-IDF. We employ the pretrained RoBERTa-large model [32] as the base model in SimCSE.
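The TF-IDF side of this pipeline can be sketched without external dependencies. This is a simplified illustration, not the pipeline used in the experiments: stemming is omitted, lowercasing is an added assumption, `min_df` is treated as an absolute count and `max_df` as a proportion, and the weighting tf · log(N/df) is only one common TF-IDF variant.

```python
import math
import re
from collections import Counter

TOKEN = re.compile(r"[A-Za-z0-9]{2,}")  # tokens of 2 or more alphanumeric characters

def tokenize(text):
    # Stemming (the Porter stemmer in the text) is omitted to stay dependency-free;
    # lowercasing is an assumption of this sketch.
    return [token.lower() for token in TOKEN.findall(text)]

def tfidf(documents, min_df=1, max_df=1.0):
    """Return (vocabulary, vectors) using tf * log(N / df) weights."""
    tokenized = [tokenize(document) for document in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each token.
    df = Counter(token for tokens in tokenized for token in set(tokens))
    vocabulary = sorted(t for t, c in df.items()
                        if c >= min_df and c / n_docs <= max_df)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocabulary])
    return vocabulary, vectors

vocabulary, vectors = tfidf(["What is the sample size?",
                             "What is the source of the data?"])
print(vocabulary)
```

Tokens that occur in every document (here "what", "is", "the") receive weight zero under this variant, which is why the maximum DF threshold matters little for such terms.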

Model | Hyperparameters
TF-IDF | max_df=[1.0, 0.8]; min_df=[0.1, 0.2, 1]
SVM | kernel=[’linear’, ’poly’, ’rbf’, ’sigmoid’, ’precomputed’]; C=[0.1, 1, 10, 100, 1000]; gamma=[0.1, 1, 10, 100]
RF | bootstrap=[True, False]; max_depth=[10, None]; min_samples_leaf=[1, 2, 4]; min_samples_split=[2, 5, 10]; n_estimators=[10, 50, 100]
NN | Epoch=50; batch_size=1; learning_rate=6
TABLE I: Hyperparameters for grid search; bold indicates the chosen hyperparameters.

IV-C3 Models

We compare four different approaches. On the TF-IDF vector space, we evaluate two classifiers commonly applied to text classification: Support Vector Machines (SVMs) with various kernels and Random Forests (RF). On the sentence embedding space (SimCSE), we compare a similarity-based approach to a supervised model. In the similarity-based approach, we rank reference questions by their cosine similarity to a user question. As a supervised model, we use a fully connected feedforward neural network (NN) with a single hidden layer of size 256, trained with cross-entropy loss. We employ grid search for hyperparameter tuning, with details available in Table I.
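The similarity-based approach reduces to ranking by cosine similarity. In the sketch below, the two-dimensional vectors are made-up stand-ins for SimCSE sentence embeddings, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if one is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_references(user_embedding, reference_embeddings):
    """Return reference questions sorted by decreasing cosine similarity."""
    scored = [(cosine(user_embedding, embedding), reference)
              for reference, embedding in reference_embeddings.items()]
    return [reference for _, reference in sorted(scored, reverse=True)]

# Toy 2-d "embeddings"; real sentence embeddings have far more dimensions.
reference_embeddings = {"q1": [1.0, 0.0], "q2": [0.0, 1.0], "q3": [1.0, 1.0]}
ranking = rank_references([0.9, 0.1], reference_embeddings)
print(ranking)
```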


IV-C4 Metrics

We use 3-fold cross-validation and present mean and standard deviation of micro- and macro-averaged F1 scores. For multi-class classification, micro-averaged F1 score is equal to accuracy.
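The equality of micro-averaged F1 and accuracy in the single-label multi-class case follows because every misclassification counts once as a false positive (for the predicted class) and once as a false negative (for the true class). The toy labels below are invented to check this numerically; the helper name is an assumption.

```python
from collections import Counter

def micro_macro_f1(y_true, y_pred):
    """Compute micro- and macro-averaged F1 for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # wrongly predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denominator = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denominator if denominator else 0.0

    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro

y_true = ["a", "a", "b", "c"]
y_pred = ["a", "b", "b", "c"]
micro, macro = micro_macro_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(micro, accuracy, macro)
```

Macro-averaged F1 weights every label equally and therefore differs from accuracy when errors are unevenly distributed across labels, which is why both scores are reported.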

IV-C5 Results

Table II shows the evaluation results. For reference, the optimal hyperparameters are marked in bold in Table I. The supervised approach based on sentence embeddings (SimCSE + NN) outperforms the other approaches on both the full data set and the XAI subset, with an accuracy of 0.85 on the XAI subset. Both traditional text classification approaches (RF + TF-IDF and SVM + TF-IDF) are already outperformed by the unsupervised similarity-based approach on sentence embeddings (SimCSE + Cosine), and even more so by the supervised approach on sentence embeddings (SimCSE + NN), highlighting the effectiveness of pre-trained models for natural language processing tasks.

Model | All Questions: Accuracy | All Questions: Macro F1 | XAI only: Accuracy | XAI only: Macro F1
SimCSE + Cosine | 0.72 ± 0.03 | | |
SimCSE + NN | 0.72 ± 0.03 | 0.67 ± 0.03 | 0.85 ± 0.08 | 0.83 ± 0.09
TABLE II: Evaluation results for 3-fold cross-validation, showing mean ± standard deviation on the full data set and the subset of XAI questions.

In summary, given a user question, we substitute all feature names and class names in the question with placeholders. Then, we use a pretrained model to match the preprocessed question to one of the reference questions in the question phrase bank, which we created using GPT-3 and human annotation. In the next section, we will describe how to answer each question.

V Retrieving Answer Information

After matching user input to its corresponding reference question, we need to obtain the relevant information to provide an answer. Previous work [1, 6] suggested some XAI methods as responses for question groups, but it remains unclear how to choose the exact method for each specific question, i.e. how to design a simplified dialog policy for the QX component in a conversational agent (see Fig. 2).

We analyzed all 73 reference questions in the XAI question phrase bank for their implied information need and identified methods to retrieve this information. Table III presents an overview of all questions. Specifically, we identified the questions that require an XAI approach for extracting the answer information (highlighted rows). Following the general definition of XAI by Arrieta et al. [16], we define an XAI method as a method that produces details or reasons to make the AI’s functioning clear or easy to understand. That is, an XAI method must have access to a model’s internal reasoning or to a proxy that reveals this reasoning at least to some extent. For instance, the question “Why is this instance predicted P instead of Q?” requires a counterfactual explanation, identifying feature sets that, if changed, would change the model’s prediction from P to the specified counterfactual class Q. On the other hand, the question “What would the system predict if features [..] of this instance changed to f’?”, with f’ given as specific feature values, only requires creating a new test instance with the specified feature values and applying the trained (black-box) model to this new instance.
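The what-if case needs no XAI method at all, which a short sketch makes concrete: copy the instance, overwrite the given features, and call the model. The `toy_predict` rule below is a purely illustrative stand-in for a trained black-box model and does not reflect the Random Forest used in the prototype; the function names are assumptions.

```python
def what_if(predict, instance, changes):
    """Apply the trained model to a copy of the instance with some features replaced."""
    modified = {**instance, **changes}  # the original instance is left untouched
    return predict(modified)

def toy_predict(instance):
    """Toy stand-in for a black-box model (made-up decision rule)."""
    return ">50K" if instance["Hours-per-week"] >= 40 else "<=50K"

instance = {"Age": 39, "Hours-per-week": 20}
print(toy_predict(instance))                               # prediction for the original instance
print(what_if(toy_predict, instance, {"Hours-per-week": 45}))  # prediction after the change
```

A counterfactual question, by contrast, asks for the `changes` themselves, which requires searching the feature space and is therefore delegated to a dedicated method.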

ID Category Reference Question #Phr. Method Option
1 How How are the parameters set? 6 Model Generation ModelCards [9]
2 How How does feature f impact its predictions? 4 SHAP [14], (LIME [21])
3 How How does it weigh different features? 7 SHAP [14], (LIME [21])
4 How How does the system make predictions? 4 n.a. ProtoTree [24], ProtoPNet [25], ModelExtraction [23]
5 How Is feature X used or not used for the predictions? 4 SHAP [14], (LIME [21])
6 How What are the top features it uses? 4 SHAP [14], (LIME [21])
7 How What are the top rules it uses? 4 n.a.
8 How What features does the system consider? SHAP [14], (LIME [21])
9 How What is the system’s overall logic? 4 n.a. ProtoTree [24], ProtoPNet [25], ModelExtraction [23]
10 How What kind of algorithm is used? 4 Model Generation ModelCards [9]
11 How What rules does it use? 4 n.a. ProtoTree [24]
12, 13 How to be How should this instance/feature change to get a different prediction? 7 DICE [5], CFProto [18] for 12

14 How to be What kind of instance gets a different prediction? 4 DICE [5], CFProto [18]
15, 16 How to still What are the necessary features present/ absent to guarantee this prediction? 6 n.a. SHAP [14], (LIME [21])
17, 18 How to still What is the highest/lowest feature one can have to still get the same prediction? 12 Anchors [8]
19 How to still What is the scope of change permitted to still get the same prediction? 4 Anchors [8]
20 How to still What kind of instance gets this prediction? 4 Anchors [8]
21 Input How much data like this is the system trained on? 4 Model Generation ModelCards [9], DataSheets [10]
22, 23 Input How were the ground-truth/labels produced? 8 Data Generation DataSheets [10]
24, 25 Input What are the biases/limitations of the data? 9 Data Generation DataSheets [10]
26 Input What data is the system not using? 5 Model Generation ModelCards [9]
27 Input What is the sample size? 3 Model Generation ModelCards [9], DataSheets [10]
28 Input What is the source of the data? 3 Data Generation ModelCards [9], DataSheets [10]
29 Input What kind of data does the system learn from? 5 Model Generation ModelCards [9], DataSheets [10]
30 Output How can I best utilize the output of the system? 4 Model Generation ModelCards [9], DataSheets [10]
31 Output How is the output used for other system component(s)? 4 System Context
32 Output What does the system output mean? 4 Data/Model Generation ModelCards [9], DataSheets [10]
33 Output What is the scope of the system’s capability? Can it do [A]? 4 Model Generation ModelCards [9]
34 Output What kind of output does the system give? 3 Data/Model Generation ModelCards [9], DataSheets [10]
35–37 Performance How accurate/precise/reliable are the predictions? 12 Model Generation ModelCards [9]
38 Performance How often does the system make mistakes? 4 Model Generation ModelCards [9]
39, 40 Performance In what situations is the system likely to be correct/incorrect? 8 Model Generation ModelCards [9]
41 Performance Is the system’s performance good enough for [A]? 2 System Context
42 Performance What are the limitations of the system? 2 Model Generation ModelCards [9]
43 Performance What kind of mistakes is the system likely to make? 5 Model Generation ModelCards [9]
44 What if What would the system predict for [a different instance]? 2 Prediction
45, 46 What if What would the system predict if feature(s) f of this instance change(s) to f’? 8 Prediction
47–48 Why Why/How is this instance given this prediction? 20 SHAP [14], (LIME [21])
49 Why What features of this instance lead to the system’s prediction? 15 SHAP [14], (LIME [21])
50 Why Why are instance A and instance B given the same prediction? 4 SHAP [14], (LIME [21])
51 Why not How is this instance not predicted A? 4 DICE [5], CFProto [18]
52 Why not Why are instances A and B given different predictions? 8 DICE [5], CFProto [18]
53 Why not Why is this instance predicted P instead of Q? 11 DICE [5], CFProto [18]
54 Others How to improve the system? 4 External Knowledge
55 Others What are the results of other people using the system? 4 External Validation
56 Others What does [ML terminology] mean? 2 External Knowledge
57–67 Others How/What/Why will the system adapt/change/drift/improve over time? 35 External Validation
68–70 Others Why NOT using this data/feature/rule? 8 n.a.
71–73 Others Why using this data/feature/rule? 14 n.a.
TABLE III: XAI questions from the original question bank [1]. Showing reference question, the number of paraphrases in our phrase bank (#Phr.), whether the question requires an XAI method for the answer (highlighted rows), and (optional) means of providing this information (see Section V-B for highlighted rows, Section V-A for others). “n.a.”: no method matches the selection criteria. Alternative means in the option column are not always available, limited to certain types of data/models or provide only partial information.

In the following, we first discuss the information needed for the 50 questions that do not require an XAI method, and outline how the information for the answer can be obtained. Second, we discuss the 23 XAI questions (highlighted rows in Table III) and present our criteria for analysing existing XAI methods and our final mapping from reference question to XAI-method for extracting the answer information.

V-A Non-XAI-Specific Questions

Analysing the 50 questions in this category, we identified six subcategories w.r.t. information need. The subcategories are indicated in column “Method” in Table III.

V-A1 Data Generation

Questions in this subcategory require information about the data or the data generation process and can either be directly answered by querying data set statistics or accessing an accompanying data sheet for the data set [10] if available. Examples of such questions are What is the sample size? or How were the ground-truth labels produced?.

V-A2 Model Generation

This subcategory of questions can be answered by retrieving easily accessible information about the underlying machine learning model, such as hyperparameters set during training or evaluation results. If the model is equipped with a ModelCard [9], the information can be obtained from the latter. Example questions for this category are How often does the system make mistakes? and What kind of algorithm is used? ModelCards also contain information about biases, scope, and limitations of the machine learning model, and thus provide the information for other questions, such as What is the scope of the system’s capability?.

V-A3 Prediction

Questions in this subcategory, such as What would the system predict for [a different instance]?, can simply be answered by applying the model on a newly generated test instance.
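Such a what-if query amounts to copying the instance, overwriting the features in question, and calling the model again. The following is a minimal sketch under stated assumptions: `toy_predict` is a hypothetical stand-in for the trained classifier, and its decision rule and the feature names are invented purely for illustration.

```python
from typing import Any, Callable, Dict

Instance = Dict[str, Any]


def toy_predict(instance: Instance) -> str:
    """Hypothetical stand-in for a trained classifier (invented rule)."""
    if instance["age"] >= 40 and instance["hours_per_week"] >= 40:
        return ">50K"
    return "<=50K"


def what_if(predict: Callable[[Instance], str],
            instance: Instance,
            changes: Instance) -> str:
    """Predict for a copy of `instance` with the features in `changes` overwritten."""
    unknown = set(changes) - set(instance)
    if unknown:
        raise KeyError(f"unknown feature(s): {sorted(unknown)}")
    modified = {**instance, **changes}  # the original instance stays untouched
    return predict(modified)


user = {"age": 39, "hours_per_week": 45}
original = toy_predict(user)                       # prediction for the user as given
changed = what_if(toy_predict, user, {"age": 41})  # "what if age were 41?"
```

Copying the instance rather than mutating it keeps the user's original profile available for follow-up questions in the same conversation.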

V-A4 External Knowledge

This kind of question requires additional information from either humans or an additional external knowledge base and an information retrieval approach to access the information. For example, the question How to improve the system? requires domain knowledge on model optimization, a judgment of model performance in comparison with similar approaches, and, for instance, an estimate of whether additional training data are likely to improve performance. For the question What does [ML terminology] mean? we envision a knowledge base or lexicon with Machine Learning terms that can be searched for the information.

V-A5 External Validation

Questions in this subcategory, such as How will the system improve over time?, require additional evaluation during the system’s lifetime and/or information from similar systems. The simplest and most easily accessible information would be a binary indicator of whether the system is capable of online learning at all, but more details require elaborate experiments.

V-A6 System Context

This group of questions asks for information about the integration of the machine learning model with other system components and their embedding in the application. E.g., the question How is the output used for other system component(s)? depends on the actual system deployment, and Is the system good enough for [A]? requires knowledge about the application context and its requirements.

V-B XAI Questions

To identify suitable candidates able to provide the information for answering the remaining XAI questions (highlighted rows in Table III), we conduct a literature survey. Our sources are the four key XAI surveys [15, 16, 3, 4], the methods referred to in the initial XAI question bank, and the follow-up work by the same authors [1, 6]. To further filter the candidate set for XAI methods to incorporate in conversational agents, we defined the following four criteria for our analysis, the latter three based on the categorization scheme of a recent survey [12], which identified 312 original papers presenting a novel XAI technique at one of the major ML/AI conferences between 2014 and 2020:

  • Source Code: The paper should be accompanied by easy-to-use, publicly accessible source code in order to integrate the method into the conversational agent in a reasonable amount of time. This criterion proved to reduce the number of methods significantly.

  • Model: We assessed to which types of models the XAI method is applicable, and favored approaches that are model agnostic [16] or at least can be applied to multiple types of models.

  • Data: We assessed which type of input data the XAI method is applicable for. We want to keep our XAI method library as small as possible in order to enable efficient implementation of components that process the XAI method’s raw output to generate a user-friendly natural language answer, possibly accompanied by visualizations (see Section VI). We, therefore, favor methods that are applicable to multiple types of input data.

  • Problem: Explanation methods may be restricted to particular machine learning tasks, e.g., regression. We account for this restriction by including the category “type of problem” in our analysis. In this paper, we focus on explaining models for supervised machine learning, more specifically classification tasks.

Table IV shows an overview of the final selected methods and their corresponding questions from Table III for which they provide the necessary information. We briefly describe the selected methods in the following.

Method Question ID from Table III
SHAP [14], LIME [21] 2, 3, 5, 6, 8, 47–50
DICE [5] 12–14, 51–53 (Tabular data)
CFProto [18] 12, 14, 51–53 (Image data)
Anchor [8] 17–20
TABLE IV: Selected XAI techniques
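The mapping in Table IV can be read as a simple lookup from reference question ID and data type to an explanation method. The sketch below is one possible encoding, not the prototype's actual implementation; the function name `select_method` and the data type strings `"tabular"` and `"image"` are assumptions made for illustration. Only SHAP is listed for the feature importance questions, since it is the method chosen over LIME in the text.

```python
from typing import Optional

# Question IDs per method, transcribed from Table IV.
SHAP_IDS = {2, 3, 5, 6, 8, 47, 48, 49, 50}
DICE_IDS = {12, 13, 14, 51, 52, 53}    # tabular data
CFPROTO_IDS = {12, 14, 51, 52, 53}     # image data
ANCHOR_IDS = {17, 18, 19, 20}


def select_method(question_id: int, data_type: str) -> Optional[str]:
    """Return the XAI method for a reference question, or None if none applies."""
    if question_id in SHAP_IDS:
        return "SHAP"
    if question_id in ANCHOR_IDS:
        return "Anchor"
    if data_type == "tabular" and question_id in DICE_IDS:
        return "DICE"
    if data_type == "image" and question_id in CFPROTO_IDS:
        return "CFProto"
    return None  # non-XAI question or no matching method


method_for_why = select_method(47, "tabular")
method_for_why_not = select_method(53, "tabular")
method_for_image_cf = select_method(12, "image")
```

Returning `None` for unmatched IDs lets the caller fall back to the non-XAI information sources of Section V-A.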

SHAP [14] and LIME [21] are two widely used methods to quantify the importance of features for the prediction of a single instance. While LIME locally approximates the decision boundary with a linear model whose coefficients represent local feature importance, SHAP follows a game-theoretic approach to identify the contribution of each feature (player) in an additive setting. Quantitative feature importance values are relevant for answering questions about the contribution of features to the prediction (questions 2, 3, 5, 8, 49) as well as about the top features (question 6). Furthermore, feature importance can also explain why/how the prediction is given (questions 47, 48). To answer question 50, Why are instance A and instance B given the same prediction?, we show one explanation per instance. SHAP’s feature importance values have been shown to be more consistent with human judgment than those of LIME [14], and SHAP can explain both image and tabular data. We therefore use SHAP to answer the questions that require feature importance information.
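To make the game-theoretic idea concrete, the sketch below computes exact Shapley values by enumerating all feature coalitions. It follows the textbook definition and is not the SHAP library's algorithm, which relies on approximations to stay tractable. As an assumption of this sketch, features outside a coalition are set to a fixed baseline value, and `toy_score` is a hypothetical scoring function.

```python
from itertools import combinations
from math import factorial
from typing import Callable, Dict, List, Set


def shapley_values(predict: Callable[[Dict[str, float]], float],
                   instance: Dict[str, float],
                   baseline: Dict[str, float]) -> Dict[str, float]:
    """Exact Shapley values; cost grows exponentially with the number of features."""
    features: List[str] = list(instance)
    n = len(features)

    def value(coalition: Set[str]) -> float:
        # Features in the coalition keep the instance's value, the rest the baseline's.
        x = {f: (instance[f] if f in coalition else baseline[f]) for f in features}
        return predict(x)

    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi


def toy_score(x: Dict[str, float]) -> float:
    """Hypothetical model: two linear terms plus an interaction between a and c."""
    return 2.0 * x["a"] + 1.0 * x["b"] + x["a"] * x["c"]


instance = {"a": 1.0, "b": 3.0, "c": 2.0}
baseline = {"a": 0.0, "b": 0.0, "c": 0.0}
phi = shapley_values(toy_score, instance, baseline)
```

The contributions sum to the difference between the score of the instance and the score of the baseline, which is the additive property referred to above; the interaction term is split evenly between the two features involved.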

DICE [5] is an explanation method that focuses on counterfactual explanations for tabular data. Given an instance, DICE searches for the minimum feature changes required to get a different prediction, and therefore provides information to answer questions 12, 14, and 51–53. DICE can also identify the changes required in a specific feature to change the prediction to a different target class, and therefore provides the information for answering question 13: How should this feature change to get a different prediction?. Similar to DICE, CFProto [18] is a counterfactual explanation method, and it is applicable to image data. However, the method does not allow changing single, specific features to obtain a target class, because features in image data are hard to define. Thus, we apply CFProto to gather information for answering questions 12, 14, and 51–53 on image data. Anchor [8] computes sufficient conditions for a prediction, so-called anchors, such that as long as the anchor holds, changes to the remaining feature values of the instance do not matter. Therefore, it can be used to determine the boundaries of the prediction, which are suitable to answer questions 17–20.
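The notion of a counterfactual with minimal feature changes can be illustrated with a brute-force search over candidate values. This is deliberately not DICE's algorithm; it only shows the concept, including the restriction of the search to specific features as needed for question 13. The classifier `toy_predict`, its rule, the candidate values, and the user's original occupation are hypothetical.

```python
from itertools import combinations, product
from typing import Any, Callable, Dict, Iterable, List, Optional

Instance = Dict[str, Any]


def find_counterfactual(predict: Callable[[Instance], str],
                        instance: Instance,
                        candidates: Dict[str, List[Any]],
                        target: str,
                        allowed: Optional[Iterable[str]] = None) -> Optional[Instance]:
    """Return a counterfactual that changes as few features as possible, or None."""
    allowed_set = None if allowed is None else set(allowed)
    features = [f for f in candidates if allowed_set is None or f in allowed_set]
    for k in range(1, len(features) + 1):  # try one change first, then two, ...
        for subset in combinations(features, k):
            for values in product(*(candidates[f] for f in subset)):
                if any(instance[f] == v for f, v in zip(subset, values)):
                    continue  # skip assignments that leave a selected feature unchanged
                modified = {**instance, **dict(zip(subset, values))}
                if predict(modified) == target:
                    return modified
    return None


def toy_predict(x: Instance) -> str:
    """Hypothetical classifier with an invented rule."""
    if x["occupation"] == "Tech-support":
        return ">50K"
    if x["age"] >= 60 and x["workclass"] == "Self-emp-inc":
        return ">50K"
    return "<=50K"


user = {"age": 39, "workclass": "State-gov", "occupation": "Adm-clerical"}
candidates = {
    "age": [39, 50, 66],
    "workclass": ["State-gov", "Self-emp-inc"],
    "occupation": ["Adm-clerical", "Tech-support"],
}
cf_any = find_counterfactual(toy_predict, user, candidates, ">50K")
cf_restricted = find_counterfactual(toy_predict, user, candidates, ">50K",
                                    allowed=["age", "workclass"])
```

Exhaustive enumeration is only feasible for a handful of features and candidate values, which is one reason dedicated counterfactual methods are used in practice.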

In summary, to integrate XAI into the dialogue policy component, we systematically map the XAI questions to XAI methods. We describe how we use the output of the XAI methods to generate answers for the questions in the next section.

VI Generating Answers

XAI methods provide the core information to answer the corresponding questions, but they lack explanatory text in natural language for the end user. For end users, presenting just the raw information in the form of a table or importance values alongside feature names is not always adequate. Instead, additional context, such as what the values represent or how to interpret them, is desirable. To address this problem, we incorporate a template-based natural language component. For each question, we define text templates (partially with dataset-specific vocabulary) containing placeholders for the information obtained from XAI methods. We combine this information with textual explanations, or convert it to text, depending on the type of information generated by the XAI method. For images and graphs (e.g., SHAP’s outputs), we add a textual explanation. For tabular format (e.g., DICE’s outputs), we convert the table to natural language by extracting feature names and corresponding values. For example, in Fig. 1, the answer “The Occupation should be changed to Tech-support to get an income of more than 50K” is the combination of the template “The feature_A should be changed to value_V to get class” and the counterfactual information obtained from DICE [5] (the information was transformed from tabular format to text). In the case of counterfactual explanations, relations need to be extracted in addition to feature names and values. For the counterfactual question 53 (Why is this instance predicted P instead of Q?), we extract and compare the relation between the given instance with class P and a counterfactual instance with the target class Q. For example, in the Adult dataset, consider an instance with the feature “Age = 39” classified as <=50K and its counterfactual with “Age = 66.3” (and otherwise identical features) classified as >50K. For these instances, the relation is “Age is too low” (see Fig. 1). Internally, this comparison is again represented by placeholders and predefined text, which we combine with the XAI information to return the answer.
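A minimal sketch of such a template-based component is given below. The template string mirrors the example above, whereas the function names and the wording for categorical features (“should be … rather than …”) are assumptions of this sketch and not taken from the prototype.

```python
from typing import Any, Dict, List

# Template with placeholders for feature name, new value, and target class.
TEMPLATE = "The {feature} should be changed to {value} to get {cls}"


def fill_template(changes: Dict[str, Any], target: str) -> List[str]:
    """Turn a table of required feature changes into one sentence per feature."""
    return [TEMPLATE.format(feature=f, value=v, cls=target) for f, v in changes.items()]


def extract_relations(original: Dict[str, Any],
                      counterfactual: Dict[str, Any]) -> List[str]:
    """Describe how the original instance differs from its counterfactual."""
    relations = []
    for feature, old in original.items():
        new = counterfactual[feature]
        if new == old:
            continue  # unchanged features carry no explanation
        numeric = all(isinstance(v, (int, float)) and not isinstance(v, bool)
                      for v in (old, new))
        if numeric:
            relations.append(f"{feature} is too {'low' if old < new else 'high'}")
        else:
            # Wording for categorical features is an assumption of this sketch.
            relations.append(f"{feature} should be {new} rather than {old}")
    return relations


answer = fill_template({"Occupation": "Tech-support"}, "an income of more than 50K")
relations = extract_relations(
    {"Age": 39, "Workclass": "State-gov"},
    {"Age": 66.3, "Workclass": "Self-emp-inc"},
)
```

Keeping the templates separate from the extraction logic makes it straightforward to add dataset-specific vocabulary without touching the code that reads the XAI method's output.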

VII Conversation Scenarios

In this section, we show example conversations between a prototype implementation of our proposed framework and a user on different data sets (tabular data and images) with different types of predictive models.

VII-A Random Forest Classifier on Adult Data

We train a Random Forest (RF) classifier on the Adult dataset [13]. The task on this data set is to predict whether the income exceeds $50,000/year (abbreviated 50K) based on census data. We train the classifier using the sklearn library [26] and its standard parameter settings. The mean accuracy of the classifier using 3-fold cross-validation is 0.85. For explanations, we retrain the RF classifier with the same parameter settings on the full data set. The data set and the classifier are loaded at the beginning of the conversation.

Fig. 1 shows a conversation with the prototype agent (X-Agent). At the start of the conversation, the user inputs information about her features. Due to space constraints, we omit this part of the conversation in Fig. 1 and show how the X-Agent reacts to several questions about the model.

The first question is the request: Give me the reason for this prediction! The natural language understanding (NLU) component matches this question to the reference question Why is this instance given this prediction? in the question bank (question 47 in Table III). The Question-XAI method mapping (QX) selects SHAP [14] as the XAI method to provide the information for the answer. The natural language generation (NLG) component combines SHAP’s feature importance information with the predefined text “The above graph …” to respond to the user question.

For the next question, Why is this profile predicted <=50K instead of >50K, the labels <=50K and >50K are replaced by the token class before matching to reference question 53 in Table III Why is this instance predicted P instead of Q?. The QX component identifies DICE [5] as the explanation method for this reference question, and the information is translated into natural language. In detail, DICE returns a counterfactual instance with the desired target label (>50K), yielding two features (Age and Workclass) that need to change in order to obtain the desired prediction. The NLG component extracts the relations between feature values of the original (Age:39, Workclass:State-gov) and counterfactual instance (Age:66.3, Workclass:Self-emp-inc). In comparison to the counterfactual, Age of the original instance is lower and Workclass differs. These relations are converted and rendered as text in the final answer by the NLG component.

For the final question That’s hard, how could I change only Occupation to get >50K prediction?, the words “Occupation” and “>50K” are substituted by tokens feature and class respectively. Then, the question is matched to reference question 13 (see Table III) How should this feature change to get a different prediction?. DICE is again determined as the XAI method for providing the required information to answer this question. However, this question asks for a specific feature, i.e., constrains the search space of DICE for counterfactuals. Finally, the provided information is again translated to natural language.
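The substitution of dataset-specific words by tokens before matching can be sketched with a single regular expression built from the data set's feature names and class labels. The token spellings `<feature>` and `<class>` and the function name `normalize_question` are assumptions made for this sketch.

```python
import re
from typing import Iterable, List, Tuple


def normalize_question(question: str,
                       feature_names: Iterable[str],
                       class_labels: Iterable[str]) -> Tuple[str, List[Tuple[str, str]]]:
    """Replace feature names and class labels by tokens; also return the matched slots."""
    vocabulary = {term.lower(): "<class>" for term in class_labels}
    vocabulary.update({term.lower(): "<feature>" for term in feature_names})
    if not vocabulary:
        return question, []
    # Longest alternatives first, so that the longest possible term is matched.
    alternatives = sorted(vocabulary, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(t) for t in alternatives) + r")(?!\w)",
        re.IGNORECASE,
    )
    slots: List[Tuple[str, str]] = []

    def substitute(match: "re.Match[str]") -> str:
        token = vocabulary[match.group(1).lower()]
        slots.append((token, match.group(1)))  # remember the concrete value
        return token

    return pattern.sub(substitute, question), slots


features = ["Age", "Workclass", "Occupation"]
labels = ["<=50K", ">50K"]
normalized, slots = normalize_question(
    "That's hard, how could I change only Occupation to get >50K prediction?",
    features, labels,
)
```

The recorded slots keep the concrete feature and class the user asked about, so they can later be passed on to the explanation method, for instance to constrain the counterfactual search to the named feature.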

VII-B Convolutional Neural Network on MNIST

Fig. 5: Conversation example to explain a Convolutional Neural Network on MNIST

We use the MNIST data set and a pre-trained convolutional neural network [18] to showcase a conversation on an image data set (see Fig. 5). First, NLU matches the first question Why did you predict that? to reference question 47 Why is this instance given this prediction? (see Table III). Then, QX maps this question to SHAP [14] as the explanation technique. SHAP highlights the important parts of the image that lead to prediction 7. NLG adds an explanation in the form of natural language text to the information provided by SHAP (the image). For the second question How should this image change to get number 9 predicted?, number 9 is replaced by token class. NLU maps this processed question to reference question 12 (see Table III). QX identifies CFProto [18] as the method to answer this question. The output of CFProto is the modified image close to number 9. Finally, NLG generates the explanation text along with the output of CFProto.

VIII Conclusion

Following the conversational style of human-to-human explanations, we leveraged a conversational agent to explain machine learning models. To capture the variance of questions that can be asked about the topic, we extended an XAI question bank with paraphrases. Each set of a question and its paraphrases defines a specific information need, represented by a reference question. We presented a systematic analysis of methods that can address those information needs, aiming at a sufficient but small subset of all available XAI methods. Our XAI question phrase bank and XAI method collection, which are publicly available, can serve as guidance for the future development of XAI conversational agents. In future work, we plan to integrate a learning component for dialog policies to make the system self-adaptable from interactions.


  • [1] Liao, Q., Gruen, D. & Miller, S. Questioning the AI: Informing Design Practices for Explainable AI User Experiences. Proceedings Of The 2020 CHI Conference On Human Factors In Computing Systems. pp. 1-15 (2020)
  • [2] Werner, C. Explainable AI through Rule-based Interactive Conversation. EDBT/ICDT Workshops. (2020)
  • [3] Gilpin, L., Bau, D., Yuan, B., Bajwa, A., Specter, M. & Kagal, L. Explaining Explanations: An Overview of Interpretability of Machine Learning. 2018 IEEE 5th International Conference On Data Science And Advanced Analytics (DSAA). pp. 80-89 (2018)
  • [4] Adadi, A. & Berrada, M. Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI). IEEE Access. 6 pp. 52138-52160 (2018)
  • [5] Mothilal, R., Sharma, A. & Tan, C. Explaining Machine Learning Classifiers through Diverse Counterfactual Explanations. Proceedings Of The 2020 Conference On Fairness, Accountability, And Transparency. pp. 607-617 (2020)
  • [6] Liao, Q. & Varshney, K. Human-Centered Explainable AI (XAI): From Algorithms to User Experiences. ArXiv:2110.10790 [cs]. (2022,1)
  • [7] Rastogi, A., Zang, X., Sunkara, S., Gupta, R. & Khaitan, P. Towards Scalable Multi-Domain Conversational Agents: The Schema-Guided Dialogue Dataset. Proceedings Of The AAAI Conference On Artificial Intelligence. 34, 8689-8696 (2020,4)
  • [8] Ribeiro, M., Singh, S. & Guestrin, C. Anchors: High-Precision Model-Agnostic Explanations. Proceedings Of The AAAI Conference On Artificial Intelligence. 32, 1527-1535 (2018,4)
  • [9] Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., Spitzer, E., Raji, I. & Gebru, T. Model Cards for Model Reporting. Proceedings Of The Conference On Fairness, Accountability, And Transparency. pp. 220-229 (2019), Atlanta, GA, USA
  • [10] Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J., Wallach, H., III, H. & Crawford, K. Datasheets for Datasets. Commun. ACM. 64, 86-92 (2021,11), Association for Computing Machinery, New York, NY, USA
  • [11] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L. & Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. ArXiv:1907.11692 [cs]. (2019,7)
  • [12] Nauta, M., Trienes, J., Pathak, S., Nguyen, E., Peters, M., Schmitt, Y., Schlötterer, J., Keulen, M. & Seifert, C. From Anecdotal Evidence to Quantitative Evaluation Methods: A Systematic Review on Evaluating Explainable AI. ArXiv:2201.08164 [cs]. (2022,1)
  • [13] Kohavi, R. Adult Data Set. (1996)
  • [14] Lundberg, S. & Lee, S. A unified approach to interpreting model predictions. Proceedings Of The 31st International Conference On Neural Information Processing Systems. pp. 4768-4777 (2017,12)
  • [15] Guidotti, R., Monreale, A., Ruggieri, S., Turini, F., Giannotti, F. & Pedreschi, D. A Survey of Methods for Explaining Black Box Models. ACM Computing Surveys. 51, 93:1-93:42 (2018,8)
  • [16] Barredo Arrieta, A., Díaz-Rodríguez, N., Del Ser, J., Bennetot, A., Tabik, S., Barbado, A., Garcia, S., Gil-Lopez, S., Molina, D., Benjamins, R., Chatila, R. & Herrera, F. Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion. 58 pp. 82-115 (2020,6)
  • [17] Miller, T. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence. 267 pp. 1-38 (2019,2)
  • [18] Van Looveren, A. & Klaise, J. Interpretable Counterfactual Explanations Guided by Prototypes. Machine Learning And Knowledge Discovery In Databases. Research Track. pp. 650-665 (2021)
  • [19] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A. & Others Language models are few-shot learners. Advances In Neural Information Processing Systems. 33 pp. 1877-1901 (2020)
  • [20] Gao, T., Yao, X. & Chen, D. SimCSE: Simple Contrastive Learning of Sentence Embeddings. Proceedings Of The 2021 Conference On Empirical Methods In Natural Language Processing. pp. 6894-6910 (2021,11)
  • [21] Ribeiro, M., Singh, S. & Guestrin, C. “Why Should I Trust You?”: Explaining the Predictions of Any Classifier. Proceedings Of The 22nd ACM SIGKDD International Conference On Knowledge Discovery And Data Mining. pp. 1135-1144 (2016,8)
  • [22] Jurafsky, Daniel, and James H. Martin. “Speech and language processing (draft).” In preparation. Available from: https://web.stanford.edu/~jurafsky/slp3 (2018).
  • [23] Bastani, O., Kim, C. & Bastani, H. Interpretability via model extraction. ArXiv Preprint ArXiv:1706.09773. (2017)
  • [24] Nauta, M., Bree, R. & Seifert, C. Neural Prototype Trees for Interpretable Fine-Grained Image Recognition. Proceedings Of The IEEE/CVF Conference On Computer Vision And Pattern Recognition (CVPR). pp. 14933-14943 (2021,6)
  • [25] Chen, C., Li, O., Tao, C., Barnett, A., Su, J. & Rudin, C. This Looks like That: Deep Learning for Interpretable Image Recognition. Proceedings Of The 33rd International Conference On Neural Information Processing Systems. (2019)
  • [26] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M. & Duchesnay, E. Scikit-learn: Machine Learning in Python. Journal Of Machine Learning Research. 12 pp. 2825-2830 (2011)
  • [27] Amidei, J., Piwek, P. & Willis, A. The use of rating and Likert scales in Natural Language Generation human evaluation tasks: A review and some recommendations. Proceedings Of The 12th International Conference On Natural Language Generation. pp. 397-402 (2019,10)
  • [28] Manning, C., Raghavan, P. & Schütze, H. Introduction to Information Retrieval. (Cambridge University Press,2008)
  • [29] Bobrow, D., Kaplan, R., Kay, M., Norman, D., Thompson, H. & Winograd, T. GUS, a frame-driven dialog system. Artificial Intelligence. 8, 155-173 (1977)
  • [30] Kuzba, Michal and P. Biecek. “What Would You Ask the Machine Learning Model? Identification of User Needs for Model Explanations Based on Human-Model Conversations.” PKDD/ECML Workshops (2020).
  • [31] Goodfellow, I., Bengio, Y. & Courville, A. Deep Learning. (MIT Press,2016)
  • [32] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L. & Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. (2019)
  • [33] Tomsett, Richard, Dave Braines, Dan Harborne, Alun Preece, and Supriyo Chakraborty. “Interpretable to whom? A role-based model for analyzing interpretable machine learning systems.” WHI 2018.
  • [34] McKinney, S. M. et al. International evaluation of an AI system for breast cancer screening. Nature 577, 89–94 (2020)
  • [35] Jianfeng Gao, Michel Galley, and Lihong Li. 2018. Neural Approaches to Conversational AI. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval (SIGIR ’18). Association for Computing Machinery, New York, NY, USA, 1371–1374.