An Objective Metric for Explainable AI: How and Why to Estimate the Degree of Explainability

09/11/2021
by Francesco Sovrano, et al.
University of Bologna

Numerous government initiatives (e.g. the EU with the GDPR) are coming to the conclusion that the increasing complexity of modern software systems must be counterbalanced by some Right to Explanation and by metrics for the Impact Assessment of these tools, allowing humans to understand and oversee the output of Automated Decision Making systems. Explainable AI was born as a pathway to let humans explore and understand the inner workings of complex systems. But establishing what an explanation is and objectively evaluating explainability are not trivial tasks. With this paper, we present a new model-agnostic metric to measure the Degree of eXplainability of correct information in an objective way, exploiting a specific model from Ordinary Language Philosophy: Achinstein's Theory of Explanations. In order to understand whether this metric actually behaves as explainability is expected to, we designed a few experiments and a user-study on two realistic AI-based systems for healthcare and finance, involving well-known AI technology including Artificial Neural Networks and TreeSHAP. The results we obtained are very encouraging, suggesting that our proposed metric for measuring the Degree of eXplainability is robust across several scenarios and can eventually be exploited for a lawful Impact Assessment of an Automated Decision Making system.

1 Introduction

The ability of humans to understand and appreciate the outputs and behaviours of autonomous systems is complicated by the increasing complexity and unpredictability of the most recent and advanced systems.

Automated Decision-Making (ADM) systems are changing industry and society, and people and governments (e.g., the EU, California) have begun to be concerned about the impact they may have on our lives. This concern gave birth to the so-called Right to Explanation, which, among others, was introduced into EU legislation within the General Data Protection Regulation (GDPR), and further explored by the High-Level Expert Group on Artificial Intelligence (AI-HLEG), established in 2018 by the EU Commission.

As a result, the EU posed an interesting challenge to the eXplainable AI (XAI) community, by demanding more transparent, user-centred, and accountable approaches to ADM that guarantee the explainability of their workings. More precisely, GDPR art. 35 requires data controllers to prepare a Data Protection Impact Assessment (DPIA) for operations that are “likely to result in a high risk to the rights and freedoms of natural persons”. To this end, an Algorithmic Impact Assessment (AIA) can be intended as an instrument for ensuring certain minimal criteria of explainability kaminski2019algorithmic in ADMs, serving as an important “suitable safeguard” (Article 22) of individual rights.

This is certainly one of the reasons why we may be interested in a metric for automatically measuring the degree of explainability of information. In fact, controllers who use machine learning systems for processing personal data should be able to argue, in cases where Data Subjects or Data Protection Officers (see Articles 37-39 of the GDPR for more details on what a DPO is) claim that the logic of processing is explained way too vaguely, that they did what they could, providing an acceptable level of objective explainability of the respective algorithms.

In this paper, we propose a new model-agnostic approach and metrics to objectively evaluate explainability, in a manner that is mainly inspired by Ordinary Language Philosophy rather than Cognitive Science. Our approach is based on a specific model of explanation, Achinstein's theory of explanations, in which explanations are the result of an illocutionary (i.e., broad yet pertinent and deliberate) act of pragmatically answering a question. Accordingly, explanations are actually answers to many different basic questions (archetypes), each of which sheds a different light on the concepts being explained. As a consequence, the more (archetypal) answers an ADM can give about the important aspects of its explanandum, the more explainable it is.

We assert that it is possible to quantify the Degree of eXplainability (DoX) of a dataset by applying the Achinstein-based definition of explanation proposed in sovrano2021philosophy. Thus, drawing also from Carnap's criteria of adequacy of an explanation novaes2017carnapian, we frame the DoX as the average Explanatory Illocution of information on the Explanandum Aspects (novaes2017carnapian uses the term explicandum where we employ explanandum, but, by and large, we assume the two words can be used interchangeably: they both mean “what has to be explained” in Latin). More precisely, we propose an algorithm for measuring the DoX by means of pre-trained language models for general-purpose question answering, such as karpukhin2020dense and bowman2015large. Hence, we make the following hypothesis.

Hypothesis 1 (Main)

The DoX can describe explainability, so that, given the same explanandum, a higher DoX implies greater explainability and a lower DoX implies less explainability.

To verify this hypothesis, we designed two experiments, both with the objective of showing that explainability changes in accordance with DoX.

The first experiment follows a direct approach, comparing the DoX of an XAI-based system with that of its non-explainable counterpart. This approach is said to be direct because the amount of explainability of an XAI-based system is, by design, clearly and explicitly dependent on the output of the underlying XAI. Therefore, by filtering away the XAI's output, the overall system can be forced to be not explainable enough, by construction.

The second experiment follows an indirect approach, analysing the expected effects of explainability on the explainees. In fact, if the hypothesis is correct, the lower the DoX, the fewer explanations can be extracted, and the less effective (as per ISO 9241-210) an explainee is likely to be in achieving those explanatory goals that are not covered by the explanations. Hence, once all the components that may affect effectiveness are fixed, including the presentation logic (the mechanism for re-elaborating explainable information into explanations) and the explanandum, an increase in DoX should always correspond to a proportional increase in effectiveness, at least on those tasks covered by the information added by the DoX increment. To show this, we performed a user-study, measuring how the DoX affects the effectiveness of the explanations given by some XAI-based applications when the underlying explainable information changes.

We ran both the experiments on two XAI-based systems, respectively for the healthcare and finance domains:

  • A heart disease predictor based on XGBoost chen2016xgboost and TreeSHAP lundberg2020local.

  • A credit approval system based on a simple Artificial Neural Network and on CEM dhurandhar2018explanations.

In both experiments the results were encouraging, showing that Hypothesis 1 holds. Therefore, we believe that our technology for estimating the DoX might be used for a lawful Algorithmic Impact Assessment (AIA), as soon as what needs to be explained can be identified, under the requirements of the law, in the form of a set of precise Explanandum Aspects.

This paper is structured as follows. In Section 2 we give the background needed to properly introduce the theoretical models discussed in the remaining sections, including Achinstein's theory of explanations. In Section 3 we show how it is possible to quantify the degree of explainability by defining explaining as an illocutionary act of question answering and by automatically verifying Carnap's criteria by means of deep language models. In Section 4 we compare our proposed solution to the existing literature. In Section 5 we present and discuss our experiments, concluding in Section 6.

To guarantee the reproducibility of the experiments, we publish the source code of our proposed new metric at https://github.com/Francesco-Sovrano/DoXpy, as well as the code of the XAI-based systems, the user-study, and the remaining data mentioned within this paper.

2 Background

In this section we provide enough background to understand and support the rest of the paper. We briefly summarise a number of recent and not-so-recent approaches to the theory of explanation, with particular focus on Achinstein's. After that, we discuss how Achinstein's theory of explaining as a question-answering process is compatible with the existing XAI literature, highlighting how deep the connection between answering questions and explaining runs in this field.

2.1 Contemporary Developments in the Theory of Explanation

In philosophy the terms “truth” and “explanation” have divergent interpretations mayes2005theories. On one hand, according to the realist interpretation “the truth and explanatory power of a theory are matters of the correspondence of language with an external reality”, so it is more about depicting reality with words and intents. On the other hand, the epistemic interpretation defines explanation as a mere re-ordering of phenomena and experience to a greater degree, focusing on its power to explain observable phenomena rather than its literal truth. The failure to distinguish these senses of “explanation” can and does foster disagreements that are purely semantic in nature.

According to mayes2005theories, explanation in philosophy has been conceived within the following five traditions:

  • Causal Realism salmon1984scientific: explanation as a non-pragmatic articulation of the fundamental causal mechanisms of a phenomenon.

  • Constructive Empiricism van1980scientific: an epistemic theory of explanation that draws on the logic of why-questions and on a Bayesian interpretation of probability.

  • Ordinary Language Philosophy achinstein1983nature: the act of explanation as the illocutionary attempt to produce understanding in another by answering questions in a pragmatic way.

  • Cognitive Science holland1989induction: explaining as a process of belief revision, etc.

  • Naturalism and Scientific Realism sellars1963philosophy: rejects any kind of explanation of natural phenomena that makes essential reference to unnatural phenomena. Explanation is not something that occurs on the basis of pre-confirmed truths. Rather, successful explanation is actually part of the process of confirmation itself.

Only the first tradition is non-pragmatic; the others are pragmatic.

2.2 Achinstein’s Theory of Explanations

In 1983, achinstein1983nature was one of the first scholars to analyse the process of generating explanations as a whole, introducing his philosophical model of a pragmatic explanatory process.

According to the model, explaining is an illocutionary act coming from a clear intention of producing new understandings in an explainee by providing a correct content-giving answer to an open question. Therefore, according to this view, answering by “filling the blank” of a pre-defined template answer (as most One-Size-Fits-All approaches do) prevents the act of answering from being explanatory, for lack of illocution. These conclusions are quite clear and explicit in Achinstein's last works achinstein2010evidence, consolidated after a few decades of public debate.

More precisely, according to Achinstein’s theory, an explanation can be summarized as a pragmatically correct content-giving answer to questions of various kinds, not necessarily linked to causality. In some contexts, highlighting logical relationships may be the key to making the person understand. In other contexts, pointing at causal connections may do the job. And in still further contexts, still other things may be called for.

As a consequence, we see a deliberate absence of a reference taxonomy of questions (which would help categorize and better understand the nature of human explanations). This apparently results in a refusal to define a quantitative way to measure how pertinent an answer is to a question, justified by the important assertion that explanations have a pragmatic character, so that what exactly has to be done to make something understandable to someone may (in the most generic case) depend on the interests and background knowledge of the person seeking understanding douven2012peter.

In this sense, the strong connection of Achinstein's theory to natural language and natural users is quite evident, for example in the Achinsteinian concept of elliptical understandings as “understandings of what significance or importance X has in the present context” achinstein2010evidence, or in the concept of u-restrictions, whereby an utterance/explanation can be said to express a proposition if and only if it can appear in (many) contexts reasonably known by the explainee. Despite this, Achinstein does not at all reject the utility of formalisms, suggesting the importance of following instructions (protocols, rules, algorithms) for correctly explaining some specific things within specific contexts.

2.3 XAI and Question Answering

The idea of explaining as answering questions is not new to the field of XAI, and it is quite compatible with the common intuition of what constitutes an explanation.

Two distinct types of explainability are predominant in the literature of eXplainable AI: rule-based and case-based. Rule-based explainability is when the explainable information is a set of formal logical rules describing the inner logic of a model: its chain of causes and effects, how it behaves, why a given output follows from the input, what would happen if the input were different, etc. Case-based explainability is when the explainable information is a set of input-output examples (or counter-examples) meant to give an intuition of the model's behaviour. Examples of the latter are counterfactuals, contrastive explanations, and prototypes (instances of the ground truth considered similar to a specific input-output pair, where the similarity explains the model's behaviour).

Despite the different types of explainability one can choose, it appears to be always possible to frame the information provided by explainability with one or (sometimes) more questions. In fact, it is common in many works in the field ribera2019can; lim2009and; miller2018explanation; gilpin2018explaining; dhurandhar2018explanations; wachter2017counterfactual; rebanal2021xalgo; jansen2016s; madumal2019grounded to use archetypal (e.g. why, who, how, when) or more punctual questions to clearly define and describe the characteristics of explainability, regardless of its type.

For example, lundberg2020local assert that the local explanations produced by their TreeSHAP (an additive feature attribution method for feature importance) may enable “agents to predict why the customer they are calling is likely to leave” or “help human experts understand why the model made a specific recommendation for high-risk decisions”.

Similarly, dhurandhar2018explanations clearly state that they designed CEM (a method for the generation of counterfactuals and other contrastive explanations) to answer the question “why is input x classified in class y?”.

Also, rebanal2021xalgo propose and study an interactive approach where explaining is defined in terms of answering why-what-how questions. These are just some examples, among many, of how Achinstein's theory of explanations is already implicit in the existing XAI literature, highlighting how deep the connection between answering questions and explaining runs in this field. This connection has also been implicitly identified by lim2009and, miller2018explanation and gilpin2018explaining, who, analysing the XAI literature, hypothesise that a good explanation about an automated decision-maker answers at least the following questions:

  • What did the system do?

  • Why did the system do P?

  • Why did the system not do X?

  • What would the system do if Y happens?

  • How can I get the system to do Z, given the current context?

  • What information does the system contain?

Nonetheless, despite this compatibility, practically none of the works in XAI ever explicitly mentions Ordinary Language Philosophy's theories, preferring to refer to Cognitive Science's miller2018explanation; hoffman2018metrics instead. This is probably because Achinstein's illocutionary theory of explanations is seemingly difficult to implement in software, being utterly pragmatic and failing to give a definition of illocution precise enough for a computer program. In fact, user-centrality is challenging and sometimes not clearly connected to XAI's main goal of “opening the black-box” (e.g. understanding how and why an opaque AI model works).

Therefore, it appears that XAI is more focused on producing explainable software and explanations that generally follow a One-Size-Fits-All approach, answering well just one (or sometimes a few) punctual, pre-defined questions. This may become a problem when XAI-generated explanations alone have to be deployed in real-world applications, to real users.

To summarize wang2019designing; mueller2019explanation; miller2018explanation; miller2017explainable, despite several efforts to tackle these issues (e.g. miller2018explanation, darpa2016broad), we can notice that the majority of XAI tools:

  1. Frame different and convenient definitions of what constitutes an explanation (definitions frequently detached from the main accepted theories in philosophy, even if that is not a problem per se), relying on the authors' own intuitions.

  2. Lack a consistent approach to evaluating the quality of explanations, due to the fact that each tool has its own definition of them.

  3. Lack focus on user-centrality, being inclined towards One-Size-Fits-All approaches, e.g. trying to anticipate and pre-define the specific questions that a typical user might pose.

2.4 Automated Question Answering

Many deep language models exist for estimating the pertinence of an answer with respect to a given question. This task may have different names and natures, including: Question Answering, Question-Answer Retrieval, or Dense Passage Retrieval. Among them, plain Question Answering is the most common and widespread approach, and it usually consists of identifying and selecting, end-to-end, an answer from a whole document (text or image) given as input together with the question.

Current state-of-the-art general-purpose Question Answering algorithms, such as those collected by wolf2019transformers, usually have quadratic complexity in the size of the document to be searched, producing very short (e.g. 2-3 word) answers with an attached pertinence score.

On the other side, Question-Answer Retrieval (or Dense Passage Retrieval) follows a different approach, disentangling the identification of possible answers from the selection of the most pertinent one. More precisely, it is a mechanism for embedding questions and answers so that the inner product (or another similarity metric, e.g. cosine similarity) between the embedding of the question and that of an answer is a measure of the pertinence of the latter to the former.

Among the most important Question-Answer Retrieval models, we distinguish between those that use the answer's context for the generation of the embedding yang2019multilingual; karpukhin2020dense; roy2020lareqa and those that do not nogueira2019passage. Intuitively, using the answer's context should help the answer embedder to contextualise and disambiguate better, producing higher-quality embeddings.

Unlike plain Question Answering, Question-Answer Retrieval is less end-to-end, requiring the a priori identification of the snippets of text that can function as answers, but it is much faster: its complexity is usually proportional to the product of the size of the context (if any, normally a small paragraph) and the size of the answer (commonly contained in the context). We will refer to these (contextualised) possible answers as information units, so that Question-Answer Retrieval consists in identifying the proper information units for answering a given question.
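As a rough illustration of this retrieval-style pertinence scoring, the sketch below embeds a question and a few candidate information units and uses cosine similarity as the pertinence score. It assumes the sentence-transformers library and a generic pre-trained retrieval model; the model name and the example texts are illustrative, not the exact ones used in this paper.

```python
# A minimal sketch of embedding-based pertinence scoring (Question-Answer
# Retrieval style). Model name and example texts are illustrative only.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("multi-qa-mpnet-base-cos-v1")  # assumed generic retrieval model

question = "Why was the loan application denied?"
information_units = [
    "The application was denied because the external risk estimate was too low.",
    "A HELOC is a line of credit offered as a percentage of home equity.",
]

# Embed the question and the candidate answers separately, then use cosine
# similarity between the embeddings as the pertinence score.
question_embedding = model.encode(question, convert_to_tensor=True)
unit_embeddings = model.encode(information_units, convert_to_tensor=True)
pertinence_scores = util.cos_sim(question_embedding, unit_embeddings)[0]

for unit, score in zip(information_units, pertinence_scores.tolist()):
    print(f"{score:.3f}  {unit}")
```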

3 How to Quantify a Degree of Explainability

Not surprisingly, the informative content of state-of-the-art XAI is clearly polarised towards answering “why”, “what-if” or “how” questions. For example, feature importance techniques such as LIME ribeiro2016should, SHAP lundberg2017unified or Attention Masks vaswani2017attention are more about explaining how much a part of the input has been considered for the output, while contrastive and prototype-selection techniques such as CEM dhurandhar2018explanations or ProtoDash gurumoorthy2019efficient are more about concisely telling why input x is associated with output y, or what would happen if input x were different. On the other side, rule-extraction techniques dash2018boolean; wei2019generalized have the potential to answer “why”, “how” and “what-if” questions, depending on how the rules are framed.

Considering that “why”, “what-if” and “how” are different questions pointing to different types of information, which type is the best one? We assert that the correct answer is “none”. In fact, depending on the needs of the explainee, their background knowledge, the context, and potentially many other factors, each one of these explanation archetypes may be equally needed, together with all the others we left out, including “what”, “who”, “when”, etc.

Every different XAI mechanism seems to be highly specialised on a few facets of the explanandum, providing different details that may be useful or not, depending on the needs of the explainee. In other words, depending on the characteristics of the explainee (e.g. background knowledge, objectives, etc.), a combination of different XAI mechanisms may be required to really give meaningful and trustworthy insights on the inner logic of a black-box. Therefore, knowing the types of explainability that are covered by an XAI-based system, and also being able to quantify the degree of explainability involved, can be of utmost importance for understanding how explainable a system is and even whether it is really compliant with existing regulations.

Considering that explainability is fundamentally the ability to explain, it is clear that a proper definition of it requires a precise understanding of what constitutes the act of explaining. To this end, as pointed out also in Section 2, many definitions of what an explanation is exist, many of them overlapping. The problem with such definitions is that, by nature, they involve very abstract concepts, relying on further philosophical notions yet to be defined. For example, Achinstein's definition heavily relies on the notion of illocution to frame the act of explaining, while Cognitive Science's relies on the notion of “belief revision”, and many others on the notion of truth, etc.

Despite its criticalities, Achinstein's theory seemed to us the most suitable for our purposes, because it allows assessing the quality of an explanation on the basis of its pragmatic relevance to a question (a task that may seem too onerous and subjective; nonetheless, recent developments in modern artificial intelligence have shown that there exist tools yang2019multilingual; roy2020lareqa to objectively estimate the pertinence of an answer, allowing for the automation of a question answering process). More precisely, according to Achinstein's theory, explanations are the result of an illocutionary act of pragmatically answering a question. In particular, this means that there is a subtle and important difference between simply “answering questions” and “explaining”, and this difference is illocution.

An illocutionary act results from a clear intent to achieve the goal of that act, just as a promise is “what it is” only because of the intent to keep it. Likewise, illocution in explaining is what makes an explanation an explanation: it is the result of an underlying and proper intent to explain.

Notwithstanding this definition, illocution seems too abstract to be implementable in concrete software. Nonetheless, recent efforts towards the automated generation of explanations sovrano2021philosophy have shown that it may be possible to define illocution in a more “computer-friendly” way. As stated by sovrano2021philosophy, illocution in explaining involves informed and pertinent answers not just to the main question, but also to other questions of various kinds, even unrelated to causality, that are relevant to the explanations. Such questions can be understood as instances of archetypes such as why, why not, how, what for, what if, what, who, when, where, how much, etc.

Archetypal questions provide generic explanations on a specific aspect of the explanandum, in a given informative context, with a local or a global slant (i.e. linked or not to the specific computation as performed), which can precisely link the content to the informative goal of the person asking the question.

If we assume that the interpretation of Achinstein's theory of explanations given by sovrano2021philosophy is correct, then data or processes are said to be explainable when their informative content can adequately answer archetypal questions. Accordingly, we propose that the degree of explainability of information depends on the number of archetypal questions to which it can answer properly.

This definition of explainability is also indirectly confirmed by the existing literature. In fact, as discussed in Section 2, many works ribera2019can; lim2009and; miller2018explanation; gilpin2018explaining; dhurandhar2018explanations; wachter2017counterfactual; rebanal2021xalgo; jansen2016s; madumal2019grounded clearly and explicitly mention what content to include in an explanation, always identified by means of archetypal or more punctual questions. Also jansen2018worldtree, in a detailed analysis of the knowledge and inference requirements for elementary school science exams, identify that many types of knowledge are necessary to generate correct explanations/answers, including causality, actions, purposes, descriptions, etc.

Definition 1 (Archetypal Question)

An archetypal question is an archetype applied to a specific aspect of the explanandum. Examples of archetypes are the interrogative particles (why, how, what, who, when, where, etc.), their derivatives (why-not, what-for, what-if, how-much, etc.), or more complex interrogative formulas (what-reason, what-cause, what-effect, etc.). Accordingly, the same archetypal question may be rewritten in several different ways, as “why” can be rewritten as “what is the reason” or “what is the cause”. In other terms, archetypal questions identify generic explanations about a specific aspect to explain (e.g. a topic, an argument, a concept, etc.), in a given informative context.

For example, if the explanandum were “heart diseases”, there would be many aspects involved, including “heart”, “stroke”, “vessels”, “diseases”, “angina”, “symptoms”, etc. Some archetypal questions in this case might be “What is an angina?” or “Why a stroke?”.
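As a small illustration of Definition 1, the sketch below instantiates a handful of archetypes over a few aspects of the “heart diseases” example above; the archetype and aspect lists are toy examples, not the full sets used later in the paper.

```python
# Archetypal questions as archetypes instantiated over explanandum aspects
# (toy example for Definition 1; the lists are illustrative only).
archetypes = ["What is {}?", "Why {}?", "How {}?", "What if {}?"]
aspects = ["an angina", "a stroke", "heart diseases", "the symptoms"]

archetypal_questions = [arch.format(asp) for arch in archetypes for asp in aspects]
print(archetypal_questions[:4])
# ['What is an angina?', 'What is a stroke?', 'What is heart diseases?', 'What is the symptoms?']
```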

An answer to an archetypal question is said to be an archetypal explanation. Since an archetypal question is a generic question requiring a generic answer, an archetypal explanation summarises the information given as answers to the other punctual (non-archetypal) questions posed by (possibly) many different explainees, also at different moments in time.

Hence, drawing from Carnap’s criteria of adequacy of explanation novaes2017carnapian and from the interpretation of Achinstein’s theory given by sovrano2021philosophy, we define the Degree of eXplainability (DoX), and consequently also explainability, as follows:

Definition 2 (Explanatory Illocution and Degree of eXplainability)

Assuming that the content of the information is correct, explainability is a property that information possesses, and it can be measured in terms of Explanandum Aspects Coverage and Explanatory Illocution:

  • Explanandum Aspects Coverage: let A be the set of aspects contained in that information and E the set of relevant aspects to be explained about an explanandum; then the coverage is the set of explanandum aspects that are covered by that information (the aspects of E that are also in A), while the inverse-coverage is the set of uncovered aspects (the aspects of E not in A).

  • Explanatory Illocution: an estimate of how pertinently and in how much detail that information can answer a set of pre-defined archetypal questions about an explanandum aspect. Let D be the set of all the details contained in that information, and p(d, q, a) the pertinence of a detail d in D to an archetypal question q about an aspect a. Let also t be a pertinence threshold; then the Explanatory Illocution for a is the set of detail-question pairs whose pertinence p(d, q, a) is above t, together with their pertinence scores.

Consequently the Degree of eXplainability (DoX) and Weighted Degree of eXplainability (WeDoX) are framed as follows:

  • DoX: the average Explanatory Illocution per archetype, on the whole set of relevant aspects to be explained.

  • WeDoX: the weighted sum of the pertinences of each archetype composing the DoX. The archetype weights used for weighting the DoX are a set of real numbers, one per archetype, and they can be arbitrarily picked depending on the goals of the stakeholders (e.g. an explainee, a Data Controller, etc.).

Interestingly, the DoX, as we defined it, is akin to Carnap's central criteria of adequacy of explanation novaes2017carnapian: 1. similarity to the explanandum, 2. exactness, and 3. fruitfulness (simplicity is presented as subordinate to the other three requirements novaes2017carnapian). In fact, the number of relevant aspects covered by the information and the amount of detail it can provide roughly say how similar the information is to the explanandum, while the estimate of the exactness of multiple archetypal explanations says how fruitful that information can be for the formulation of many other different explanations, intended as the result of an illocutionary act of answering questions. However, differently from Carnap, our understanding of exactness is not that of adherence to standards of formal concept formation brun2016explication, but rather that of being precise or pertinent enough as an answer.
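To make Definition 2 more concrete, the sketch below shows one possible way to turn it into code, assuming a pertinence(detail, question) callable such as the embedding-based scorer sketched in Section 2.4. The function and variable names are ours, for illustration; the actual DoXpy implementation may organise the computation differently.

```python
# A rough sketch of Explanatory Illocution and DoX as read from Definition 2.
# "pertinence" is assumed to be a callable returning a score for a (detail,
# question) pair, e.g. the cosine-similarity scorer of Section 2.4.
from statistics import mean

ARCHETYPES = ["why", "how", "what", "what-if", "when", "who"]  # subset, for brevity

def explanatory_illocution(details, aspect, pertinence, threshold=0.5):
    """For one explanandum aspect, keep every (detail, pertinence) pair whose
    pertinence to the archetypal question exceeds the threshold."""
    return {
        archetype: [
            (detail, score)
            for detail in details
            if (score := pertinence(detail, f"{archetype} {aspect}?")) >= threshold
        ]
        for archetype in ARCHETYPES
    }

def dox(details, aspects, pertinence, threshold=0.5):
    """Average (per archetype) Explanatory Illocution over all relevant aspects."""
    per_aspect = [explanatory_illocution(details, a, pertinence, threshold) for a in aspects]
    return {
        archetype: mean(sum(score for _, score in ill[archetype]) for ill in per_aspect)
        for archetype in ARCHETYPES
    }
```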

3.1 The Weighted DoX

Despite all the good properties it has, the DoX by itself is not a metric. In fact, a metric is usually understood as a measure, a standard, a criterion against which something can be judged. But the (average) Explanatory Illocution by itself cannot help in judging whether a collection of information has a higher degree of explainability than another one, because it is a multi-dimensional estimate where each dimension is a different archetype. This characteristic makes it harder to say whether one DoX is greater than another.

For a proper metric, a mechanism is required for combining the pertinence scores of the DoX into a single value representing explainability. A naive approach might be to average all the pertinence estimates, but this would imply that all the archetypal questions and aspects have the same weight, and this should not be the case. In fact, it appears that in the literature there is a shared understanding of “why” explanations as the most important in XAI, sometimes followed by “how”, “what for”, “what if” or, more generally, “what”. For example, according to dodge2019explaining, local “why” explanations are more effective at exposing fairness discrepancies between different cases, while global “how” explanations seem to render more confidence in understanding the model and generally enhance the fairness perception, reducing communication costs raymond2021agree.

Hence, we propose to summarise the DoX by means of a weighted combination of pertinences, where weights are pre-defined for each archetypal question and aspect, depending on the main goals of the system. The resulting WeDoX is therefore goal-dependent, and it can act as a metric to judge whether the explainability of one system is greater than, equal to, or lower than that of another.

3.2 Computing the Explanatory Illocution

Explanatory Illocution is an estimate of the level of pertinence and detail that information (e.g. a set of documents, a paragraph, a single web-page, the output of an XAI-based system, etc.) has in answering some pre-defined archetypal questions. Thus, in order to compute an Explanatory Illocution, both an automatic mechanism for estimating pertinence and a mechanism for identifying details are needed, together with the selection of suitable archetypes.

For estimating pertinence, plain Question Answering is probably the easiest approach to implement or fine-tune, being fully end-to-end, but it may be computationally unfeasible for computing the DoX of sufficiently large information and numbers of Explanandum Aspects. Therefore, an approach like the one used by sovrano2021philosophy for archetypal question answering is more suitable to our ends, allowing both the identification of details (or information units) and the estimation of pertinence.

More precisely, sovrano2021philosophy use as information units a meaningful decomposition of grammatical dependency trees, so as to give the units the smallest granularity of information. As a consequence, using such sub-trees as information units guarantees (a rough illustration of this decomposition follows the list below):

  • A disentanglement of complex information bundles into the simplest units, so as to correctly estimate the level of detail covered by the information pieces, as required by Definition 2.

  • A better identification of duplicated units scattered throughout the information pieces, so as to avoid an over-estimation of the level of detail.

  • An easy way to detect when an answer is invalid because it is totally contained in the question, in which case its pertinence is forced to zero.
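The following sketch gives a rough idea of how information units could be extracted from dependency sub-trees with spaCy; it is a simplification of the decomposition used by sovrano2021philosophy, not their exact code, and it assumes the en_core_web_sm model is installed.

```python
# A simplified extraction of information units from dependency sub-trees.
# The chosen dependency labels and the example sentence are illustrative only.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The model denies the loan because the external risk estimate is too low.")

information_units = set()
for token in doc:
    # Take the span covered by selected clausal/nominal sub-trees as candidate units.
    if token.dep_ in {"ROOT", "advcl", "relcl", "ccomp", "xcomp", "nsubj", "dobj", "pobj"}:
        span = doc[token.left_edge.i : token.right_edge.i + 1]
        information_units.add(span.text)

for unit in sorted(information_units, key=len):
    print(unit)
```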

Nonetheless, in order to compute the Explanatory Illocution, in addition to the mechanisms for estimating pertinence and for identifying the set of details, we also need to define a set of archetypal questions for every aspect.

According to Definition 1, an archetypal question is a very generic question characterised by one or more interrogative formulas. The literature is full of different examples of such archetypal questions, many of which are used to classify both semantic and discourse relations he2015question; fitzgerald2018large; michael2017crowdsourcing; pyatkin2020qadiscourse. Interestingly, it is possible to identify a sort of hierarchy or taxonomy of such archetypes, ordered by their intrinsic level of specificity. For example, the simplest interrogative formulas (made only of an interrogative particle, e.g. what, why, when, who) can be seen as the most generic archetypes, while the more complex and composite the formula (e.g. what-for, what-cause), the more specific the question.

Hence, we decided to consider as the set of main archetypes all the interrogative formulas used in the literature he2015question; fitzgerald2018large; michael2017crowdsourcing; pyatkin2020qadiscourse to classify semantic relations within discourse, these being probably the simplest and most representative ones (different, more complex and more specific interrogative formulas may well be needed, depending on the various reasons and scenarios for which the DoX has to be computed). The main archetypes coming from Abstract Meaning Representation theory michael2017crowdsourcing are:

  • What is X? (60.9% of the dataset presented by michael2017crowdsourcing)

  • Who X? (17.5%)

  • How X? (6.9%)

  • Where X? (5.0%)

  • When X? (4.3%)

  • Which X? (2.9%)

  • Whose X? (1.9%)

  • Why X? (0.6%)

We refer to these archetypes as the primary ones because they consist only of interrogative particles. The main archetypes coming from PDTB-style discourse theory pyatkin2020qadiscourse, which we call secondary archetypes, are:

  • In what manner X? (25% of the dataset presented by pyatkin2020qadiscourse),

  • What is the reason X? (19%)

  • What is the result of X? (16%)

  • What is an example of X? (11%)

  • After what X? (7%)

  • While what X? (6%)

  • In what case X? (3%)

  • Despite what X? (3%)

  • What is contrasted with X? (2%)

  • Before what X? (2%)

  • Since when X? (2%)

  • What is similar to X? (1%)

  • Until when X? (1%)

  • Instead of what X? (1%)

  • What is an alternative to X? (<1%)

  • Except when X? (<1%)

  • Unless what X? (<1%)

The remaining secondary archetypes we selected are:

  • What if X? (overlapping with “What is the result of X?”)

  • What is X for? (overlapping with “What is the reason X?”)

  • How much X?

  • Who by X?

4 Related Works

In the literature we found only one mechanism claiming to serve as a metric for XAI. The mechanism proposed in hoffman2018metrics is clearly based on standard evaluations of usability, measuring explainability indirectly through the effects that the resulting explanations have on the subjects. More precisely, hoffman2018metrics's metric is mainly inspired by the Cognitive Science interpretation of explanations, requiring one to measure:

  • the subjective goodness of explanations,

  • whether users are satisfied by explanations,

  • how well users understand the AI systems,

  • how curiosity motivates the search for explanations,

  • whether the user’s trust and reliance on the AI are appropriate,

  • how the human-XAI work system performs.

In other terms, the metric presented by hoffman2018metrics relies heavily on subjective measurements. In contrast, our Weighted Degree of eXplainability (WeDoX) is a fully objective metric, so it can be used to understand whether the amount of explainability is objectively poor even if the resulting explanations are perceived as satisfactory and good by the explainees.

We deem this characteristic of DoX very important: if explanations are built over explainable information, a poor degree of explainability objectively implies poor explanations, no matter how good the adopted explanatory process is (or is perceived to be). “Users also do not necessarily perform better with systems that they prefer and trust more. To draw correct conclusions from empirical studies, explainable AI researchers should be wary of evaluation pitfalls, such as proxy tasks and subjective measures” buccinca2020proxy.

5 Experiments for Verifying the Proposed Solution

To verify Hypothesis 1 we designed a few experiments, to see whether our understanding and implementation of an algorithm for computing the Degree of eXplainability (DoX) is really aligned with the characteristics and nuances of explainability itself. To do so, we consider two standard XAI-based systems:

  • A heart disease predictor based on XGBoost chen2016xgboost and TreeSHAP lundberg2020local.

  • A credit approval system based on a simple Artificial Neural Network and on CEM dhurandhar2018explanations.

Normally, the output of these systems is always the combination of information coming from multiple sources, including:

  1. The XAI: input, output, involved logic.

  2. The AI wrapped by the XAI: input, output, involved logic.

  3. Documentation to guarantee the readability of the previous two types of information, to the users of the system.

  4. Information to pursue explanatory goals out of the reach of the AI, e.g. other informative content requested by the users of the system, by regulators and/or ethics experts, etc.

Therefore, the explainability of such XAI-based systems is the explainability of the aforementioned information, provided (as readable output) by the system. Intuitively, the mere black-box AI is not very explainable, because it does not answer many (archetypal) questions, e.g. why, when, how. This is where the different XAIs intervene, introducing more meaningful information and illocutionarily covering relevant aspects to explain.

We previously argued that the degree of explainability of a XAI-based system can be measured in terms of Explanatory Illocution on a set of chosen Explanandum Aspects, as per the definition given in Section 3. In order to verify this assertion, we need to show that there is a strong correlation between our DoX and the perceived amount of explainability.

So far, measuring explainability directly is not possible without a metric like the one we are proposing, except in a few naive cases. One such case is surely that of a simple XAI-based system, where the amount of explainability is, by design, clearly and explicitly dependent on the output of the underlying XAI, the black-box not being explainable by nature. Hence, by masking the XAI's output, the overall system can be forced to be not explainable enough. This characteristic can be exploited to partially verify Hypothesis 1, but not in a generic way, because this type of verification is based on a comparison with a lack of explainability rather than with different degrees of it.

Regardless, we ran a 1st experiment to see whether adding an XAI to a black-box AI does increase our estimate of DoX. This 1st experiment is presented in the following section, preceded by a thorough description of the XAI-based systems, and followed by a 2nd experiment and a few discussions.

5.1 How DoX is Computed

Throughout all these experiments we study how DoX varies under different circumstances. To this end, we implemented the algorithm for estimating DoX by using the same code for Aspect Overviewing described by sovrano2021philosophy and presented in Section 3.

Since our computation of DoX depends on the output of a deep language model (used for estimating answers' pertinence), we decided to consider more than one model during our experiments. In fact, assuming reasonable and acceptable performance for pertinence estimation on state-of-the-art benchmarks karpukhin2020dense; yang2019multilingual, we believe that the results of the computation of DoX should be consistent across the adopted deep language models. Hence, the models we considered are:

  • FB: published by karpukhin2020dense and reimers2019sentence, and trained on the combination of the following datasets: Natural Questions kwiatkowski2019natural, TriviaQA joshi2017triviaqa, WebQuestions berant2013semantic, and CuratedTREC baudivs2015modeling.

  • TF: the Multilingual Universal Sentence Encoder yang2019multilingual, trained on the Stanford Natural Language Inference (SNLI) corpus bowman2015large.

Furthermore, for computing the WeDoX, in all the experiments we arbitrarily considered the following set of archetype weights, centred on “why” and “how” questions (archetypes not listed below are by default set to 0; a minimal sketch of how these weights are combined is given after the list):

  • why: 1

  • how: 0.9

  • what-for: 0.75

  • what: 0.75

  • what-if: 0.6

  • when: 0.5
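As a minimal sketch, the weights above can be combined with per-archetype DoX scores into a single WeDoX value as follows; the per-archetype scores used in the example are placeholders, not the measurements reported later in Table 1.

```python
# Combining per-archetype DoX scores into a single WeDoX value using the
# archetype weights listed above (unlisted archetypes default to weight 0).
ARCHETYPE_WEIGHTS = {
    "why": 1.0,
    "how": 0.9,
    "what-for": 0.75,
    "what": 0.75,
    "what-if": 0.6,
    "when": 0.5,
}

def wedox(dox_per_archetype, weights=ARCHETYPE_WEIGHTS):
    """Weighted sum of the per-archetype pertinence scores composing the DoX."""
    return sum(weights.get(archetype, 0.0) * score
               for archetype, score in dox_per_archetype.items())

# Illustrative (made-up) per-archetype DoX values:
example_dox = {"why": 2.1, "how": 4.3, "what": 4.6, "when": 2.3}
print(round(wedox(example_dox), 2))
```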

5.2 The XAI-Based Systems

The credit approval system is the same one used in sovrano2021philosophy, designed by IBM to showcase its XAI library AIX360. This explanandum is about finance, and the system is used by a bank. The bank deploys an Artificial Neural Network to decide whether to approve a loan request, and it uses the CEM algorithm to create post-hoc contrastive explanatory information. This information is meant to help the customers, showing them which minimal set of factors has to be manipulated to change the outcome of the system from denial to approval (or vice versa).

The Artificial Neural Network was trained on the “FICO HELOC” dataset holterfico. The FICO HELOC dataset contains anonymized information about Home Equity Line Of Credit (HELOC) applications made by real homeowners. A HELOC is a line of credit typically offered by a US bank as a percentage of home equity. The Artificial Neural Network is trained to properly answer the following question: “What is the decision on the loan request of applicant X?”.

Given the specific characteristics of this system, it is possible to assume that the main goal of its users is to understand the causes behind a loan rejection and what to do to get the loan accepted. This is why the output of CEM is designed to answer the questions:

  • What are the easiest factors to consider in order to change the result of applicant X’s application?

  • How should factor F be modified in order to change the result of applicant X's application?

  • What is the relative importance of factor F in changing the result of applicant X’s application?

Nonetheless, many other relevant questions might have to be answered before the user is satisfied and reaches their goals. These questions include: “How to perform those minimal actions?”, “Why are these actions so important?”, etc.

On the other hand, the heart disease predictor is a completely new explanandum we designed specifically for the purposes of this paper. This explanandum is about health, and the system is used by a first-level responder of a help-desk for heart disease prevention. The system uses XGBoost chen2016xgboost to predict the likelihood of a patient having a heart disease given their demographics (gender and age), health parameters (diastolic blood pressure, maximum heart rate, serum cholesterol, presence of chest-pain, etc.) and electrocardiographic (ECG) results. This likelihood is classified into 3 different risk areas: low (probability of heart disease below 0.25), medium, or high. XGBoost is used to answer the following questions:

  • How likely is it that patient X has a heart disease?

  • What is the risk of heart disease for patient X?

  • What is the recommended action, for patient X to cure or prevent a heart disease?

The dataset used to train XGBoost is the “UCI Heart Disease Data” detrano1989international; alizadehsani2019database. TreeSHAP lundberg2020local, a well-known XAI algorithm for post-hoc explanations specialised in tree ensemble models (such as XGBoost), is used to understand the contribution of each feature to the output of the model. TreeSHAP can be used to answer the following questions (an illustrative sketch of this kind of pipeline follows the list):

  • What would happen if patient X had factor Y (e.g. chest-pain) equal to A instead of B?

  • What are the most important factors contributing to the predicted likelihood of heart disease, for patient X?

  • How does factor Y contribute to the predicted likelihood of heart disease for patient X?
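For concreteness, the sketch below shows the general shape of such a pipeline: an XGBoost classifier whose per-feature contributions are computed with SHAP's TreeExplainer. The data loading and feature names are placeholders, not the exact pre-processing used for the system described here.

```python
# Illustrative XGBoost + TreeSHAP pipeline (file names and data are placeholders).
import pandas as pd
import shap
import xgboost as xgb

# Hypothetical pre-processed heart disease data.
X = pd.read_csv("heart_disease_features.csv")
y = pd.read_csv("heart_disease_labels.csv").squeeze()

model = xgb.XGBClassifier(n_estimators=100, max_depth=4)
model.fit(X, y)

# TreeSHAP: per-feature contributions to the predicted likelihood.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X.iloc[[0]])  # explain one patient

for feature, contribution in zip(X.columns, shap_values[0]):
    print(f"{feature}: {contribution:+.3f}")
```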

The first-level responder is responsible for handling the patients' requests for assistance, forwarding them to the right physician in the event of a reasonable risk of heart disease. First-level responders get basic questions from callers; they are not doctors, but they have to decide on the fly whether the caller should speak to a real doctor or not. So they quickly use the XAI system to figure out what to answer to the callers and which next actions to suggest.

This system is used directly by the responder, and indirectly by the caller through the responder. These two types of users have different but overlapping goals and objectives. It is reasonable to assume that the goal of the responders is to answer the callers' questions in the most efficient and effective way. To this end, the questions answered by TreeSHAP are quite useful, but many other important questions should also be answered, including: “What is the easiest thing that the patient could actually do to change his heart disease risk from medium to low?”, “How could the patient avoid raising one of the factors, preventing his heart disease risk from rising?”, etc.

Both systems are examples of a Normal XAI-only Explainer (NXE), a One-Size-Fits-All explanatory mechanism providing the bare output of the XAI as a fixed explanation for all users, together with the output of the wrapped Artificial Intelligence (AI), a few extra details to ensure the readability of the results, and a minimum of context. In the case of the heart disease predictor, NXE consists of:

  • Context: a titled heading section kindly introducing the responder (the user) to the system.

  • AI Inputs: a panel for inserting the patient’s parameters.

  • AI Outputs: a section displaying the likelihood of heart disease estimated by XGBoost and a few generic suggestions about the next actions to take.

  • XAI Outputs: a section showing the contribution (positive or negative) of each parameter to the likelihood of heart disease, generated by TreeSHAP.

A screenshot of NXE for the heart disease predictor is shown in figure 1.

Figure 1: Heart Disease Predictor & NXE: A screenshot of the NXE explanatory tool for the heart disease predictor.

In the case of the credit approval system, NXE consists of:

  • Context: a titled heading section kindly introducing Mary (the user) to the system.

  • AI Output: the decision of the Artificial Neural Network for the loan application. This decision normally can be “denied” or “accepted”. For Mary it is: “denied”.

  • XAI Output: a section showing the output of CEM. This output consists of a minimal ordered list of the factors that are most important to change in order for the outcome of the AI to switch.

A screenshot of NXE for the credit approval system is shown in figure 2.

Figure 2: Credit Approval System & NXE: A screenshot of the NXE explanatory tool for the credit approval system.

5.3 1st Experiment: Normal XAI-based Explanations

The 1st experiment is meant to shed more light on how a few changes to the explainability of a system affect the estimated DoX. Specifically, XAI-based systems are considered for this experiment, instead of other AI-based systems, because their amount of explainability is, by design, clearly and explicitly dependent on the output of the underlying XAI, so that by masking the XAI's output the overall system can be forced to be less explainable. Hence, this characteristic can be exploited to (at least partially) verify Hypothesis 1.

The overall idea of this experiment is that if we compare the DoX of an NXE's output to the DoX of that same information without the XAI's outcome (namely a NAE, a Normal AI-based Explanation), the NXE's DoX should be clearly higher than the NAE's.

For computing the DoX, as the set of Explanandum Aspects we take those targeted by the respective XAIs. The main Explanandum Aspects targeted by XGBoost chen2016xgboost and TreeSHAP lundberg2020local in the Heart Disease Predictor (HD) are the following 5:

  • The recommended action for patient X

  • The most important factors that contribute to predict the likelihood of heart disease

  • The likelihood of heart disease

  • The risk R of having a heart disease

  • The contribution of Y to predict the likelihood of heart disease for patient X

The main Explanandum Aspects targeted by the Artificial Neural Network and CEM dhurandhar2018explanations in the Credit Approval System (CA) are the following 4:

  • The easiest factors to consider for changing the result

  • The relative importance of factor F in changing the result of applicant X’s application

  • Applicant X’s risk performance

  • The result of applicant X’s application

After properly converting the images produced by the NXE into textual explanations, the resulting Explanandum Aspects Coverage on NXE for both HD and CA is 100%, while on NAE it is 60% for HD and 50% for CA. Furthermore, since TF and FB differ in how they estimate pertinence, different pertinence thresholds had to be considered, so we set a distinct threshold for FB and for TF.
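As a minimal sketch of the coverage figures above: coverage can be read as the fraction of Explanandum Aspects mentioned in the textual explanation. The naive substring matching below is only illustrative; the actual DoXpy implementation may rely on more robust aspect matching.

```python
# Naive sketch of Explanandum Aspects Coverage as a fraction of covered aspects.
def aspects_coverage(explanandum_aspects, explanation_text):
    """Fraction of explanandum aspects mentioned in the textual explanation."""
    text = explanation_text.lower()
    covered = [aspect for aspect in explanandum_aspects if aspect.lower() in text]
    return len(covered) / len(explanandum_aspects)

aspects = ["likelihood of heart disease", "recommended action"]
print(aspects_coverage(aspects, "The likelihood of heart disease for this patient is high."))
# 0.5: only the first aspect is covered by this explanation
```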

Computing the DoX, we obtained the results displayed in Table 1, where for simplicity we show only the primary archetypes. Interestingly, in all the estimates of DoX obtained with TF, the “what” archetype is the most represented. We believe this is because TF is more sensitive to archetype overlap, i.e. many “why” questions can easily be rewritten as “what” questions.

WeDoX
          CA (NAE)   CA (NXE)   HD (NAE)   HD (NXE)
  FB         2.80       8.16      16.31      24.18
  TF        11.92      20.84      14.03      24.02

DoX with FB (primary archetypes only)
  CA (NAE): how 0.65, which 0.644, whose 0.637, what 0.622, who 0.617, when 0.614, where 0.601, why 0.584
  CA (NXE): whose 1.891, how 1.843, why 1.829, which 1.821, where 1.598, when 1.597, what 1.572, who 1.386
  HD (NAE): which 5.293, what 4.591, how 4.350, whose 2.430, when 2.309, why 2.124, where 2.094, who 2.083
  HD (NXE): what 6.72, which 6.534, how 5.636, whose 4.164, why 3.871, where 3.811, when 3.508, who 3.254

DoX with TF (primary archetypes only)
  CA (NAE): what 6.072, when 1.446, which 1.289, how 0.99, where 0.774, why 0.709, whose 0.19, who 0.079
  CA (NXE): what 9.127, when 3.377, which 3.216, how 2.2, whose 2.156, why 2.058, where 1.564, who 1.393
  HD (NAE): what 5.564, which 2.04, whose 1.946, why 1.89, when 1.652, where 1.319, who 1.312, how 1.208
  HD (NXE): what 7.131, which 4.94, why 4.111, how 3.489, where 3.367, when 3.356, whose 2.994, who 2.76

Table 1: Experiment 1 - Degree of eXplainability: DoX and WeDoX for the Credit Approval System (CA) and the Heart Disease Predictor (HD). The columns are the different explanatory mechanisms used in experiment 1 (NAE and NXE); the rows are the explainability estimates (DoX and WeDoX) obtained with the different deep language models used for computing pertinence (FB and TF). For simplicity, only the primary archetypes are shown for DoX.

As expected, on both XAI-based systems, the results of the 1st experiment neatly show that the NXE's WeDoX is much higher than the NAE's, regardless of the adopted deep language model.

Nonetheless, considering that in this 1st experiment we arbitrarily picked a simple set of Explanandum Aspects, what would happen if we considered different and more complex explicanda and explanatory contents? Furthermore, the result of the experiment is based on comparing the DoX of a non-explainable system (the NAE) with that of an explainable one, and this is a very peculiar and naive case to consider.

Therefore, in order to fully verify Hypothesis 1, we need to understand whether DoX also behaves properly when explainability is present in different, non-zero amounts. To do so, we envisage that explainability can be measured indirectly, by studying the effectiveness of the resulting explanations. In fact, more explainability implies a greater ability to explain and therefore more explanations.

In short, the lower the DoX, the fewer explanations can be produced, and the less effective an explainee is likely to be on the tasks related to the explanandum. So, if Hypothesis 1 is correct, once all the components that may affect effectiveness are fixed, including the presentation logic (the mechanism for re-elaborating explainable information into explanations) and the explanandum, an increase in the DoX of the (explanatory) system should always correspond to a proportional increase in its effectiveness, at least on those tasks covered by the information added by the DoX increment.

To verify this, we performed a user-study on the two XAI-based systems. As also discussed in the following sections, the goal of the study was to understand whether, when adding more explainable information to NXE and enlarging the scope of the explanandum, DoX still behaves as expected.

Hence, to explain the two XAI-based systems and their outcomes to the participants and verify the main hypothesis, we used two different sources of information. The first one is NXE’s output, while the second one is NXE’s output connected to a 2nd (non-expandable) level of information consisting of an exhaustive and verbose set of autonomous static explanatory resources in the form of web-pages. We will refer to this second source of explainable information as a 2nd-Level Exhaustive Explanatory Closure (2EC, in short).

Figure 3: Heart Disease Predictor & 2EC: A screenshot showing the connection between the 1st and the 2nd explanatory levels of 2EC on the heart disease predictor.

The connection between this 2nd level of information and the 1st level is simply a list of hyper-links to the autonomous resources, appended to the output of NXE, as shown in Figure 3 for the Heart Disease Predictor.

Compared to NXE, 2EC provides much longer explanations, with no structure and no re-organisation, apart from an automatically created table of contents allowing the user to move from the 1st explanatory level to the 2nd. The information presented at this 2nd level is the content of several resources (e.g. a few hundred web-pages) carefully selected to cover as much as possible of the explanandum topics.
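As a minimal sketch (with hypothetical function and variable names), this is roughly how a 1st-level output can be connected to the 2nd-level resources by appending an automatically built table of contents of hyper-links:

    # Minimal sketch (hypothetical names): append a table of contents of
    # hyper-links to the 1st-level (NXE) output so that the user can reach
    # the 2nd-level static resources of 2EC.

    def build_2ec_page(nxe_output_html, second_level_resources):
        # second_level_resources: list of (title, url) pairs, e.g. the web-pages
        # selected to cover the explanandum topics
        toc_items = "\n".join(
            f'<li><a href="{url}">{title}</a></li>'
            for title, url in second_level_resources
        )
        toc = "<h2>Further resources</h2>\n<ul>\n" + toc_items + "\n</ul>"
        return nxe_output_html + "\n" + toc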

For the credit approval system we take 58 web-pages, 50 of which come from MyFICO (https://www.myfico.com, the main resource about FICO scores), while the remaining come from Forbes (https://www.forbes.com), Wikipedia, AIX360 (http://aix360.mybluemix.net), and BankRate (https://www.bankrate.com). For the heart disease predictor we take 103 web-pages, 75 of which come from the website of the U.S. Centers for Disease Control and Prevention (https://www.cdc.gov), while the remaining come from the American Heart Association (https://www.heart.org), Wikipedia, MedlinePlus (https://medlineplus.gov), MedicalNewsToday (https://www.medicalnewstoday.com) and other minor sources.

We take considerably more information (almost double) for the heart disease predictor because, intuitively, it is a more complex explanandum than the credit approval system, usually requiring many more questions to be covered at different levels of detail.

5.4 The user-study

To verify our expectations we designed and ran a user-study involving 69 participants among the students of our university. These students came from a few different courses of study (only the master’s degree programmes are international, with students from different countries and courses taught in English):

  • Bachelor Degree in Computer Science

  • Bachelor Degree in Management for Informatics

  • Master Degree in Digital Humanities

  • Master Degree in Artificial Intelligence

The user-study consists of one quiz per XAI-based system, measuring effectiveness on a few pre-defined tasks in the form of questions. In other terms, each question in the quizzes is meant to represent an informative goal for one or more users of the systems. Since it is impossible and unfeasible to identify all the possible questions a real user would ask to reach their goals, we decided to select a few representative ones for the sake of the study.

So, we picked different types of questions, with different archetypes and complexities, and for each question we selected 4 to 8 plausible answers, of which only one is (the most) correct. Questions were selected in accordance with the expected users’ goals; in fact, both the heart disease predictor and the credit approval system have different but well-defined purposes.
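For concreteness, here is a minimal Python sketch of how such a question could be encoded; the structure and the example answer options are our own illustration, not the exact material shown to participants:

    # Minimal sketch (hypothetical structure): a quiz question, its archetypes,
    # whether NXE's output covers it, and its candidate answers.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class QuizQuestion:
        text: str
        archetypes: List[str]   # e.g. ["what", "how"]
        answered_in_nxe: bool   # True if the answer is in NXE's 1st-level output
        answers: List[str]      # 4 to 8 plausible answers
        correct_answer: str     # the single (most) correct one

    q1 = QuizQuestion(
        text="What did the Credit Approval System decide for Mary's application?",
        archetypes=["what", "how"],
        answered_in_nxe=True,
        answers=["It was rejected", "It was accepted", "Nothing", "I don't know"],
        correct_answer="It was rejected",
    )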

The heart disease predictor is designed to help a responder predict the likelihood of heart disease of a caller, suggesting the next concrete actions to take (e.g. a test, a new habit, etc.) to treat or avoid the disease, in accordance with the biological factors of the caller. Therefore, the questions we selected for the quiz on the heart disease predictor are:

  1. What are the most important factors leading that patient to a medium risk of heart disease?: a ’what’,’why’ question with answer in NXE.

  2. What is the easiest thing that the patient could actually do to change his heart disease risk from medium to low?: a ’what’,’how’ question with answer in NXE.

  3. According to the predictor, what level of serum cholesterol is needed to shift the heart disease risk from medium to high?: a ’what’,’how’ question with answer in NXE.

  4. How could the patient avoid raising bad cholesterol, preventing his heart disease risk from shifting from medium to high?: a ’how’ question with no answer in NXE.

  5. What kind of tests can be done to measure bad cholesterol levels in the blood?: a ’what’,’how’ question with no answer in NXE.

  6. What are the risks of high cholesterol?: a ’what’,’why-not’ question with no answer in NXE.

  7. What is LDL?: a ’what’ question with no answer in NXE.

  8. What is Serum Cholesterol?: a ’what’ question with no answer in NXE.

  9. What types of chest pain are typical of heart disease?: a ’what’,’how’ question with no answer in NXE.

  10. What is the most common type of heart disease in the USA?: a ’what’ question with no answer in NXE.

  11. What are the causes of angina?: a ’what’,’why’ question with no answer in NXE.

  12. What kind of chest pain do you feel with angina?: a ’what’,’how’ question with no answer in NXE.

  13. What are the effects of high blood pressure?: a ’what’,’why-not’ question with no answer in NXE.

  14. What are the symptoms of high blood pressure?: a ’what’,’why’ question with no answer in NXE.

  15. What are the effects of smoking on the cardiovascular system?: a ’what’,’why-not’ question with no answer in NXE.

  16. How can the patient increase his heart rate?: a ’how’ question with no answer in NXE.

  17. How can the patient try to prevent a stroke?: a ’how’ question with no answer in NXE.

  18. What is a Thallium stress test?: a ’what’,’why’ question with no answer in NXE.

Interestingly, many questions are polyvalent, in the sense that they can be rewritten using different archetypes. For example, the question “Why, in terms of factors, does that patient have a medium risk of heart disease?” can be rewritten as “What are the most important factors leading that patient to a medium risk of heart disease?”, and the question “How can an account become delinquent?” as “Why does an account become delinquent?”.

On the other hand, the credit approval system is designed to help an applicant (i.e. Mary) understand the result of her loan application, how to concretely change it, and what to do to get the loan accepted instead of denied. Therefore, the questions we selected for the quiz on the credit approval system are listed below (it is important to note that the last two questions are about the specific technology used by the system; in fact, in this specific context, the data subject (the loan applicant) should be aware of the technological limitations and issues of the automated decision maker (the credit approval system), as suggested by the GDPR):

  1. What did the Credit Approval System decide for Mary’s application?: a ’what’,’how’ question with answer in NXE.

  2. What is an inquiry (in this context)?: a ’what’ question with no answer in NXE.

  3. What type of inquiries can affect Mary’s score, the hard or the soft ones?: a ’what’,’how’ question with no answer in NXE.

  4. What is an example of hard inquiry?: a ’what’ question with no answer in NXE.

  5. How can an account become delinquent?: a ’how’,’why’ question with no answer in NXE.

  6. Which specific process was used by the Bank to automatically decide whether to assign the loan?: a ’what’,’how’ question with answer in NXE.

  7. What are the known issues of the specific technology used by the Bank (to automatically predict Mary’s risk performance and to suggest avenues for improvement)?: a ’what’,’why’ question with no answer in NXE.

We tried to keep the size of the two quizzes proportional to the complexity and richness of the explananda. Intuitively, the heart disease predictor is a much more complex explanandum with many more resources and questions to answer.

Participants were randomly allocated to test either NXE or 2EC only, but on both XAI-based systems (starting from the credit approval system, the simplest one). Despite this, many participants declined to test the heart disease predictor because it was too burdensome in terms of the minimum time required to complete the quiz. During the user-study we used the first question of the credit approval system as an attention check, discarding the participants that failed it. In fact, we expect that reading the very first lines of the initial explanation is sufficient to answer that question; therefore, participants who failed to answer it were likely answering (more or less) randomly or nonsensically, paying no attention to the task.

For the credit approval system (CA) we got 39 participants:

  • NXE: 19 participants, but 2 did not pass the attention check.

  • 2EC: 20 participants, but 2 did not pass the attention check.

For the heart disease predictor (HD) we got 30 participants:

  • NXE: 14 unique participants, but 1 did not pass the attention check.

  • 2EC: 16 participants, but 1 did not pass the attention check.
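A minimal sketch of the attention-check filtering described above follows; the data layout and names are hypothetical.

    # Minimal sketch (hypothetical data layout): discard participants that failed
    # the attention check, i.e. the first question of the CA quiz.

    def passes_attention_check(answers_in_order, correct_first_answer="It was rejected"):
        # answers_in_order: the participant's answers, in the order the questions were shown
        return bool(answers_in_order) and answers_in_order[0] == correct_first_answer

    def keep_valid_participants(all_answers):
        # all_answers: {participant_id: [answer_1, answer_2, ...]}
        return {pid: ans for pid, ans in all_answers.items() if passes_attention_check(ans)}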

To answer the effectiveness quizzes, participants were repeatedly asked to use only the information reachable from within the systems (i.e. by following the hyper-links provided there). In other terms, they were clearly instructed not to use Google or other external tools for answering. Participants were also:

  • Instructed to click on “I don’t know” in case they did not know an answer.

  • Informed that there is only one correct answer for each question and that, when multiple answers seem to be reasonably correct, only the most precise one is considered correct.

  • Notified when a wrong answer was given, and shown the correct one, in order to make them aware of their success or failure in reaching a goal.

Questions were shown in order, one by one, separately, and the possible answers were randomly shuffled.

At the end of the effectiveness quiz the given answers were automatically scored as correct (score 1) or not (score 0). For example, for the question “What did the Credit Approval System decide for Mary’s application?” the correct answer is “It was rejected”, while wrong answers include “Nothing” or “I don’t know”.
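Under the same assumptions as the earlier sketches, the scoring can be reduced to a few lines:

    # Minimal sketch: score a participant's quiz (1 per correct answer, 0 otherwise)
    # and express effectiveness as the fraction of correct answers.

    def effectiveness_score(answers_in_order, quiz):
        # quiz: list of QuizQuestion objects (see the earlier sketch);
        # answers_in_order: the participant's answers, in the same order
        correct = sum(1 for ans, q in zip(answers_in_order, quiz)
                      if ans == q.correct_answer)
        return correct / len(quiz)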

5.5 2nd Experiment: Explainability vs Effectiveness

This 2nd experiment aims at understanding whether changing the DoX also changes the effects of explainability on the explainee. We do so by comparing 2EC’s DoX with NXE’s, and by comparing the effectiveness scores measured by the user-study previously introduced. Considering that the amount of information handled by NXE is only a fraction of 2EC’s, even before computing the DoX we can say that NXE’s explainability is non-zero but lower than 2EC’s. Hence, we would expect the DoX of NXE to be lower than 2EC’s.

Both NXE and 2EC present different amounts of information in similar ways, hence their presentation logic (one of the components directly affecting effectiveness) can be considered fixed. Fixing also the set of Explanandum Aspects, what we expect from hypothesis 1, when comparing the effectiveness of NXE with 2EC’s, is that the higher the DoX, the higher the explanatory effectiveness of the system, at least on those tasks covered by the information responsible for the increment in DoX.

For computing the DoX, as the set of Explanandum Aspects we take all those covered by the questions and answers within the user-study quizzes. We identified 82 main relevant aspects for the 2EC of the Heart Disease Predictor (HD) and 40 for the Credit Approval System (CA). Computing the DoX, we obtained the results displayed in Table 2, confirming our expectations.

                     CA                    HD
              NXE       2EC         NXE       2EC
WeDoX (FB)    15.41     66.49       6         58.19
WeDoX (TF)    11.49     55.84       16.15     70.24

DoX (FB), per archetype (sorted by value):
  CA, NXE:  which 4.024; when 3.896; why 3.468; how 3.415; who 2.591; what 2.533; whose 2.028; where 1.706
  CA, 2EC:  why 16.143; when 15.429; how 15.096; where 14.440; which 14.04; who 13.641; whose 13.529; what 11.737
  HD, NXE:  which 1.151; why 1.133; what 1.125; how 0.874; whose 0.586; when 0.56; who 0.557; where 0.555
  HD, 2EC:  when 13.18; why 12.799; how 12.562; what 12.199; which 12.149; whose 11.625; who 11.617; where 10.084

DoX (TF), per archetype (sorted by value):
  CA, NXE:  what 6.027; why 1.551; which 1.503; where 0.876; how 0.67; when 0.596; whose 0.313; who 0.0
  CA, 2EC:  what 15.487; which 11.322; why 11.222; how 10.991; where 10.507; whose 10.310; when 9.933; who 9.888
  HD, NXE:  what 8.139; why 1.545; which 1.170; when 0.794; how 0.495; whose 0.477; who 0.414; where 0.389
  HD, 2EC:  what 19.311; which 14.558; how 14.161; when 14.043; why 13.748; where 13.359; whose 12.537; who 11.823

Table 2: Experiment 2 - Degree of eXplainability: in this table DoX and WeDoX are shown for the Credit Approval System (CA) and the Heart Disease Predictor (HD). As columns we have the different explanatory mechanisms used for experiment 2: NXE and 2EC. As rows we have different explainability estimates (DoX and WeDoX) using different deep language models for computing pertinence: FB and TF. For simplicity, with DoX we show only the primary archetypes.

On the other hand, the results obtained with the user-study clearly show that 2EC has a higher median effectiveness than NXE. From these results, we can also see that HD’s quiz was the most difficult for the participants: nobody managed to answer correctly more than 14 of the 18 questions, while CA’s maximum effectiveness score was 100%.

Figure 4 shows a consistent increment in the median effectiveness of 2EC on those questions whose aspects are not covered by the information presented with NXE. This increment is 13.33% for HD and 10% for CA, and it is higher for HD probably because its delta WeDoX (the difference between NXE’s and 2EC’s) is larger, for both FB and TF, than CA’s.
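As a minimal sketch (assuming the per-participant effectiveness scores of the two groups are available as plain lists), the increment in median effectiveness summarised in Figure 4 can be computed as follows:

    # Minimal sketch: increment of 2EC's median effectiveness over NXE's,
    # in percentage points, on a given subset of questions.
    from statistics import median

    def median_increment(nxe_scores, twoec_scores):
        # nxe_scores / twoec_scores: per-participant effectiveness scores in [0, 1]
        return 100 * (median(twoec_scores) - median(nxe_scores))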

We performed a one-sided Mann-Whitney U-Test (a non-parametric version of the t-test for independent samples) under the alternative hypothesis that NXE’s effectiveness is stochastically less than 2EC’s. The results confirmed that, at least for HD, there is significant statistical evidence that 2EC’s scores are greater than NXE’s and not just by chance: the p-value obtained for HD falls below the conventional threshold for statistical significance, while CA’s does not.
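The test above can be reproduced with SciPy; the scores below are placeholders, not the study’s data.

    # Minimal sketch: one-sided Mann-Whitney U-Test, alternative hypothesis that
    # NXE's effectiveness is stochastically less than 2EC's.
    from scipy.stats import mannwhitneyu

    nxe_scores = [0.33, 0.44, 0.28, 0.50, 0.39]      # hypothetical placeholder data
    twoec_scores = [0.56, 0.61, 0.44, 0.72, 0.50]    # hypothetical placeholder data

    u_statistic, p_value = mannwhitneyu(nxe_scores, twoec_scores, alternative="less")
    print(u_statistic, p_value)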

Figure 4: NXE vs 2EC - Effectiveness Scores on the Questions not covered by NXE: Comparison of NXE’s results (in blue) with 2EC’s (in orange), only on those questions whose aspects are not covered by the information presented with NXE. Results are shown as box plots (25th, 50th, 75th percentile, with whiskers covering all data and outliers). The numerical value of the medians is shown inside pink boxes. The 1st column is for the heart disease predictor (HD), the 2nd for the credit approval system (CA).

Unexpectedly, Figure 5 shows an important increment in the effectiveness of 2EC also on those questions (questions 1 and 6 in the CA quiz and questions 1, 2 and 3 in the HD quiz) whose aspects are specifically covered by NXE’s information. This increment is even higher than the previous one: 33% for HD and 25% for CA.

Figure 5: NXE vs 2EC - Effectiveness Scores on the Questions covered by NXE: Comparison of NXE’s results (in blue) with 2EC’s (in orange), only on those questions (questions 1 and 6 in the CA quiz and questions 1, 2 and 3 in the HD quiz) whose aspects are specifically covered by NXE’s information. Results are shown as box plots (25th, 50th, 75th percentile, with whiskers covering all data and outliers). The numerical value of the medians is shown inside pink boxes. The 1st column is for the heart disease predictor (HD), the 2nd for the credit approval system (CA).

These results suggest that hypothesis 1 holds. In fact, 2EC’s increase in Explanatory Illocution is clear across all archetypes, consistently with its expected increase in effectiveness, even though the presentation logic is clearly not user-centred, dumping dozens and dozens of pages of information on the user.

Interestingly, the increment in the cumulative pertinence of 2EC’s Explanatory Illocution falls mainly on the same archetypes covered by the quizzes’ questions, giving a further idea of how the DoX can be used to analyse the explainability of a system given a pre-defined explanandum and a set of goals (encoded by the quiz’s questions) for the user to reach. In fact, if the goal of the system is to convey certain specific explanatory contents (characterised by one or more archetypes), being able to quantify how much of these contents is approximately covered can definitely help to understand where to intervene to improve the explainability and effectiveness of the system.

6 Conclusions

The long-term goal of this paper is to change and improve the interaction between organisations and individuals through the automated assessment of the Degree of eXplainability (DoX) of AI-based systems or, more generally, of explainable information. This is why we described an algorithm for objectively quantifying the DoX of information, by estimating the number and quality of the explanations it could generate on the most important aspects to be explained.

In order to understand whether the DoX actually behaves as explainability is expected to, we designed a few experiments on two realistic AI-based systems for heart disease prediction and credit approval, involving well-known AI technology such as Artificial Neural Networks, TreeSHAP lundberg2020local, XGBoost chen2016xgboost and CEM dhurandhar2018explanations. The results we obtained show that the DoX is aligned with our expectations, and that it is possible to actually quantify the explainability of natural language information.

Surely this does not imply that an estimate of the DoX, alone, is enough for a thorough impact assessment under the law. For example, given that explainable information (e.g. an explanation) can be incorrect, our definition of DoX does not consider the degree of correctness of information, assuming that truth is given and that it is distinct from explainability. Nevertheless, we believe that this technology might be used for an Algorithmic Impact Assessment (AIA), as soon as a set of relevant Explanandum Aspects can be identified under the requirements of the law. Therefore, being able to select a reasonable threshold of explainability for law-compliance is certainly one of the next challenges we envisage for a proper standardisation of explainability in the industrial panorama.

Acknowledgements.
We would like to thank all the students who agreed to participate in the user-study free of charge.

References