Guidelines and evaluation for clinical explainable AI on medical image analysis

Explainable artificial intelligence (XAI) is essential for enabling clinical users to obtain informed decision support from AI and to comply with evidence-based medical practice. Applying XAI in clinical settings requires proper evaluation criteria to ensure the explanation technique is both technically sound and clinically useful, but specific support to achieve this goal is lacking. To bridge the research gap, we propose the Clinical XAI Guidelines, which consist of five criteria a clinical XAI needs to be optimized for. The guidelines recommend choosing an explanation form based on Guideline 1 (G1) Understandability and G2 Clinical relevance. For the chosen explanation form, its specific XAI technique should be optimized for G3 Truthfulness, G4 Informative plausibility, and G5 Computational efficiency. Following the guidelines, we conducted a systematic evaluation on a novel problem of multi-modal medical image explanation with two clinical tasks, and proposed new evaluation metrics accordingly. The 16 commonly-used heatmap XAI techniques we evaluated were not suitable for clinical use due to their failure to meet G3 and G4. Our evaluation demonstrates how the Clinical XAI Guidelines can support the design and evaluation of clinically viable XAI.


1 Introduction

Suppose an artificial intelligence (AI) developer Alex is developing a clinical AI system, and she wants to select an explainable AI (XAI) technique to make the AI model interpretable and transparent to clinical users. Given that there are numerous AI explainability techniques available, Alex may ask: How can I choose an AI explainability technique that is optimal for my target clinical task? She may look up the literature on XAI evaluation [Sokol2020, 10.1145/3387166, VILONE202189, 8400040, DBLP:journals/corr/abs-1806-00069] hoping it will guide her selection of XAI techniques. The literature suggests various selection criteria and computational- or human-level evaluation methods. But since Alex is building a clinical AI system which will assist doctors in life-or-death decisions, she may ask: Is it clinically viable to use these evaluation metrics? Will they help to meet doctors’ clinical requirements on AI explanation? How should I prioritize multiple evaluation objectives for a clinical XAI system?

Alex’s questions are prevalent when applying or proposing explainable AI techniques for clinical use. As a fast-advancing technology, AI has transformative potential in many medical fields [Zhang2019, Fujisawa2018, Mohan2020]. Nonetheless, there are outstanding barriers to the widespread translation of AI from bench to bedside [He2019]. One of the major barriers is the model explainability or transparency problem [Jin_2020, Kelly2019]: the decision process of the state-of-the-art AI technologies, i.e., deep neural networks (DNN), is not completely and intuitively comprehensible even to its human creators, due to its millions of parameters, complex feature representations in high-dimensional space, multiple layers of its decision process, and non-linear mappings from input space to output prediction.

AI developers, like Alex, resort to XAI techniques to explain AI decisions in human-understandable forms [doshivelez2017rigorous], and to enable clinical users to make informed decisions with AI assistance that comply with evidence-based medical practice (“Evidence-based medicine is the conscientious, explicit, judicious and reasonable use of modern, best evidence in making decisions about the care of individual patients.” [Masic2008]) [Sackett71]. Indeed, research has shown that explanations have the potential to help clinical users verify AI’s decisions [Ribeiro2016b], resolve disagreements with AI during decision discrepancy [10.1145/3359206], calibrate their trust in AI assistance [7349687, Zhang2020], identify potential biases [Caruana2015], facilitate biomedical discoveries [Woo2017], meet ethical and legal requirements [Amann2020, gdpr], and ultimately facilitate doctor-AI communication and collaboration to leverage the strengths of both [2101.01524, Topol2019, Carter2017].

Applying XAI in clinical settings requires proper evaluation to ensure the explanation technique is both technically sound and clinically useful. Although existing works on XAI evaluation proposed many real-world application desiderata and evaluation metrics [Sokol2020, 10.1145/3387166, VILONE202189, 8400040, jacovi-goldberg-2020-towards, DBLP:journals/corr/abs-1806-08049, hase-bansal-2020-evaluating, doshivelez2017rigorous, DBLP:journals/corr/abs-1806-00069], there is no canonical criterion for the goodness of an explanation, and it is unknown which evaluation objectives are suitable for clinical applications. The few emerging XAI evaluation works on medical image tasks, such as on retinal [10.1007/978-3-030-63419-3_3], endoscopic [DESOUZA2021104578], and chest X-Ray [Arun2021] imaging tasks, mainly focused on a single criterion, namely how well the explanation agrees with clinical prior knowledge, without justifying the selection of that criterion or its clinical applicability. This evaluation criterion may be confounded by factors outside the XAI methods themselves, such as model training and spurious patterns in the data, as detailed in §2.2. Furthermore, clear guidelines are lacking on which evaluation objectives should be applied and prioritized to correspond to clinical requirements on AI explanation.

To answer Alex’s questions and provide concrete support for the design and evaluation of clinical XAI, we propose the Clinical XAI Guidelines, which were developed with dual clinical and technical perspectives. The guidelines consist of five evaluation criteria: the form of explanation is selected based on Guideline 1 (G1) Understandability and G2 Clinical relevance; the specific explanation technique for the selected form is chosen based on G3 Truthfulness, G4 Informative plausibility, and operational considerations on G5 Computational efficiency. Following the guidelines, we conducted a systematic evaluation of 16 commonly-used feature attribution map (heatmap) techniques on two multi-modal medical image tasks. We also formulated a novel and clinically pervasive problem of multi-modal medical image explanation, which is a generalized form of single-modal medical image explanation, and proposed evaluation metrics for this problem accordingly. The evaluation showed that the existing heatmap methods met G1 and partially met G2, but did not meet G3 and G4, which suggests they are not suitable for clinical use.

Figure 1: The Clinical Explainable AI Guidelines. Explainable AI algorithms should meet the five criteria in the guidelines to be suitable for clinical use. The evaluation results of the 16 heatmap methods with respect to the guideline criteria are shown at the bottom.

Our key contributions are:

  1. We propose the Clinical XAI Guidelines to support the selection and design of clinically viable XAI techniques for medical imaging tasks.

  2. We conduct a systematic evaluation of multiple feature attribution map methods on two medical imaging tasks to give a holistic evaluation of their adherence to the guidelines.

  3. Departing from the de-facto single modality explanation, we propose the clinically important but technically ignored problem of multi-modal medical image explanation and propose a novel metric: modality-specific feature importance (MSFI) to quantify and automate physician assessment of explanation plausibility.

Roadmap

The manuscript is organized as follows: we first present the Clinical XAI Guidelines in §2, with their key points highlighted in Table 1 and Fig. 1. We then present the systematic evaluation of 16 existing heatmap explanation methods based on the guidelines, with the evaluation setup (§3), evaluation methods (§4), results (§5), and discussions (§6).

2 Clinical Explainable AI Guidelines

By leveraging collective expertise in AI, clinical medicine, and human factor analysis, we developed the Clinical XAI Guidelines based on a thorough physician user study, our pilot XAI evaluation experiments [aaai2022, DBLP:journals/corr/abs-2107-05047], and a literature review. The physician user study was conducted with 30 neurosurgeons on a glioma grading XAI prototype (Fig 2). We collected physicians’ quantitative ratings on the heatmap explanation, and qualitative comments on the XAI system from the interview sessions and an open-ended questionnaire. The qualitative data served as the clinical support for the guidelines. The detailed user study findings and method are in Supplementary Material S1; its related supporting sections are referenced in the paper with the prefix ‘U’.

Next, we present the Clinical XAI Guidelines, a checklist of five evaluation objectives a clinical XAI technique should be optimized for. They are categorized into three considerations on clinical usability, evaluation, and operation. For each objective in the guidelines, we list its key references from our user study or literature. Ways of assessment are also described to help identify whether the objective is met. The guidelines and their key points are summarized in Table 1. The full version of the guidelines is in the Appendix: Clinical Explainable AI Guidelines.

Consideration | Clinical XAI Guidelines | Ways of Assessment | Key References
Clinical Usability | G1: Understandable. Explanations should be easily understandable by clinical users without requiring technical knowledge. | Sketch explanation forms and show them to clinical users. | [jin2021euca]; U3.3: Making AI transparent by providing information on performance, training dataset, and decision confidence
Clinical Usability | G2: Clinically relevant. Explanation should be relevant to physicians’ clinical decision-making pattern, and can support their clinical reasoning process. | User study with clinical users, to inspect if the explanation corresponds to their clinical reasoning process. | U2.2. Resolving disagreement; U3. Clinical requirements of explainable AI
Evaluation | G3: Truthful. Explanations should truthfully reflect the AI model decision process. This is the prerequisite for G4. | Cumulative feature removal/addition test [DBLP:journals/corr/abs-2104-08782, NEURIPS2019_a7471fdc, DBLP:conf/nips/HookerEKK19, 7552539, Lundberg2020, 10.5555/3327757.3327875]; synthetic dataset [doshivelez2017rigorous, pmlr-v80-kim18d, DBLP:journals/corr/abs-1806-00069] | [jacovi-goldberg-2020-towards, Sokol2020, DBLP:journals/corr/abs-2006-04948]; U2.3. Verifying AI decision, and calibrating trust
Evaluation | G4: Informative plausibility. Users’ judgment on explanation plausibility may inform users about AI decision quality, including potential flaws or biases. | Statistical test on the correlation between AI decision quality measures and plausibility measures. | [jacovi-goldberg-2020-towards]; U5. Clinical assessment of explainable AI
Operation | G5: Fast. The speed to generate an explanation should be within clinical users’ tolerable waiting time on the given task. | Understand how time sensitive the clinical task is, and record speed and computational resources to generate an explanation. | U1.2.1: Decision support for time-sensitive cases, and hard cases
Table 1: The Clinical Explainable AI Guidelines for the design and evaluation of clinical explainable AI. The Ways of Assessment column provides existing evaluation methods as references to assess whether a guideline criterion is met. We list the key references which supported the development of the guidelines.
G - Guidelines, U - Physician user study findings (in Supplementary Material S1)

2.1 Clinical usability considerations

Guideline 1: Understandability.

The format and context of an explanation should be easily understandable by its clinical users. Users do not need to have technical knowledge in machine learning, AI, or programming to interpret the explanation.

Guideline 2: Clinical relevance.

The way physicians use explanations is to inspect the AI-based evidence provided by the explanation, and incorporate such evidence in their clinical reasoning process to assess its validity (U2.2. Resolving disagreement). To make XAI clinically useful, the explanation information should be relevant to physicians’ clinical decision-making pattern, and can support their clinical reasoning process.

For diagnostic/predictive tasks on medical images, physicians’ image interpretation process includes two steps: 1) feature extraction: physicians first perform pattern recognition to localize key features and identify their pathology; 2) reasoning on the extracted features: physicians perform medical reasoning and construct diagnostic hypotheses (differential diagnosis) based on the image feature evidence. A clinically relevant explanation needs to be aligned with such process, so that physicians can incorporate the explanation information into their medical image interpretation process (U3. Clinical requirements of explainable AI).

2.2 Evaluation considerations

Guideline 3: Truthfulness.

Explanation should truthfully reflect the model decision process. This is the fundamental requirement for a clinically oriented explanation, and an explanation method should fulfill the truthfulness requirement first prior to G4: Informative plausibility.

Counterexample:

One of the main clinical utilities of explanation is that clinical users intuitively use explanation plausibility assessment (G4) to verify the AI decision on a case, to decide whether to accept or reject the AI suggestion, and to calibrate their trust in the AI’s current prediction on the case, or in the AI model in general, accordingly (U2.3). Users do so with an implicit assumption that explanations are a true representation of the model decision process. If the truthfulness criterion is violated, two consequences may occur during physicians’ use of explanation:

1. Clinical users may wrongly reject AI’s correct suggestion merely because of the poor performance of the XAI method, which produces an unreasonable explanation.

2. If an XAI method is wrongly proposed or selected based on the explanation plausibility objective only, then rather than helping clinical users verify the decision quality, the explanation may be optimized to deceive clinical users with seemingly plausible explanations despite wrong predictions from the AI [DBLP:journals/corr/abs-2006-04948].

Assessment method:

The most common way to assess explanation truthfulness for feature attribution XAI methods in the literature is to gradually add or remove features from the most to the least important ones according to an explanation, and measure the model performance change [DBLP:journals/corr/abs-2104-08782, NEURIPS2019_a7471fdc, DBLP:conf/nips/HookerEKK19, 7552539, Lundberg2020, 10.5555/3327757.3327875]. Another way is to construct synthetic evaluation datasets in which the ground truth knowledge on the model decision process from input features to prediction is known and controlled [doshivelez2017rigorous, pmlr-v80-kim18d, DBLP:journals/corr/abs-1806-00069].

Guideline 4: Informative plausibility.

The ultimate use of an explanation is to be interpreted and assessed by clinical users. Physicians intuitively use the assessment of explanation plausibility or reasonableness (i.e., how reasonable the explanation is based on its agreement with human prior knowledge on the task) as a way to evaluate AI decision quality, so as to achieve multifaceted clinical utilities with XAI, including verifying AI’s decisions (U2.3), calibrating trust in AI (U2.3), ensuring the safe use of AI, resolving disagreement with AI (U2.2), identifying potential biases, and making medical discoveries (U2.4). Informative plausibility aims to validate whether an XAI method can achieve its utility in helping users identify potential AI decision flaws and/or biases. G3 Truthfulness is the gatekeeper of G4 Informative plausibility, to warrant that the explanation truthfully represents the AI decision process.

Assessment method:

To test whether explanation plausibility is informative to help users identify AI decision errors and biases, AI designers can assess the correlation between AI decision quality measures (such as model performance, calibrated prediction uncertainty, prediction correctness, and quantification of biased patterns) and plausibility measures.

Since human assessment of explanation plausibility is usually subjective and susceptible to biases (U5.2. Bias and limitation of physicians’ quantitative rating), AI designers may consider quantifying the plausibility measure by abstracting the human assessment criteria into computational metrics for a given task. The quantification of human assessment is NOT meant to directly select or optimize XAI methods for clinical use. Rather, XAI methods should be optimized for their truthfulness measures (G3). Quantifying plausibility is a means to validate the explanation informativeness, i.e., the effectiveness of XAI methods in their subsequent clinical utility to reveal AI decision flaws and/or biases, not an XAI evaluation end in itself. Quantifying plausibility can make such informativeness validation automatic, reproducible, standardizable, and computationally efficient. Similarly, the human annotation of important features according to physicians’ prior knowledge, which is used to quantify plausibility, cannot be regarded as the “ground truth” of explanation, because explanations (given that they fulfill G3 Truthfulness) are still acceptable even if they are not aligned with human prior knowledge, but instead reveal the model decision quality or help humans identify new patterns and make medical discoveries.

2.3 Operational consideration

Guideline 5: Computational efficiency

Since many AI-assisted clinical tasks are time-sensitive decisions (U1.2.1: Decision support for time-sensitive cases, and hard cases), the selection or proposal of clinical XAI techniques needs to consider the computational time and resources. The wait time for an explanation should not be a bottleneck in the clinical task workflow.

3 Evaluation problem setup

In the previous section, we presented the Clinical XAI Guidelines. Next, we apply the guidelines to a specific problem of multi-modal medical image explanation. Multi-modal medical images, such as multi-parametric MRI, have indispensable diagnostic value in clinical settings. Nevertheless, the related explanation problem has not yet been explored in the technical community. We conduct a systematic evaluation of 16 commonly-used XAI methods to inspect whether their explanations on multi-modal medical images can fulfill the five objectives outlined in the Clinical XAI Guidelines and can be applied clinically.

3.1 Multi-modal medical imaging: clinical interpretation, learning, and explanation

Our evaluation focuses on the novel problem of multi-modal medical image explanation, which can be regarded as a generalized form of single-modal medical image explanation. We present the clinical interpretation process of multi-modal images, the clinical requirements for multi-modal image explanation, and different model learning paradigms for multi-modal medical image data.

3.1.1 Multi-modal medical images and their clinical interpretation

Multi-modal medical images consist of multiple image modalities or channels, where each modality captures a unique signal of the same underlying cells, tissues, lesions, or organs [MartBonmat2010]. Multi-modal images widely exist in the biomedical domain. Examples include: different pulse sequences of the magnetic resonance imaging (MRI) technique, such as T1-weighted, T2-weighted, or fluid-attenuated inversion recovery (FLAIR) modalities; dual-modality imaging of positron emission tomography-computed tomography (PET-CT) [pmid12072843]; CT images viewed at different levels and windows to observe different anatomical structures such as bones, lungs, and other soft tissues [HARRIS1993241]; multi-modal endoscopy imaging [Ray2017]; photographic, dermoscopic, and hyper-spectral images of a skin lesion [8333693, Zherebtsov2019]; and multiple stained microscopic or histopathological images [Long2020, Song2013].

To interpret multi-modal images, doctors compare and combine modality-specific information to reason diagnosis and differential diagnosis. For instance, in a radiology report on MRI, radiologists usually observe and describe anatomical structures in T1 modality, and pathological changes in T2 modality [cochard_netter_2012, Bitar2006]; doctors can infer the composition of a lesion (such as fat, hemorrhage, protein, fluid) by combining its signals from different MRI modalities [Patel2016]. In addition, some imaging modalities are particularly crucial for the diagnosis and management of certain diseases, such as a contrast-enhanced modality of CT or MRI to a suspect tumor case, and diffusion-weighted imaging (DWI) modality MRI to a suspect stroke case [Lansberg2000].

3.1.2 Clinical requirements for multi-modal medical image explanation

We summarize our findings on the clinical requirements for multi-modal medical image explanation based on our user study with neurosurgeons (Supplementary Material S1) on a glioma grading task with multi-modal brain MRI.

To assess the plausibility of a multi-modal explanation, physicians require the explanation to 1) prioritize the important image modality for the model’s decision, where such prioritization may or may not need to be in concordance with physicians’ prior knowledge on modality prioritization; and 2) capture the modality-specific features. Such features may or may not totally align with doctors’ prior knowledge, but should at least be a subset of, and not deviate too much from, clinical knowledge.

3.1.3 Multi-modality learning

There are three major paradigms to build convolutional neural network (CNN) models that learn from multi-modal medical images, by fusing multi-modal features at the input level, feature level, or decision level [10.1007/978-3-030-32962-4_18]. Our evaluation covered two fusion settings: input-level fusion (the brain tumor grading task) and feature-level fusion (the knee lesion identification task). For multi-modal fusion at the input level, the multi-modal images are stacked as input channels to feed a CNN, and the modality-specific information is fused by summing the weighted modality values in the first convolutional layer. For multi-modal fusion at the feature level, each imaging modality is first fed to its own CNN branch to extract features, and the image features are aggregated at a deeper layer.
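The two fusion settings can be illustrated with a minimal PyTorch sketch; the layer sizes, 2D inputs, and class names are illustrative assumptions, not the architectures used in our experiments:

```python
import torch
import torch.nn as nn

class InputLevelFusionCNN(nn.Module):
    """Input-level fusion: stack the modalities as input channels of one CNN."""
    def __init__(self, n_modalities=4, n_classes=2):
        super().__init__()
        # The first convolution mixes modalities via its channel-wise weights.
        self.features = nn.Sequential(
            nn.Conv2d(n_modalities, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(16, n_classes)

    def forward(self, x):                       # x: (B, M, H, W)
        return self.classifier(self.features(x).flatten(1))

class FeatureLevelFusionCNN(nn.Module):
    """Feature-level fusion: one CNN branch per modality, features aggregated later."""
    def __init__(self, n_modalities=3, n_classes=2):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                          nn.AdaptiveAvgPool2d(1))
            for _ in range(n_modalities)
        ])
        self.classifier = nn.Linear(16 * n_modalities, n_classes)

    def forward(self, x):                       # x: (B, M, H, W)
        feats = [b(x[:, m:m + 1]).flatten(1) for m, b in enumerate(self.branches)]
        return self.classifier(torch.cat(feats, dim=1))

# Example usage with random data:
x4 = torch.randn(2, 4, 64, 64)                  # e.g., four MRI sequences stacked as channels
x3 = torch.randn(2, 3, 64, 64)                  # e.g., three knee MRI series
print(InputLevelFusionCNN(4)(x4).shape, FeatureLevelFusionCNN(3)(x3).shape)
```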

3.2 Clinical task, data, and model

We include two clinical tasks in our evaluation on multi-modal medical image explanation: glioma grading on brain MRI, and knee lesion identification on knee MRI. Next, we describe the clinical task, medical imaging dataset, and the training of CNN models prepared for the evaluation.

3.2.1 Glioma grading task

Clinical task

As a type of primary brain tumor, gliomas are among the most devastating cancers. Grading gliomas based on MRI provides physicians with indispensable information for a patient’s treatment plan and prognosis. We focus on the task of classifying gliomas into lower-grade gliomas (LGG) or high-grade gliomas (HGG).

Data

We used the publicly available BraTS 2020 dataset [Bakas2017] and a BraTS-based synthetic dataset (described in §4.3.3). Both are multi-modal 3D (BraTS) or 2D (synthetic) MR images that consist of four modalities of T1, T1C (contrast enhancement), T2, and FLAIR. The BraTS dataset contains physician annotated glioma localization masks that were used in the plausibility quantification.

Model

For the BraTS dataset, we trained a VGG-like [DBLP:journals/corr/SimonyanZ14a] 3D CNN with six convolutional layers that receives multi-modal 3D MR images. The evaluation results on the test set in a five-fold cross-validation are reported as (mean ± std) accuracy. We used a weighted sampler to handle the imbalanced data. The models were trained with batch size = 4 and training epochs of 32, 49, 55, 65, and 30 for each fold, selected on the validation data.

For the synthetic glioma dataset, we fine-tuned a pre-trained DenseNet121 model [8099726] that receives 2D multi-modal MRI input slices. We used the same training strategies as described above, and report the model accuracy on the test set.

3.2.2 Knee lesion identification task

Clinical task

MRI is the workhorse in diagnosing knee disorders with high accuracy [Rosas2009]. We focus on the task of identifying meniscus tears vs. intact menisci based on knee MRI.

Data

We used the publicly available knee MRI dataset MRNet [Bien2018]. It consists of three modalities showing the knee structure from the coronal, sagittal, and axial view. The coronal view can be T1 weighted, or T2 weighted with fat saturation. The sagittal view is proton density (PD) weighted, or T2 weighted with fat saturation. Finally, the axial view is PD weighted with fat saturation.

We use bounding boxes of the meniscus as the representation of human prior knowledge in the explanation plausibility quantification. They were annotated by the first author, who holds an M.D. degree, based on knee MRI lesion interpretation principles [Rosas2009]. The bounding boxes are not exact annotations that localize the specific tear lesion; they only outline the anatomical location of the lateral and medial meniscus as a whole. This is meant to be closer to the practical real-world XAI evaluation scenario, where minimal annotation effort and domain expertise are required.

Model

We used the same model architecture and training paradigm as the third-place entry of the MRNet challenge [Bien2018], which fused multi-modal information at the feature level. We trained five models by varying only their random states of parameter initialization. The model performance, measured as the area under the curve (AUC) on the validation set, is equivalent to the values reported in [Bien2018]; the test AUC, however, is lower.

3.3 Post-hoc feature attribution explanation methods

We chose the feature attribution explanation form based on the user study assessment of G1 Understandability (detailed in §4.1). For feature attribution map methods, we focus on methods that are post-hoc. These methods are proxy models that probe the model parameters and/or input-output pairs of an already deployed or trained black-box model. In contrast, ante-hoc heatmap methods, such as the attention mechanism, are predictive models with explanations baked into the training process. We leave out the ante-hoc methods because such explanations are entangled with their specialized model architectures, which would introduce confounders in the evaluation. We include 16 post-hoc XAI algorithms in our evaluation, which belong to two categories:

  • Gradient-based: Gradient [simonyan2014deep], Guided BackProp [springenberg2015striving], GradCAM [8237336], Guided GradCAM [8237336], DeepLift [10.5555/3305890.3306006], InputXGradient [shrikumar2017just], Integrated Gradients [10.5555/3305890.3306024], Gradient Shap [NIPS2017_8a20a862], Deconvolution [10.1007/978-3-319-10590-1_53], Smooth Grad [smilkov2017smoothgrad]

  • Perturbation-based: Occlusion [10.1007/978-3-319-10590-1_53, DBLP:conf/iclr/ZintgrafCAW17], Feature Ablation, Shapley Value Sampling [CASTRO20091726], Kernel Shap [NIPS2017_8a20a862], Feature Permutation [JMLR:v20:18-760], Lime [Ribeiro2016b]

A detailed review of these algorithms and the heatmap post-processing method is in Supplementary Material S2. Code is available at: http://github.com/weinajin/multimodal_explanation.
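Most of the 16 methods above have implementations in open-source attribution libraries. As a hedged illustration, the sketch below assumes the Captum library and a stand-in model; in practice, the trained task model and real multi-modal images would be used:

```python
import torch
import torch.nn as nn
from captum.attr import IntegratedGradients, Occlusion  # assuming the Captum library

# A stand-in multi-modal CNN (input-level fusion over 4 MRI modalities); in practice
# this would be the trained task model.
model = nn.Sequential(
    nn.Conv2d(4, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 2),
).eval()
x = torch.randn(1, 4, 128, 128)              # one multi-modal input (batch, modality, H, W)
target = int(model(x).argmax(dim=1))

# Gradient-based example: Integrated Gradients w.r.t. a zero baseline.
ig = IntegratedGradients(model)
heatmap_ig = ig.attribute(x, baselines=torch.zeros_like(x), target=target)

# Perturbation-based example: Occlusion with a sliding window over one modality at a time.
occ = Occlusion(model)
heatmap_occ = occ.attribute(x, target=target,
                            sliding_window_shapes=(1, 8, 8),
                            strides=(1, 4, 4))

# Both heatmaps have the same shape as the input, so a modality-wise importance score
# can be obtained by summing the positive attributions over the spatial dimensions.
modality_scores = heatmap_ig.clamp(min=0).sum(dim=(2, 3))
```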

4 Evaluation methods

We present the systematic evaluation to inspect whether the commonly-used heatmap methods can be applied clinically to explain model decisions on multi-modal medical images. The evaluation follows the Clinical XAI Guidelines (§2) to ensure the evaluation results can indicate their suitability for clinical settings.

4.1 Evaluating G1: Understandability

We applied the end-user XAI prototyping method [jin2021euca] and asked our clinical collaborator to comment on and select understandable explanation forms. Based on the neurosurgeon’s feedback and XAI technique availability, we targeted the explanation form of the feature attribution map (namely, heatmap).

4.2 Evaluating G2: Clinical relevance

To further identify the clinical relevance of heatmap explanation in the clinical usage scenario, we built an XAI prototype (Fig. 2) and conducted a user study with neurosurgeons. The user study method and findings are detailed in Supplementary Material S1.

Figure 2: XAI prototype for the user study evaluation on G2 Clinical relevance.

4.3 Evaluating G3: Truthfulness

For the truthfulness assessment, we conducted a cumulative feature removal test and a modality importance (MI) evaluation for the two clinical tasks, and proposed two novel metrics, diffAUC and MI correlation, respectively. We further conducted a synthetic data experiment on the glioma grading task.

4.3.1 Cumulative feature removal

To test whether the heatmap-highlighted regions are truly important features for the model’s decision, we cumulatively removed the input image features from the most to the least important ones according to the heatmap quantile, and plotted the relationship between the cumulative feature removal and the model performance metric (accuracy for the glioma task, and AUC for the knee task). The metric diffAUC slightly modifies the feature removal experiment method in the literature [DBLP:journals/corr/abs-2104-08782, NEURIPS2019_a7471fdc, DBLP:conf/nips/HookerEKK19, 7552539, Lundberg2020, 10.5555/3327757.3327875] by introducing a random baseline for a fair comparison among different XAI methods. For different XAI methods, the absolute numbers of highlighted image pixels/voxels are different, thus the performance deterioration measure may be confounded by the number of highlighted image regions. diffAUC quantifies the degree of performance deterioration by calculating the AUC difference between an XAI algorithm and its baseline. The random baseline curve is generated by removing the same number of random features at each feature removal quantile. An XAI algorithm with a larger diffAUC can better identify the important features for model prediction compared with the random baseline.
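A minimal sketch of the cumulative feature removal test and the diffAUC computation is given below; the removal quantiles, the zero-fill removal operation, and the sign convention (larger diffAUC is better) are our illustrative assumptions:

```python
import numpy as np

def removal_curve(performance_fn, images, heatmaps, quantiles, rng=None):
    """Model performance after cumulatively removing the top-q fraction of features.

    performance_fn(images) -> scalar metric (e.g. accuracy or AUC) of the model on the
    (possibly perturbed) images. heatmaps rank the features per image; if rng is given,
    random features are removed instead (the baseline curve)."""
    scores = []
    for q in quantiles:
        perturbed = []
        for img, hm in zip(images, heatmaps):
            if rng is not None:                         # random-removal baseline
                ranking = rng.permutation(img.size).reshape(img.shape)
            else:                                       # heatmap-guided removal
                order = (-hm).flatten().argsort()       # most important first
                ranking = np.empty(img.size, dtype=int)
                ranking[order] = np.arange(img.size)
                ranking = ranking.reshape(img.shape)
            mask = ranking < q * img.size               # top-q fraction of features
            perturbed.append(np.where(mask, 0.0, img))  # "remove" = set to zero
        scores.append(performance_fn(np.stack(perturbed)))
    return np.array(scores)

def diff_auc(performance_fn, images, heatmaps, quantiles=np.linspace(0.0, 1.0, 11)):
    curve_xai = removal_curve(performance_fn, images, heatmaps, quantiles)
    curve_rand = removal_curve(performance_fn, images, heatmaps, quantiles,
                               rng=np.random.default_rng(0))
    trapz = lambda y: float(np.sum((y[1:] + y[:-1]) / 2.0 * np.diff(quantiles)))
    # A faithful heatmap degrades performance faster than random removal, so the
    # area under its deterioration curve is smaller; diffAUC > 0 reflects this.
    return trapz(curve_rand) - trapz(curve_xai)
```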

4.3.2 Modality importance

For multi-modal medical image explanation, we want to assess how truthfully a heatmap reflects the modality importance information used in the model decision process. This corresponds to the clinical requirements on modality prioritization (U4.2. The role and prioritization of multiple modalities). We first calculate the ground-truth modality importance score using the Shapley value method, then calculate the correlation between the modality-wise sum of heatmap values and the ground truth, which we call the modality importance correlation (MI correlation).

To determine the ground-truth modality importance, we use the Shapley value from cooperative game theory [RM-670-PR], due to its desirable properties of efficiency, symmetry, linearity, and marginalism. In a set of modalities $M$, the Shapley value treats each modality as a player in a cooperative game. It is the unique solution that fairly distributes the total contribution (in our case, the model performance) to each individual modality $m$.

We define the modality Shapley value $\varphi_m$ to be the ground-truth modality importance score for a modality $m$. It is calculated as:

$$\varphi_m = \sum_{S \subseteq M \setminus \{m\}} \frac{|S|!\,(|M|-|S|-1)!}{|M|!} \bigl( v(S \cup \{m\}) - v(S) \bigr) \tag{1}$$

where $v$ is the modality-specific performance metric (accuracy for the glioma task, and AUC for the knee task), and $S$ denotes a modality subset not including modality $m$.

To measure the agreement of heatmaps’ modality importance value with the ground truth modality Shapley value, for each heatmap, we define the estimated MI as the modality-wise sum of all positive values in the heatmap.

MI correlation measures the MI ranking agreement between the ground-truth and the estimated MI, calculated using Kendall’s Tau-b correlation.
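Both steps can be sketched as follows, assuming a hypothetical helper `metric_on_modalities(subset)` that evaluates the model performance when only the modalities in `subset` are provided (how the absent modalities are ablated, e.g. zeroed out, is an implementation choice):

```python
from itertools import combinations
from math import factorial
import numpy as np
from scipy.stats import kendalltau

def modality_shapley(metric_on_modalities, modalities):
    """Ground-truth modality importance: the Shapley value of each modality,
    with the model performance metric as the payoff function."""
    n = len(modalities)
    phi = {}
    for m in modalities:
        others = [x for x in modalities if x != m]
        value = 0.0
        for k in range(n):
            for subset in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                value += weight * (metric_on_modalities(set(subset) | {m})
                                   - metric_on_modalities(set(subset)))
        phi[m] = value
    return phi

def mi_correlation(heatmap, gt_shapley, modalities):
    """Kendall's tau-b between ground-truth MI and the heatmap's estimated MI
    (modality-wise sum of positive heatmap values). heatmap: array of shape (M, ...)."""
    estimated = [np.clip(heatmap[i], 0, None).sum() for i, _ in enumerate(modalities)]
    ground_truth = [gt_shapley[m] for m in modalities]
    tau, _ = kendalltau(ground_truth, estimated)   # scipy's default variant is tau-b
    return tau
```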

4.3.3 Synthetic data experiment

The idea of constructing synthetic data to validate the truthfulness of an XAI method is that we have full control over the ground-truth features the model learned for its prediction; therefore, the ground-truth features are also the ground truth for the model decision rationale we want the explanation to capture. We can then assess the agreement between the explanation and the ground-truth features using the same plausibility measure as detailed in §4.4.1.

For multi-modal medical image tasks, according to the multi-modal medical image interpretation pattern identified in our user study (U4), we categorize the ground-truth explanation information into: 1. the relative importance of each modality to the prediction (i.e., modality importance in §4.3.2); and 2. the localization of the modality-specific features. We constructed a synthetic multi-modal brain MRI dataset for the glioma grading task with these two types of ground-truth information corresponding to the prediction label.

Specifically, to control the ground truth of feature localization, we use a GAN-based (generative adversarial network) tumor synthesis model developed by [Kim2021] to generate two types of tumors and their segmentation masks, mimicking lower- and high-grade gliomas by varying their shapes (round vs. irregular [Cho2018]).

To control the ground truth of modality importance, inspired by [pmlr-v80-kim18d], we set the tumor features on the T1C modality to have 100% alignment with the ground-truth label, and those on FLAIR to have 70% alignment, i.e., the tumor features on FLAIR correspond to the correct label with 70% probability. The remaining modalities have a modality importance value of 0, as they are designed to contain no class-discriminative features. The model may learn to pay attention to the less noisy T1C modality, the noisier FLAIR modality, or both. To determine their relative importance as the ground-truth modality importance, we test the well-trained model on two test sets:

T1C dataset: the dataset shows tumors only (without brain background) on all modalities, and the tumor shape has 100% alignment with the ground truth on the T1C modality and 0% alignment on FLAIR.

FLAIR dataset: it has the same settings, differing only in that the tumor shape has 100% alignment with the ground truth on the FLAIR modality and 0% alignment on T1C.

The test accuracies on these two sets indicate the degree of model reliance on the respective modality to make predictions, and we use them as the ground-truth modality importance. In this way, we constructed a model with a known ground-truth modality importance of 1 for T1C and 0 for the remaining modalities. We then calculate the plausibility metric as the measure of truthfulness on the synthetic data.
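The alignment-probability design for the class-discriminative tumor shape can be sketched as a toy example below; the actual tumor images and masks are produced by the GAN-based synthesis model:

```python
import numpy as np

rng = np.random.default_rng(0)

def modality_shape_labels(y, align_prob):
    """Which tumor shape (0 = round/LGG-like, 1 = irregular/HGG-like) to draw on a
    modality, aligned with the class label y with probability align_prob."""
    aligned = rng.random(len(y)) < align_prob
    return np.where(aligned, y, 1 - y)

y = rng.integers(0, 2, size=1000)              # class labels: LGG = 0, HGG = 1
t1c_shapes = modality_shape_labels(y, 1.0)     # T1C: 100% aligned with the label
flair_shapes = modality_shape_labels(y, 0.7)   # FLAIR: 70% aligned with the label
# T1 and T2 carry no class-discriminative tumor shape (e.g., drawn at random),
# so their designed modality importance is 0.
```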

4.4 Evaluating G4: Informative plausibility

Given an XAI method that meets G3: Truthfulness, to further validate whether clinical users can use their own assessment of explanation plausibility to judge decision quality and identify potential errors and biases, we next assess whether the human plausibility assessment is informative. We do so in two steps: 1) proposing a novel plausibility metric, modality-specific feature importance (MSFI), for the multi-modal explanation task that bypasses physicians’ manual assessment; and 2) testing the correlation between the plausibility metric and decision quality metrics.

4.4.1 Quantifying plausibility

Figure 3: Illustration of the novel modality importance correlation and MSFI metrics on multi-modal medical image explanation.

To quantify how reasonable the explanation is to human judgment, and to facilitate the subsequent validation of using such plausibility information for AI decision verification, we used an existing metric, feature portion (FP), and proposed a novel metric, modality-specific feature importance (MSFI), designed for multi-modal medical image explanation based on its clinical requirements (§4.3.3). Both metrics quantify the agreement of the heatmap-highlighted regions with human prior knowledge.

FP assesses, among the highlighted regions in the heatmap, how many of them agree with human prior knowledge. It is calculated as:

$$\text{FP} = \frac{\sum_i \mathbb{1}(m_i > 0) \, s_i}{\sum_i s_i} \tag{2}$$

where $S$ is a heatmap, with $s_i$ denoting its value at spatial location $i$; $M$ is the human-annotated feature mask, with $m_i$ outlining the spatial location of the feature; and $\mathbb{1}$ is the indicator function that selects the heatmap values inside the feature mask.

To abstract the clinical requirements for multi-modal medical image explanation (U4. Multi-modal medical image interpretation and clinical requirements for its explanation), we propose a novel plausibility metric, MSFI, for multi-modal explanation (Fig. 3). It combines the assessment of feature localization with modality prioritization by multiplying FP with the modality importance value modality-wise. Specifically, MSFI is the portion of heatmap values inside the feature localization mask for each modality $m$, weighted by the modality importance $\text{MI}_m$, which is normalized to $[0, 1]$ to have a comparable range with FP.

$$\widehat{\text{MSFI}} = \sum_m \text{MI}_m \, \frac{\sum_i \mathbb{1}(m_i^m > 0) \, s_i^m}{\sum_i s_i^m} \tag{3}$$
$$\text{MSFI} = \frac{\widehat{\text{MSFI}}}{\sum_m \text{MI}_m} \tag{4}$$

where $s_i^m$ and $m_i^m$ are the heatmap value and feature mask at location $i$ of modality $m$, $\widehat{\text{MSFI}}$ is the unnormalized metric, and MSFI is the normalized metric in $[0, 1]$. A higher MSFI score indicates a heatmap is more agreeable with clinical prior knowledge regarding capturing the important modalities and their localized features. MSFI can be regarded as a general form of FP that generalizes the feature portion calculation from single-modality to multi-modality images.
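Following Eqs. (2)-(4), a sketch of the FP and MSFI computation for a multi-modal heatmap could look as follows (the dictionary-based inputs and the handling of negative MI values are illustrative assumptions):

```python
import numpy as np

def feature_portion(heatmap, mask):
    """FP: portion of (positive) heatmap values that fall inside the feature mask."""
    s = np.clip(heatmap, 0, None)
    total = s.sum()
    return float((s * (mask > 0)).sum() / total) if total > 0 else 0.0

def msfi(heatmaps, masks, modality_importance):
    """MSFI: modality-wise FP weighted by modality importance (MI), normalized to [0, 1].

    heatmaps, masks: dicts mapping modality name -> array of the same spatial shape.
    modality_importance: dict mapping modality name -> MI score (negatives clipped to 0)."""
    mi = {m: max(v, 0.0) for m, v in modality_importance.items()}
    unnormalized = sum(mi[m] * feature_portion(heatmaps[m], masks[m]) for m in heatmaps)
    denom = sum(mi.values())
    return unnormalized / denom if denom > 0 else 0.0
```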

Instead of asking physicians to manually assess plausibility for a few explanations (the questionnaire in Fig. 2 demonstrates such a process), whose ratings may be susceptible to bias (U5.2. Bias and limitation of physicians’ quantitative rating), quantifying plausibility bypasses humans’ manual assessment, standardizes and automates the assessment process, and can assess multiple XAI methods using one set of annotated data.

In addition, although plausibility quantification requires annotations to represent human prior knowledge, the annotations may not necessarily need to be as exact as feature segmentation masks, because MSFI and FP only penalize regions outside the annotation mask (in comparison, we did not use the intersection-over-union (IoU) metric commonly used in computer vision: unlike MSFI or FP, which penalize only false positives, IoU also penalizes false negatives, which requires the annotations to be exact). Therefore, the annotation can be in the form of segmentation masks, bounding boxes, or landmarks. In our evaluation, we used tumor segmentation masks for the glioma task and bounding boxes for the knee task. The annotations may not even need to be created by humans: they can be generated by training an AI model on a few annotated data points, or by using trained models for feature segmentation/localization tasks.

4.4.2 Testing for plausibility informativeness

The indispensable step after plausibility quantification is to validate the clinical utility of using explanations to verify AI decision quality. We measure AI decision quality using the model prediction correctness on the two classification tasks, and the output probability. We then test the correlation between the prediction probability and plausibility, and test whether plausibility is equally distributed between the correctly and incorrectly predicted groups.
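Given per-case plausibility scores (e.g., MSFI) alongside the model outputs, the informativeness test can be sketched as follows (function and variable names are illustrative):

```python
import numpy as np
from scipy.stats import mannwhitneyu, pearsonr

def plausibility_informativeness(msfi, pred_prob, correct):
    """Test whether plausibility (MSFI) tracks AI decision quality.

    msfi:      per-case plausibility scores, shape (N,)
    pred_prob: predicted probability of the predicted class, shape (N,)
    correct:   whether each prediction was correct, shape (N,)"""
    msfi = np.asarray(msfi)
    correct = np.asarray(correct, dtype=bool)

    # 1) Correlation between plausibility and prediction confidence.
    r, r_p = pearsonr(msfi, np.asarray(pred_prob))

    # 2) Do correct and incorrect predictions have different plausibility distributions?
    u, u_p = mannwhitneyu(msfi[correct], msfi[~correct], alternative='two-sided')
    return {'pearson_r': r, 'pearson_p': r_p, 'mannwhitney_U': u, 'mannwhitney_p': u_p}
```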

4.5 Evaluating G5: Computational efficiency

We recorded the computational time to generate each heatmap on a computer with 1 GTX Quadro 24 GB GPU and 8 CPU cores, and on a computing cluster with similar hardware configurations.
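Recording the explanation generation time can be as simple as the sketch below, where `explain_fn` stands for any attribution method wrapped to take one input and return one heatmap (an assumed interface):

```python
import time
import numpy as np

def time_explanations(explain_fn, inputs, warmup=1):
    """Wall-clock time (seconds) to generate one heatmap per input."""
    for x in inputs[:warmup]:          # warm-up runs (e.g., GPU initialization)
        explain_fn(x)
    times = []
    for x in inputs:
        start = time.perf_counter()
        explain_fn(x)
        times.append(time.perf_counter() - start)
    return np.mean(times), np.std(times)
```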

5 Evaluation results

We report evaluation results on whether the 16 commonly-used heatmap methods are clinically feasible, i.e., whether they fulfill the guidelines on the two clinical tasks with multi-modal medical images. All results are reported on the test dataset.

Both G1 and G2 are qualitative assessments of the clinical applicability of the general heatmap explanation form, and are not specific to any particular heatmap method. In contrast, the remaining guidelines G3-G5 are quantitative assessments specific to each heatmap method. Moreover, although G4 Informative plausibility and G2 Clinical relevance both concern human interpretation of the explanation, plausibility focuses on the content of an individual explanation, whereas G2 Clinical relevance assesses the explanation form shared by a group of XAI methods.

5.1 Evaluating G1 Understandability and G2 Clinical relevance

In our user study, although physicians did not express difficulty in understanding the meaning of the heatmap as the important regions for AI prediction (G1: Understandability is met), the heatmap explanation was not completely clinically relevant, as physicians were perplexed by the highlighted areas regardless of whether these areas aligned with their prior knowledge. This may be because the heatmap explanation performs only half of the clinical image interpretation process, namely feature localization: it lacks a description of the important features, let alone reasoning on these features (U3.1. Limitations of existing heatmap explanation). Therefore, heatmap explanation only partially fulfills G2 Clinical relevance.

XAI method | diffAUC [-1, 1] (Glioma) | diffAUC [-1, 1] (Knee) | MI correlation [-1, 1] (Glioma) | MI correlation [-1, 1] (Knee) | MSFI [0, 1] (Synthetic glioma)
Deconvolution | 0.11±0.21 | -0.08±0.06 | 0.73±0.39 | -0.60±0.53 | 0.04±0.02
DeepLift | 0.19±0.14 | NaN | 0.53±0.45 | NaN | 0.22±0.23
Feature Ablation | 0.30±0.15 | -0.03±0.01 | 0.27±0.44 | 0.33±0.60 | 0.19±0.23
Feature Permutation | 0.05±0.05 | NaN | NaN | NaN | 0.08±0.07
GradCAM | 0.16±0.19 | NaN | NaN | NaN | 0.02±0.02
Gradient | 0.05±0.09 | -0.10±0.02 | 0.47±0.16 | -0.47±0.50 | 0.19±0.13
Gradient Shap | 0.17±0.12 | -0.07±0.03 | 0.53±0.40 | -0.60±0.33 | 0.22±0.19
Guided BackProp | 0.21±0.24 | -0.08±0.04 | 0.80±0.27 | -0.47±0.50 | 0.49±0.21
Guided GradCAM | 0.26±0.25 | NaN | 0.81±0.26 | NaN | 0.42±0.29
InputXGradient | 0.17±0.12 | -0.12±0.03 | 0.87±0.16 | -0.47±0.50 | 0.23±0.14
Integrated Gradients | 0.17±0.12 | -0.12±0.03 | 0.73±0.39 | 0.20±0.65 | 0.22±0.19
Kernel Shap | 0.26±0.16 | 0.02±0.01 | NaN | 0.07±0.80 | 0.08±0.08
Lime | 0.37±0.08 | 0.04±0.02 | 0.53±0.58 | 0.73±0.33 | 0.05±0.07
Occlusion | 0.13±0.15 | -0.06±0.02 | 0.60±0.33 | 0.07±0.80 | 0.22±0.25
Shapley Value Sampling | 0.35±0.04 | 0.01±0.03 | 0.47±0.65 | 0.20±0.78 | 0.10±0.10
Smooth Grad | 0.29±0.25 | -0.10±0.03 | 0.67±0.00 | -0.47±0.50 | 0.03±0.02
Table 2: Evaluation results on Guideline 3 Truthfulness. The table shows mean ± std for each XAI algorithm on the different evaluation metrics: diffAUC corresponds to the cumulative feature removal test, MI correlation to the modality importance evaluation, and MSFI to the synthetic data experiment. For all metrics, a higher value is better, and the range of each metric is indicated in brackets. “NaN” in the glioma task means the heatmap is not modality-specific and the correlation is not computable; “NaN” in the knee task means the XAI method was not included in that evaluation. XAI methods are listed in alphabetical order.
XAI method | p value (Glioma, Knee) | MSFI [0, 1] (Glioma) | MSFI [0, 1] (Knee) | FP [0, 1] (Glioma) | FP [0, 1] (Knee)
Deconvolution | NS, NS | 0.26±0.23 | 0.23±0.05 | 0.17±0.17 | 0.23±0.05
DeepLift | NaN | 0.54±0.34 | NaN | 0.43±0.32 | NaN
Feature Ablation |  | 0.48±0.30 | 0.17±0.05 | 0.35±0.28 | 0.17±0.06
Feature Permutation | NS, NaN | 0.23±0.26 | NaN | 0.13±0.18 | NaN
GradCAM | NaN | 0.04±0.03 | NaN | 0.02±0.01 | NaN
Gradient | NS, NS | 0.34±0.23 | 0.24±0.05 | 0.20±0.16 | 0.25±0.05
Gradient Shap | NS | 0.48±0.31 | 0.23±0.05 | 0.36±0.28 | 0.23±0.05
Guided BackProp | NS, NS | 0.48±0.33 | 0.26±0.05 | 0.34±0.29 | 0.25±0.05
Guided GradCAM | NaN | 0.50±0.36 | NaN | 0.37±0.31 | NaN
InputXGradient | NS | 0.51±0.32 | 0.24±0.05 | 0.40±0.30 | 0.24±0.05
Integrated Gradients | NS | 0.48±0.31 | 0.23±0.06 | 0.36±0.28 | 0.23±0.06
Kernel Shap |  | 0.28±0.25 | 0.16±0.05 | 0.18±0.20 | 0.16±0.05
Lime |  | 0.24±0.21 | 0.17±0.05 | 0.14±0.16 | 0.17±0.05
Occlusion | NS | 0.28±0.26 | 0.20±0.06 | 0.18±0.19 | 0.21±0.06
Shapley Value Sampling |  | 0.38±0.24 | 0.17±0.06 | 0.25±0.21 | 0.17±0.06
Smooth Grad | NS | 0.27±0.17 | 0.24±0.05 | 0.16±0.12 | 0.24±0.05
Table 3: Evaluation results on Guideline 4 Informative plausibility. P values are from the Mann–Whitney U test on the MSFI metric between the correctly and incorrectly predicted groups; NS denotes not significant. The table also shows mean ± std for each XAI algorithm on the plausibility metrics, with metric ranges indicated in brackets. “NaN” in the knee task means the XAI method was not included in that evaluation. XAI methods are listed in alphabetical order.
XAI method | Glioma (s) | Synthetic glioma (s) | Knee (s)
Deconvolution | 2.1±1.2 | 1.3±0.0 | 2.6±2.1
DeepLift | 4.6±2.0 | 2.2±0.0 | NaN
FeatureAblation | 82±25 | 58±1.5 | 98±102
FeaturePermutation | 10.1±2.1 | 15.2±0.4 | NaN
GradCAM | 0.7±0.3 | 0.3±0.0 | NaN
Gradient | 2.2±1.3 | 1.1±0.0 | 2.6±2.2
GradientShap | 7.8±3.3 | 5.0±0.1 | 2.8±2.2
GuidedBackProp | 2.1±1.2 | 0.9±0.0 | 2.3±1.7
GuidedGradCAM | 2.8±1.5 | 1.2±0.0 | NaN
InputXGradient | 2.1±1.2 | 1.1±0.0 | 2.6±2.2
IntegratedGradients | 67±34 | 49±0.9 | 113±79
KernelShap | 243±87 | 93±1.6 | 382±388
Lime | 449±141 | 154±2.6 | 507±523
Occlusion | 171±321 | 27±3.5 | 672±255
ShapleyValueSampling | 2205±693 | 1595±228 | 1990±2021
SmoothGrad | 14.4±6.8 | 9.5±0.1 | 24.1±16.7
Table 4: Evaluation results on Guideline 5 Computational efficiency. We report the mean ± std time in seconds to generate a heatmap for one data point. “NaN” in the knee task means the XAI method was not included in that evaluation. The XAI methods are listed in alphabetical order.

5.2 Evaluating G3: Truthfulness

The evaluation results on G3 Truthfulness are shown in Table 2. For the cumulative feature removal test, the diffAUC scores were around 0, which indicates the examined XAI methods did not differ from their random feature removal baselines. Even for the methods with the highest diffAUC scores (Fig. 4), the diffAUC scores were not stable and had a large variance among the five similarly trained models.

The diffAUC metric is a global metric that tests the AI model as a whole, whereas the other evaluation metrics are computed on individual data points. For the data-wise metrics, except MI correlation on the glioma task, the remaining metrics showed a similar trend to diffAUC and fell in the lower range. In addition, both the data-wise and the model-wise metrics tended to have large variances, suggesting the instability of the examined XAI methods in identifying the truly important features for the models’ predictions. The MI correlation was relatively high on the glioma task, indicating that most XAI methods could reflect the model’s modality importance at the modality level, but not at the feature level, given the low diffAUC scores. For the knee task, the methods did not perform well in identifying either the important features (low diffAUC) or the important modalities (low MI correlation).

For the synthetic data experiment on the glioma task, the MSFI scores were generally low, with Guided BackProp and Guided GradCAM outperforming the other XAI methods. However, their outperformance was specific to the synthetic dataset and did not generalize to the original glioma or knee task.

Based on the results, the examined XAI methods failed to meet G3 Truthfulness.

Figure 4: Cumulative feature removal test for the evaluation on G3 Truthfulness. The solid line is the model performance deterioration for an XAI method, and the dashed line is its random feature removal baseline. The random baseline tests were repeated 15 times in our experiments, thus the dashed line also has its 95% confidence interval indicated as translucent error bands. We show plots of the XAI methods with the highest and lowest diffAUC scores from model 1 among the five trained models for both clinical tasks.

5.3 Evaluating G4: Informative plausibility

Figure 5: Results on Guideline 4: testing for plausibility informativeness. The violin plots show the distribution of the MSFI plausibility score for the correct (blue, left) and incorrect (orange, right) predictions on the glioma (top) and knee task (bottom). The x-axis is each heatmap method; the y-axis is the MSFI measure, with a higher score indicating a heatmap is more agreeable with clinical prior knowledge on modality prioritization and feature localization. Based on visual inspection, the correct and incorrect predictions did not differ in their MSFI measure, indicating the plausibility assessment failed to reveal model decision quality and was not informative for identifying potential errors of the model.

5.3.1 Quantifying plausibility

Physicians’ average quantitative rating of heatmap quality had a higher Pearson’s r correlation with MSFI (0.45) than with FP (0.36). Physicians’ inter-rater agreement on the heatmap quality assessment was low: Krippendorff’s Alpha was 0.464 (below the 0.667 cutoff [krippendorff2004content]), and Fleiss’ kappa was 0.008 (with 1 for perfect agreement and 0 for poor agreement). This indicates that doctors’ judgment of heatmap quality can be very subjective, which aligns with the qualitative findings in U5.2. Bias and limitation of physicians’ quantitative rating. Given this, we resorted to quantifying the human assessment of explanation plausibility using the MSFI score, which had the higher correlation with physicians’ ratings.

5.3.2 Testing for plausibility informativeness

Since G3 Truthfulness is the prerequisite for G4 on plausibility informativeness, and none of the examined XAI methods passed the evaluation on G3, it is less meaningful to assess plausibility informativeness. Nevertheless, we report the evaluation results for reference.

We computed the Pearson’s r correlation between prediction output probabilities and the plausibility measure MSFI. For both glioma and knee tasks, there were negligible correlations: 0.10 for the glioma task, and -0.07 for the knee task. Given that the softmax prediction confidence is poorly calibrated [DBLP:journals/corr/GuoPSW17], it may not be a good indicator for model decision quality.

We then resorted to model prediction correctness as the definitive indicator of decision quality. Using the Mann-Whitney U test, we tested the hypothesis that the MSFI metric is equally distributed in the correctly and incorrectly predicted data groups; the significance level for each XAI algorithm is shown in Table 3. For some XAI methods, such as Feature Ablation, although the statistical test showed a significant difference in the MSFI metric between the correct and incorrect predictions on the knee task, further inspection of their distributions (Fig. 5) shows that the ranges of the correct and incorrect predictions largely overlapped. Moreover, the wrong predictions even had a slightly higher mean MSFI, which would mislead users who rely on plausibility assessment to examine AI decision correctness.

Based on the results on testing for plausibility informativeness, the examined XAI methods did not meet G4 Informative plausibility.

5.4 Evaluating G5: Computational efficiency

The computational time spent generating a heatmap is shown in Table 4. The speed of generating a heatmap was stable across the three datasets with different image dimensions (2D and 3D) and model architectures. Some gradient-based methods that rely solely on backpropagation can generate a nearly instant explanation, which enables their clinical use in real-time interactive XAI systems. For some gradient-based and all perturbation-based methods that require multiple rounds of sampling, generating a heatmap takes from seconds to tens of minutes. Methods such as Lime or Shapley Value Sampling may take roughly 7 to 30 minutes to generate a heatmap; depending on the specific use case and XAI method parameter settings, the long wait time may prevent their clinical use.

6 Discussions

6.1 Evaluated heatmap methods failed to meet the Clinical XAI Guidelines

We conducted a systematic evaluation of 16 commonly-used heatmap methods following the Clinical XAI Guidelines. Although the heatmap explanations were easily understandable to clinical users (G1), they only partially fulfilled G2 Clinical relevance, because the heatmap misses the description of feature pathology that is part of the clinical image interpretation process (§5.1). The examined heatmap methods did not reliably exhibit the property of G3 Truthfulness across multiple models in the two clinical tasks. Due to the failure on G3, the G4 test for informative plausibility also yielded poor scores. Most heatmap methods were computationally efficient with respect to G5, generating a heatmap within seconds, except for some sampling-based methods such as Shapley Value Sampling, which may take more than 20 minutes.

The evaluation results show that the evaluated 16 post-hoc heatmap methods were not technically sound enough to meet the Clinical XAI Guidelines.

6.2 Use of the Clinical XAI Guidelines

Our systematic evaluation demonstrated the use of the guidelines in the evaluation of XAI in two clinical tasks. Specifically, going back to Alex’s questions at the beginning, to apply the guidelines to a clinical XAI problem for XAI method selection or proposal, AI designers like Alex may first talk to their target clinical users or other stakeholders to understand their AI literacy (G1 Understandability) and their clinical reasoning process, which relates to how they interpret explanations (G2 Clinical relevance). Based on this conversation, AI designers should have a clearer idea about which form(s) of explanation to target.

For the targeted form of explanation, such as the feature attribution map, there may be multiple XAI algorithms that can generate it. To design or select the optimal XAI algorithm for the target explanation form, AI designers may choose suitable metrics to assess and optimize XAI methods on the G3 Truthfulness measure. AI designers may also need to test the truthfulness metrics of an XAI algorithm on multiple trained AI models, to examine the robustness of the XAI method in truly reflecting the model decision process.

For the XAI method candidates that pass the truthfulness assessment, to validate whether the explanation is clinically useful in alerting physicians to potential AI decision flaws, AI developers may further test this property of the candidates (G4 Informative plausibility). To do so, AI designers can ask clinical users which features or criteria they use to judge the plausibility of an explanation, and select computational metrics and prepare data annotations based on these plausibility quantification criteria. Then AI developers can test the correlation between plausibility and decision quality.

AI designers may also need to record the G5 Computational efficiency of the XAI method candidates to rule out the ones that do not meet the speed and computational resource requirement in clinical deployment.

7 Limitations and future work

The Clinical XAI Guidelines focus on the general clinical requirements for AI explanation. Some task-dependent requirements for XAI methods, such as data privacy protection, were not included in the guidelines. They can be add-on requirements in addition to the guideline criteria for specific clinical tasks.

In the G4 test for informative plausibility, we used a statistical test on the distribution difference, which is non-directional and may be overly sensitive. Future work may investigate novel evaluation methods and metrics to assess plausibility informativeness regarding its ability to reveal AI decision quality, such as developing new metrics to test against decision correctness, or assessing plausibility informativeness via model decision uncertainty estimation. In addition to testing informativeness regarding model decision errors, as in our evaluation, future work is needed to develop assessment methods and metrics on plausibility informativeness in revealing decision biases.

8 Conclusion

In this work, we propose the Clinical XAI Guidelines to support the design and evaluation of clinically-oriented XAI systems. The proposal of the guidelines was based on a dual understanding of the clinical requirements for explanations from our physician user study, and of the technical considerations from our previous XAI evaluation studies and the XAI literature. Guidelines G1 Understandability and G2 Clinical relevance provide clinical insights for the selection of explanation forms. Guidelines G3 Truthfulness, G4 Informative plausibility, and G5 Computational efficiency incorporate the clinical requirements on explanation as clear technical objectives to be optimized for.

Based on the guidelines, we conducted a systematic evaluation of 16 commonly-used heatmap methods. The evaluation focused on a technically novel and clinically pervasive problem of multi-modal medical image explanation, with two clinical tasks: brain tumor grading and knee lesion identification. We proposed a novel metric, MSFI, for multi-modal medical image explanation tasks to bypass physicians' manual assessment of explanation plausibility. The evaluation results showed that the evaluated heatmap methods failed to fulfill G3 and G4 and were therefore not suitable for clinical use. The evaluation demonstrates the use of the Clinical XAI Guidelines in real-world clinical tasks to facilitate the design and evaluation of clinically-oriented XAI.

Acknowledgements

We thank all physician participants in our user study. We thank Sunho Kim, Yiqi Yan, Mayur Mallya, and Shahab Aslani for their generous support and helpful discussions. This study was funded by BC Cancer Foundation-BrainCare BC Fund, and was enabled in part by support provided by NVIDIA and Compute Canada (www.computecanada.ca).

Appendix: Clinical Explainable AI Guidelines

In an effort to guide the design and evaluation of clinical XAI to meet both clinical and technical requirements, we present a checklist of five canonical criteria that we believe may serve as guidelines for developing clinically-oriented XAI. The guidelines were developed as a collective effort from both the clinical and technical sides (with complementary expertise in AI, human factors analysis, and clinical practice), and were motivated and supported by findings from our physician user study, pilot XAI evaluation experiments [aaai2022, DBLP:journals/corr/abs-2107-05047], and the literature. We sought feedback from two physicians and several medical image analysis researchers as a heuristic evaluation of the guidelines.

To acquire physicians’ requirements for clinical XAI, we conducted a user study with 30 neurosurgeons to elicit their clinical requirements using a clinical XAI prototype. The low-fidelity prototype is a clinical decision-support AI system that provides suggestions from a CNN model to differentiate lower-grade gliomas from high-grade ones based on multi-modal MRI. For each AI suggestion, it also shows a heatmap explanation that highlights the features important for the model prediction. The user study consisted of an online survey that embedded the XAI prototype and collected physicians’ quantitative ratings of the heatmaps, and an optional post-survey interview in which physicians commented on the clinical XAI system. Five physicians participated in the interview, and seven physicians provided comments in the survey by answering open-ended questions. We analyzed the qualitative data collected from the interview sessions and the open-ended survey questions as the main support for developing the guidelines from the clinical aspect. The detailed user study method and findings are in Supplementary Material S1; its supporting sections are referenced in the guidelines with labels starting with ‘U’.

Next, we present the Clinical XAI Guidelines, which consist of five evaluation objectives for which a clinical XAI technique should be optimized. They are grouped into three categories of considerations: clinical usability, evaluation, and operation. For each objective, we list its key references from our user study or the literature, analyze examples that follow the objective and/or counterexamples that violate it, and describe ways of assessment to help identify whether the objective is met. The guidelines and their key points are summarized in Table 1.

8.1 Clinical usability considerations

Guideline 1: Understandability.

The form and context of an explanation should be easily understandable by its clinical users. Users should not need technical knowledge of machine learning, AI, or programming to interpret the explanation.

  • Example:

    Physicians found the feature attribution maps (heatmaps) used in our user study easily understandable. Other explanation forms for medical image analysis tasks, such as similar examples [10.1145/3290605.3300234], counterfactual examples, and scoring (linear feature attribution) or rule-based explanations, have also been shown to be understandable in prior physician user studies in the literature. [jin2021euca] summarized 12 end-user-friendly explanation forms that do not require technical knowledge, including feature-based (feature attribution, feature shape, feature interaction), example-based (similar, prototypical, and counterfactual examples), rule-based (rules, decision trees), and supplementary information (input, output, performance, dataset). In addition to explanations that reveal the model decision process, physicians in our user study also required other information that makes the AI model transparent, such as model performance, training dataset, and prediction confidence (U3.3: Making AI transparent by providing information on performance, training dataset, and decision confidence). An XAI system may use one or a combination of multiple explanation forms that are friendly to clinical users.

  • Counterexample:

    A counterexample of understandability is explaining by visualizing the learned representations of neurons in a DNN [Olah2017]. Although the form of neuron visualization as images is intuitive to look at, interpreting the images requires users to have prior knowledge of DNN models and neurons to understand the context of the visualization.

  • Assessment method:

    To assess whether the understandability objective is met, AI designers can conduct a self-assessment of an XAI technique to inspect its AI knowledge prerequisites, conduct a pilot physician usability study, or have informal conversations with clinical users to understand their minimal AI literacy, and choose proper explanation techniques accordingly. Low-fidelity prototypes such as sketches can be used as a quick trial-and-error tool and can help clinical users better envision an explanation in a clinical context. [jin2021euca] provides prototyping support for identifying clinically user-friendly explanations.

Guideline 2: Clinical relevance.

The way physicians use explanations is to inspect the AI-based evidence provided by the explanation and incorporate that evidence into their clinical reasoning process to assess its validity (U2.2. Resolving disagreement). To make XAI clinically useful, the explanation should be relevant to physicians’ clinical decision-making patterns and should support their clinical reasoning process.

For diagnostic/predictive tasks on clinical images, physicians’ image interpretation process includes two steps: 1) feature extraction: physicians first perform pattern recognition to localize key features and identify their pathology; 2) reasoning on the extracted features: physicians perform medical reasoning and construct diagnostic hypotheses (differential diagnoses) based on the image feature evidence. A clinically relevant explanation needs to align with this process, so that physicians can incorporate the explanation into their medical image interpretation process (U3. Clinical requirements of explainable AI).

“What (explanation) we get currently, when a radiologist read it, they point out the significant features, and then they integrate those knowledge, and say, to my best guess, this is a GBM. And I have the same expectations of AI (explanation).” (N3)

  • Example:

    In the user study, physicians envisioned ideal explanations that are clinically relevant (U3.2. Desirable explanation), such as explanations using radiologists’ language, a linear scoring model, or a rule-based explanation. These explanations are composed of clinically meaningful features, and their form of text, rules, or a linear model corresponds to the second step, reasoning on the extracted features, in the clinical image interpretation process above.

  • Counterexample:

    The heatmap explanation is not fully clinically relevant: physicians were perplexed by the highlighted areas regardless of whether the areas aligned with their prior knowledge. This is because the heatmap explanation performs only half of step 1) of clinical image interpretation, namely feature localization; it lacks a description of the important features, let alone reasoning on those features (U3.1. Limitations of existing heatmap explanation).

    “Though the heatmap is drawing your eyes to many different spots, but I feel like I didn’t understand why my eyes were being driven to those spots, like why were these very specific components important? And I think that’s where all my confusion was.” (N2)

  • Assessment method:

    Physician user studies should be conducted to understand the clinical decision-making pattern for the target task, to inspect whether the explanation corresponds to that pattern, and to check whether it helps physicians answer their questions about the rationale of the model decision.

8.2 Evaluation considerations

Guideline 3: Truthfulness.

An explanation should truthfully reflect the model decision process. This is the fundamental requirement for a clinically-oriented explanation, and an explanation method should fulfill the truthfulness requirement before the other evaluation requirements in the guidelines, such as G4 Informative plausibility.

  • Counterexample:

    One of the main clinical utilities of an explanation is that clinical users intuitively use the plausibility assessment of the explanation (G4) to verify AI decisions for a case, decide whether to accept or reject the AI suggestion, and accordingly calibrate their trust in the AI’s prediction on that case, or in the AI model in general (U2.3). Users do so under the implicit assumption that explanations truly represent the model decision process. If the truthfulness criterion is violated, two consequences may occur during human assessment of explanation plausibility (G4):

    1. Clinical users may wrongly reject the AI’s correct suggestion merely because of the poor performance of the XAI method, which produces an unreasonable explanation.

    2. If an XAI method is wrongly proposed or selected based on the explanation plausibility objective alone, then rather than helping clinical users verify decision quality, the explanation may be optimized to deceive clinical users with a seemingly plausible explanation despite a wrong prediction from the AI [DBLP:journals/corr/abs-2006-04948], as illustrated by physician participant N1’s quote:

    “If a system made its prediction based upon these areas (outside the tumor), I would definitely not trust that system, but I would be very reassured that the system is telling me that. …So I’m less likely to use this model, but I’m more likely to use a model that does a better job than this, because I am reassured that when I see that better model, that I will be able to have access to that back-end explanation. ” (N1)

  • Assessment method:

    As stated in [jacovi-goldberg-2020-towards], the truthfulness or faithfulness objective cannot and should not be assessed by human judgment of explanation quality or by annotations of human prior knowledge, because humans do not know the model’s underlying decision process.

    The most common way in the literature to assess explanation truthfulness for feature attribution XAI methods is to gradually add or remove features, from the most to the least important according to the explanation, and measure the change in model performance [DBLP:journals/corr/abs-2104-08782, NEURIPS2019_a7471fdc, DBLP:conf/nips/HookerEKK19, 7552539, Lundberg2020, 10.5555/3327757.3327875]. Another way is to construct synthetic evaluation datasets in which the ground-truth knowledge of the model decision process, from input features to prediction, is known and controlled [doshivelez2017rigorous, pmlr-v80-kim18d, DBLP:journals/corr/abs-1806-00069]. A minimal sketch of the first approach follows.
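The sketch below is an illustrative deletion-style score, not the exact metrics used in our evaluation; it assumes a PyTorch classifier and a heatmap with the spatial shape of one image channel.

    import numpy as np
    import torch

    def deletion_curve_auc(model, image, heatmap, target_class, steps=20, baseline=0.0):
        """Remove the most important pixels/voxels (per the heatmap) step by step
        and track the model's probability for the target class. A truthful
        explanation should make the probability drop quickly, i.e., yield a
        small average probability over the removal steps.

        image   -- tensor of shape (C, H, W) or (C, D, H, W), no batch dimension
        heatmap -- numpy array matching the spatial shape of one channel of `image`
        """
        order = np.argsort(-heatmap.flatten())            # most important first
        chunk = max(1, order.size // steps)
        perturbed = image.clone()
        probs = []
        for step in range(steps + 1):
            with torch.no_grad():
                logits = model(perturbed.unsqueeze(0))
                probs.append(torch.softmax(logits, dim=-1)[0, target_class].item())
            if step < steps:
                idx = order[step * chunk:(step + 1) * chunk].tolist()
                flat = perturbed.view(perturbed.shape[0], -1)
                flat[:, idx] = baseline                   # remove these locations in all channels
        return float(np.mean(probs))                      # lower is more truthful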

Guideline 4: Informative plausibility.

The ultimate use of an explanation is to be interpreted and assessed by clinical users. Physicians intuitively use the assessment of explanation plausibility or reasonableness (i.e., how reasonable the explanation is based on its agreement with human prior knowledge of the task) as a way to evaluate AI decision quality, and thereby achieve multifaceted clinical utilities with XAI, including verifying AI decisions (U2.3), calibrating trust in AI (U2.3), ensuring the safe use of AI, resolving disagreements with AI (U2.2), identifying potential biases, and making medical discoveries (U2.4). Informative plausibility aims to validate whether an XAI method achieves its utility of helping users identify potential AI decision flaws and/or biases. G3 Truthfulness is the gatekeeper of G4 Informative plausibility, warranting that the explanation truthfully represents the AI decision process.

  • Example:

    In our evaluation, we abstracted physicians’ clinical requirements for multi-modal medical image explanation (U4.) into the MSFI metric. It regards the most plausible heatmap explanation as one that both localizes the important image features on each imaging modality and highlights the modalities that are important for the decision. We evaluated how well the MSFI metric corresponds to physicians’ assessments quantitatively, by calculating the correlation between the two, and qualitatively, by showcasing visual examples. We then inspected the subsequent utility of the MSFI metric for verifying model decisions by measuring its correlation with decision correctness. A simplified sketch of such a plausibility metric and its validation is given after this list.

  • Assessment method:

    To test whether explanation plausibility is informative in helping users identify AI decision errors and biases, AI designers can assess the correlation between AI decision quality measures (such as model performance, calibrated prediction uncertainty, prediction correctness, or quantification of biased patterns) and plausibility measures.

    Since human assessment of explanation plausibility is usually subjective and susceptible to biases (U5.2. Bias and limitation of physicians’ quantitative rating), AI designers may consider quantifying the plausibility measure by abstracting the human assessment criteria into computational metrics for a given task. The quantification of human assessment is NOT meant to directly select or optimize XAI methods for clinical use; rather, XAI methods should be optimized for their truthfulness measures (G3). Plausibility quantification is meant to validate the capability of XAI methods for their subsequent clinical utility of revealing AI decision flaws and/or biases, provided they have a high truthfulness score. Quantifying plausibility can make such a validation process automatic, reproducible, standardizable, and computationally efficient. Similarly, the human annotations of important features according to physicians’ prior knowledge, which are used to quantify plausibility, cannot be regarded as the “ground truth” of an explanation: an explanation (given that it fulfills G3 Truthfulness) is still acceptable even if it is not aligned with human prior knowledge, as long as it reveals the model decision quality or helps humans identify new patterns and make medical discoveries.

    Many approaches have been proposed to quantify explanation plausibility. These measures calculate the agreement between the explanation and human prior knowledge annotations for a given task [10.1007/978-3-030-32226-7_82, netdissect2017]. To evaluate whether a quantified plausibility measure is a good substitute for human assessment, AI designers can use a quantitative measure, by calculating the correlation between the plausibility metric and clinical users’ assessment scores, or a qualitative measure, by showing physicians different explanations together with their plausibility scores and asking them to judge.
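The sketch below illustrates both steps: a simplified, MSFI-like plausibility score (an illustrative stand-in, not the exact MSFI definition used in our evaluation) and a Spearman rank correlation check against physicians' ratings; array shapes and helper names are hypothetical.

    import numpy as np
    from scipy.stats import spearmanr

    def msfi_like_score(heatmap, feature_masks, modality_importance):
        """Simplified, illustrative plausibility score for a multi-modal heatmap.
        It rewards heatmaps whose attribution falls inside the annotated
        important-feature mask of each modality, weighted by how important each
        modality is for the decision (not the exact MSFI definition).

        heatmap             -- array (M, ...): non-negative attribution per modality
        feature_masks       -- array (M, ...): binary masks of important features
        modality_importance -- array (M,): non-negative weights summing to 1
        """
        per_modality = []
        for m in range(heatmap.shape[0]):
            total = heatmap[m].sum()
            inside = (heatmap[m] * feature_masks[m]).sum()
            per_modality.append(inside / total if total > 0 else 0.0)
        return float(np.dot(modality_importance, per_modality))

    def validate_against_physicians(metric_scores, physician_ratings):
        """Check whether the computational plausibility metric is a reasonable
        substitute for physicians' per-case ratings (Spearman rank correlation)."""
        rho, p_value = spearmanr(metric_scores, physician_ratings)
        return rho, p_value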

8.3 Operational consideration

Guideline 5: Computational efficiency.

Since many AI-assisted clinical tasks involve time-sensitive decisions (U1.2.1: Decision support for time-sensitive cases and hard cases), the selection or proposal of clinical XAI techniques needs to consider computational time and resources. The wait time for an explanation should not become a bottleneck in the clinical task workflow.

  • Example:

    In our evaluation, some gradient-based XAI methods that use backpropagation could generate a nearly instant explanation, within 10 seconds. This also enables their clinical use for generating real-time, interactive explanations.

  • Counterexample:

    For XAI techniques that require sampling input-output pairs, the computation may take too long for physicians to wait for an explanation. In our evaluation, it took about 30 minutes for the Shapley Value Sampling method to generate one heatmap on a typical desktop computer with a GPU.

  • Assessment method:

    AI designers can record the computational time and resources of an XAI method to assess whether the computational efficiency requirement is met. AI designers may also need to talk to clinical users to understand whether their clinical task involves time-sensitive decisions, and what their maximum tolerable waiting time for an explanation on the task is. For some XAI methods, the computational time depends on the settings of specific parameters, such as the number and size of feature masks used to generate perturbed samples, or the number of samples. AI designers need to identify the optimal set of parameters to balance explanation accuracy and computational efficiency; a sketch of such a parameter sweep follows.
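The sketch below is for a sampling-based method; the `n_samples` keyword is a hypothetical parameter name standing in for whatever sampling setting the chosen method actually exposes.

    import time
    import numpy as np

    def sweep_sample_budget(xai_method, model, image, sample_counts=(8, 32, 128, 512)):
        """For a sampling-based XAI method, record how runtime and the resulting
        heatmap change with the sampling budget, to pick the smallest budget whose
        explanation has stabilized (illustrative; `n_samples` is hypothetical).
        """
        results, previous = [], None
        for n in sample_counts:
            start = time.perf_counter()
            heatmap = xai_method(model, image, n_samples=n)
            seconds = time.perf_counter() - start
            change = float(np.abs(heatmap - previous).mean()) if previous is not None else None
            results.append({"n_samples": n, "seconds": seconds, "mean_abs_change": change})
            previous = heatmap
        return results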

References