The Role of Explainability in Assuring Safety of Machine Learning in Healthcare

09/01/2021 ∙ by Yan Jia, et al.

Established approaches to assuring safety-critical systems and software are difficult to apply to systems employing machine learning (ML). In many cases, ML is used on ill-defined problems, e.g. optimising sepsis treatment, where there is no clear, pre-defined specification against which to assess validity. This problem is exacerbated by the "opaque" nature of ML where the learnt model is not amenable to human scrutiny. Explainable AI methods have been proposed to tackle this issue by producing human-interpretable representations of ML models which can help users to gain confidence and build trust in the ML system. However, there is not much work explicitly investigating the role of explainability for safety assurance in the context of ML development. This paper identifies ways in which explainable AI methods can contribute to safety assurance of ML-based systems. It then uses a concrete ML-based clinical decision support system, concerning weaning of patients from mechanical ventilation, to demonstrate how explainable AI methods can be employed to produce evidence to support safety assurance. The results are also represented in a safety argument to show where, and in what way, explainable AI methods can contribute to a safety case. Overall, we conclude that explainable AI methods have a valuable role in safety assurance of ML-based systems in healthcare but that they are not sufficient in themselves to assure safety.




1 Introduction

In many domains there are well-established approaches and standards for assuring safety-critical systems and software, e.g., [bell2006introduction]. Assurance means establishing justified confidence in the system for its intended use. The assurance principles underlying these standards include validating that the system works as intended, verifying that the system meets explicit safety requirements (that help to define what is meant by “work as intended”), and ensuring robustness. These assurance principles remain essential for systems employing machine learning (ML) [hawkins2013principles]. However, the details of these approaches and standards can be difficult to apply where systems use ML. One of the reasons for this is the “black box” (opaque) nature of the ML models [watson2019clinical]. The ML community is actively studying “explainability”, that is, methods intended to make the ML models human interpretable.

In healthcare, ML is used on various problems, e.g. learning optimal treatments, or to detect abnormalities in radiology images, where it is hard to derive clear and specific requirements and human performance is often used as a “gold standard”. The current practice is often to (seek to) achieve performance that is better than humans. This makes validation difficult as human performance is highly variable both from individual-to-individual and over time for a single individual. Also, performance will vary from patient-to-patient, e.g., with comorbidities, and clinicians might not agree on the best treatment strategy. To overcome such problems, it is important to use explainability to communicate the underlying workings of the ML models, with the particular aim of assuring the safety and effectiveness of the ML models.

Often explainability is equated with producing explainable artificial intelligence (AI) methods which seek to provide human interpretable representations of ML models [challen2019artificial], but we take a broader view that includes data, since explaining the model inevitably requires explaining the data. Explainable AI methods produce explanations which can be local, i.e., relate to a specific output or prediction of an ML model, or global, i.e., explain the ML model as a whole. Explainable AI methods can therefore, in principle, have a role in validation by giving stakeholders, including clinicians, assurance that the ML model will produce valid predictions beyond the data used in development. We will consider the role of various explainable AI methods in safety assurance. Our focus is on development activities and deployment decisions for ML-based systems. Operations and incident investigation are outside the scope of this paper, although we briefly consider the potential role of explainability in operations.

As well as the problem of opacity, another reason why it is difficult to apply existing safety approaches and standards to systems using ML is that the established approaches are based, implicitly or explicitly, on the V life-cycle model moving from requirements, through design, onto implementation and then testing. In contrast, the development of ML-based systems follows a very different, much more iterative, life-cycle with four main phases: data preparation, model selection, model learning, and model verification & validation. Some emerging standards and guidance, e.g., the US Food and Drug Administration (FDA) proposed regulatory framework on AI/ML-based Software as a Medical Device (SaMD) [us2019proposed], UL 4600 [koopman2019safety] and Assurance of Machine Learning for Autonomous Systems (AMLAS) [hawkins2021guidance], better reflect the ML life-cycle. However, these approaches are not yet mature and our aim is to contribute to their evolution, particularly in the area of safety validation.

The primary contributions of this paper are as follows. First, identifying the ways in which explainable AI methods can contribute to safety assurance of ML-based systems. Second, demonstrating the role of various explainable AI methods using a clinical case study concerned with predicting readiness for extubation from mechanical ventilation. Third, showing the potential use of these methods in supporting a safety case for ML systems when used in healthcare.

The rest of the paper is structured as follows. Section 2 amplifies on some of the concepts set out above and identifies relevant related work. Section 3 outlines the potential role of explainable AI methods in the ML life-cycle. These possibilities are then illustrated in Section 4 using an example of weaning patients from mechanical ventilation. This is followed by a discussion and conclusions in Sections 5 and 6, respectively.

2 Background & Related Work

This section discusses established approaches to assurance of safety-critical systems and identifies their limitations when dealing with systems employing ML. This is followed by an introduction to the concepts of explainability and an overview of the different types of explainable AI methods.

2.1 Established Assurance Approaches and the Challenges of ML

We use the term assurance to mean confidence that the system behaviour is as intended in the environment of use, where as intended includes being safe. In this context, we are interested in assurance of patient safety when ML-based systems are used in a healthcare context.

Most approaches to assurance emphasise verification and validation, although the definitions of the terms can vary. The International Medical Device Regulators Forum (IMDRF) defines the terms as follows:

  • Verification – confirmation through provision of objective evidence that specified requirements have been fulfilled [imdrf2017software];

  • Validation – confirmation through provision of objective evidence that the requirements for a specific intended use or application have been fulfilled [imdrf2017software].

To interpret these definitions we can say that validation is concerned with building the right system, including defining requirements that meet our intent, and that verification is concerned with building the system right by verifying that the system meets these requirements. Verification and validation (V&V) need to encompass safety requirements and, in traditional approaches, a range of safety engineering methods are used to identify derived safety requirements (DSRs) on the system. As a brief summary, traditional safety engineering methods first identify hazards, i.e., undesirable situations that pose risk to life, then determine potential causes of the hazards and estimate risk associated with the hazards. Typically, risk is a combination of the likelihood of the hazard occurring and the severity of the harm arising from the hazard, although the detailed computations vary from domain to domain. Where risks are deemed too high, DSRs are identified to reduce the likelihood of hazard occurrence, e.g. by controlling hazard causes, or to mitigate the consequences of the hazard should it arise. In healthcare, such ideas underpin some of the relevant standards, e.g., [DCB] produced by NHS Digital in the UK, and international standards, e.g. [iso2019].

We have previously investigated how to adapt traditional safety engineering processes to healthcare systems which employ ML. We have shown that, in some cases, it is possible to adapt classical safety methods to identify hazards and then to establish DSRs on the ML elements of systems [jia2021safety]. Clinical judgement is also needed for validation of the DSRs and this enables traditional approaches to assurance to be adapted for ML. Where requirements are not stated explicitly, explainable AI methods can help by providing explanations that enable direct validation of the ML model as a whole, e.g. predictions are based on valid clinical factors and consistent with clinical knowledge.

In many domains, including healthcare, it is accepted good practice for the safety work to culminate in the production of a safety or assurance case, see [DCB]. In general, a safety case is “an argument, supported by evidence, that a system is safe to be deployed in its context of use” [standard2007standard]. It is common to express the argument graphically, e.g., using the Goal Structuring Notation (GSN) [GSNCommu77:online], as a means of making the argument clear and open to scrutiny and review. The evidence underpinning the safety case includes the results of hazard and risk analysis, as well as the outputs from V&V activities, and in this paper we will show that explainable AI methods can also provide such evidence.

There are a number of initiatives concerned with the assurance of ML in safety-critical systems both in healthcare and more generally. For example, AMLAS defines a process for assurance of the safety of ML-based systems to reflect the ML development life-cycle, which identifies both evidence artefacts and argument patterns (standard forms of argument that can be instantiated for a particular system) in GSN. AMLAS also considers issues of the robustness of ML-based systems, e.g., response to unexpected inputs. The FDA framework builds on the IMDRF work on risk categorisation for SaMD [imdrf2014software]. It also proposes a total life-cycle regulatory approach for ML-based SaMD [us2019proposed]. However, neither AMLAS nor the FDA framework consider the role of explainability explicitly. The work we present here is intended to be complementary to, and build on, these approaches.

It is common for safety cases to be produced incrementally and consolidated at the end of development (pre-deployment), or at critical decision-points in the development process, e.g. prior to clinical trials. In this paper, when discussing safety cases we focus on providing assurance to support the decision to make the transition from development to operation and on the role of explainable AI at this stage (but we acknowledge that the scope of safety cases is much broader).

In addition, it is always desirable to consider assurance “through life”, as proposed by the FDA [us2019proposed], not just as an activity undertaken prior to deployment. This includes getting feedback from operations to check whether or not the assumptions made in pre-deployment assurance activities are sound. This is even more important for ML-based systems than it is for “conventional” systems because of the opacity of ML models. There are also other important aspects of using explainable AI methods for operational assurance of ML-based systems, including the need to show compliance with legal frameworks such as the General Data Protection Regulation (GDPR) [goodman2017european]. Further, as performance criteria for ML models tend to give only statistical assurance, e.g., 93% accuracy, explainable AI methods can have an important role in giving concrete insights to system users, e.g., clinicians, related to a specific prediction. Explainable AI methods might also have a role in accident and incident investigation, see [mcdermid2021philtransa] for a discussion, but this is outside the scope of this paper.

2.2 Explainable AI Methods

ML includes a range of different methods such as decision trees, support vector machines and neural networks (NNs) [jain1996artificial]. The study of explainable AI methods seeks to provide insight into how and why ML models make their predictions. Work on explainable AI includes formalising definitions of explainability [doshi2017towards] [lipton2018mythos], development of explainable AI methods themselves and establishing evaluation methods. In this section we provide a brief introduction to some relevant explainable AI methods. There are many different ways to categorise explainable AI methods, e.g. local or global based on the scope of the explanation. Here we present explainable AI methods in three different classes based on the explanation generating mechanism, as shown in Table 1.

Some ML models are perceived as intrinsically interpretable to the user, so we refer to these as interpretable models. For example, the weights for linear regression can be viewed as providing a global explanation showing the relative importance of input features of the model. Further, linear regression can also provide a local explanation by multiplying weights with the specific input feature values. Decision trees are also viewed as intrinsically interpretable when the trees are shallow.
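As a minimal sketch of the two kinds of explanation a linear model offers, the following uses a hypothetical three-feature model. The feature names, weights and bias are invented for illustration and do not come from the case study.

```python
# Hypothetical linear model; names and numbers are illustrative assumptions.
FEATURES = ["heart_rate", "resp_rate", "spo2"]
WEIGHTS = [0.02, -0.05, 0.10]
BIAS = -8.0


def predict(x):
    """Linear prediction: bias plus the weighted sum of the inputs."""
    return BIAS + sum(w * v for w, v in zip(WEIGHTS, x))


def global_importance():
    """Global explanation: features ranked by absolute weight.

    The ranking is only meaningful if the features are on comparable scales.
    """
    return sorted(zip(FEATURES, WEIGHTS), key=lambda fw: abs(fw[1]), reverse=True)


def local_contributions(x):
    """Local explanation: contribution w_i * x_i of each feature for one input."""
    return {name: w * v for name, w, v in zip(FEATURES, WEIGHTS, x)}


patient = [80.0, 18.0, 96.0]
contrib = local_contributions(patient)
print(global_importance())
print(contrib)
```

The local contributions plus the bias sum to the prediction, which is what makes this kind of model directly inspectable.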

When it comes to explaining more complex ML models, e.g. NNs, which are not intrinsically interpretable, a post-hoc explanation can be used to provide insights without knowing the mechanisms by which the model works (e.g. by showing feature importance). There are two main post-hoc explanation classes: feature importance and example-based explanations. Feature importance is the more widely researched method [gilpin2018explaining], which can be model-agnostic (explainable AI methods that work for any class of ML model) or model-specific (explainable AI methods that work only for a given class of ML model). Example-based methods were relatively recently proposed and are often model-agnostic. We now describe each of these two classes in more detail.

2.2.1 Feature Importance Methods

Feature importance methods rank or score the input features based on their influence on the model prediction. There are two main ways to obtain the feature importance score: one is perturbation-based and the other is gradient-based.

Perturbation-based methods observe the difference between the original model prediction and the prediction after perturbation by removing, masking or altering an input feature or set of input features. This approach has wide applicability and can be used on image, tabular, or textual data [montano2003numeric] [liang2017deep]. For example, perturbation was implemented for image classification by occluding different segments of an input image and observing the change in the predicted probability of the classification [zeiler2014visualizing]. There are several popular perturbation-based methods.
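The basic perturbation idea can be sketched on tabular input as follows. The model here is a hypothetical stand-in for an opaque model, and replacing a feature with a baseline value is only one assumed way of "removing" it.

```python
import math


def model(x):
    """Hypothetical opaque model returning a probability from three features."""
    z = 1.5 * x[0] - 2.0 * x[1] + 0.0 * x[2] + 0.5 * x[0] * x[1]
    return 1.0 / (1.0 + math.exp(-z))


def perturbation_importance(f, x, baseline):
    """Score each feature by the drop in prediction when it is set to its baseline."""
    base_pred = f(x)
    scores = []
    for i in range(len(x)):
        perturbed = list(x)
        perturbed[i] = baseline[i]  # "remove" feature i
        scores.append(base_pred - f(perturbed))
    return scores


x = [1.0, 0.5, 3.0]
baseline = [0.0, 0.0, 0.0]
scores = perturbation_importance(model, x, baseline)
print(scores)
```

A positive score means the feature pushed the prediction up, a negative score that it pushed it down, and a zero score that the model ignores it for this input.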

LIME (Local Interpretable Model-Agnostic Explanations) [ribeiro2016should] provides local explanations by approximating a complex ML model with an interpretable model which can then be used to explain the prediction. LIME is based on the assumption that it is possible to fit an interpretable model around a single input sample that mimics the local behaviour of the complex ML model.
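The local-surrogate idea can be illustrated with a simplified sketch. This is not the LIME library or its exact algorithm: it samples Gaussian perturbations around one input, weights them with an assumed exponential proximity kernel, and fits a weighted linear model. The function names, kernel and default parameters are all assumptions made for illustration.

```python
import math
import random


def solve(a, b):
    """Solve a small linear system a @ x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def local_surrogate(f, x0, n_samples=500, scale=0.5, kernel_width=1.0, seed=0):
    """Fit a proximity-weighted linear model to f around x0.

    Returns [intercept, w_1, ..., w_d]; the w_i act as local feature importances.
    """
    rng = random.Random(seed)
    d = len(x0)
    xtx = [[0.0] * (d + 1) for _ in range(d + 1)]
    xty = [0.0] * (d + 1)
    for _ in range(n_samples):
        z = [v + rng.gauss(0.0, scale) for v in x0]
        dist2 = sum((a - b) ** 2 for a, b in zip(z, x0))
        w = math.exp(-dist2 / kernel_width ** 2)  # closer samples count more
        row = [1.0] + z
        y = f(z)
        for i in range(d + 1):
            xty[i] += w * row[i] * y
            for j in range(d + 1):
                xtx[i][j] += w * row[i] * row[j]
    return solve(xtx, xty)  # weighted least-squares normal equations


def black_box(z):
    """Hypothetical non-linear model to be explained."""
    return math.sin(z[0]) + z[1] ** 2


surrogate = local_surrogate(black_box, [0.0, 1.0])
print(surrogate)
```

The fitted slopes describe the model only in the neighbourhood of the chosen input, which is the assumption on which this family of methods rests.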

Several perturbation-based explainable AI methods are based on Shapley values from cooperative game theory [shapley1953value], which provide a way to assign the gain from a cooperative game to its players. Shapley values are used to explain a model prediction by treating input features as the players and the model prediction as the gain resulting from the cooperative game. The cost of computing exact Shapley values grows exponentially with the number of input features, hence approximate methods have been proposed, e.g. aggregation based methods [bhatt2019towards] and Monte Carlo sampling [vstrumbelj2014explaining]. There are also approaches for graph-structured data such as natural language text and images [chen2018shapley].
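For a handful of features the exact computation is feasible and can be sketched directly from the game-theoretic definition. The model, the input and the choice to represent an "absent" feature by a baseline value are illustrative assumptions.

```python
from itertools import combinations
from math import factorial


def shapley_values(value, n):
    """Exact Shapley values for a game with players 0..n-1.

    `value` maps a frozenset of players to the payoff of that coalition.
    Every coalition is enumerated, so the cost grows as 2^n.
    """
    phi = [0.0] * n
    for i in range(n):
        others = [p for p in range(n) if p != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def model(x):
    """Hypothetical model with one additive term and one interaction term."""
    return 2.0 * x[0] + x[1] * x[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]


def coalition_value(present):
    """Prediction when only the features in `present` take their actual values."""
    mixed = [x[i] if i in present else baseline[i] for i in range(len(x))]
    return model(mixed)


phi = shapley_values(coalition_value, len(x))
print(phi)
```

The attributions sum to the difference between the prediction for the input and the prediction for the baseline, and the interaction term is shared equally between the two features that produce it.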

SHAP (SHapley Additive exPlanations) [lundberg2017unified] is another approximation for Shapley values. KernelSHAP is a model agnostic weighted linear regression approximation of the exact Shapley value inspired by LIME. TreeSHAP [lundberg2020local2global] is an efficient estimation approach for tree-based models and is model-specific. The work on SHAP has wider significance as it has defined a new class of additive feature importance measures, unifying several existing explainable AI methods [lundberg2017unified].

Perturbation-based explainable AI methods tend to be very slow since they perturb a single input feature or set of features at a time, so the computational cost increases as the number of input features in the ML model grows. Further, as complex ML models are typically nonlinear, the explanation depends heavily on which features, and how many of them, are perturbed at the same time. In contrast, gradient-based methods are potentially more efficient.

In essence, gradient-based methods calculate the gradient of the output with respect to the input. For example, in an image classification task a “saliency map” is produced by calculating the gradient of the output with respect to the input, identifying pixels that have a significant influence on the classification [simonyan2013deep]. There are a number of variants of gradient-based methods. Gradient * Input multiplies the gradient (strictly the partial derivative) by the input value to improve the sharpness of feature importance [shrikumar2016not]. Similarly, Integrated Gradients computes the average gradient of the output with respect to each input feature by integrating from a baseline to the current feature value [sundararajan2017axiomatic]. DeepLIFT (Deep Learning Important FeaTures) [shrikumar2017learning] works with deep NNs and is a good approximation to Integrated Gradients [ancona2017towards]. Similar to Integrated Gradients, it also defines a “reference activation” which is often viewed as “uninformative” in context, e.g. a totally black image for image classification. It works by comparing the activation of each neuron to its “reference activation” and uses the difference to determine an importance score for each input.
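Gradient * Input and Integrated Gradients can be sketched on a small hypothetical model. To stay self-contained the sketch estimates gradients by central finite differences; that is an assumption made for illustration, and a practical implementation would use automatic differentiation.

```python
import math


def model(x):
    """Hypothetical differentiable model returning a probability."""
    z = 0.8 * x[0] - 1.2 * x[1] + 0.5 * x[0] * x[1]
    return 1.0 / (1.0 + math.exp(-z))


def gradient(f, x, eps=1e-5):
    """Central finite-difference estimate of the gradient of f at x."""
    grads = []
    for i in range(len(x)):
        hi = list(x)
        lo = list(x)
        hi[i] += eps
        lo[i] -= eps
        grads.append((f(hi) - f(lo)) / (2 * eps))
    return grads


def gradient_times_input(f, x):
    """Gradient * Input: partial derivative scaled by the feature value."""
    return [g * v for g, v in zip(gradient(f, x), x)]


def integrated_gradients(f, x, baseline, steps=200):
    """Average the gradient along the straight path from baseline to x
    (midpoint Riemann sum), then scale by the feature's distance from baseline."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (v - b) for v, b in zip(x, baseline)]
        for i, g in enumerate(gradient(f, point)):
            total[i] += g
    return [(v - b) * t / steps for v, b, t in zip(x, baseline, total)]


x = [1.0, 2.0]
baseline = [0.0, 0.0]
gxi = gradient_times_input(model, x)
ig = integrated_gradients(model, x, baseline)
print(gxi, ig)
```

Up to numerical error, the Integrated Gradients attributions sum to the difference between the model output at the input and at the baseline, which is a useful sanity check on any implementation.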

2.2.2 Example-Based Methods

Example-based explanations explain the ML model by selecting particular instances from the dataset or creating new instances. They comprise counterfactual explanations, adversarial examples and influential instances, see Table 1.

Counterfactual explanations for ML models were introduced by Wachter et al [wachter2017counterfactual] but bear similarities to earlier work in psychology [kahneman1981simulation]. Counterfactuals can be thought of as “what is not, but could have been”. Counterfactual explanation is intended to produce a sparse human-interpretable example by changing some input features to achieve a different output, i.e. the desired output by the user (what could have been). To be useful, counterfactual explanations should minimise the distance from the current input. Identification of suitable metrics for measuring the distance is an active research topic, see for example [mothilal2020explaining], [sharma2019certifai].
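A counterfactual search can be sketched as follows. The classifier is hypothetical, and an exhaustive search over a small grid with an L1 distance stands in for the optimisation-based formulations in the literature; both choices are assumptions made to keep the example short.

```python
from itertools import product


def classify(x):
    """Hypothetical classifier: class 1 if a weighted score exceeds a threshold."""
    return 1 if 2.0 * x[0] + 1.0 * x[1] - 0.5 * x[2] > 4.0 else 0


def counterfactual(f, x, grids):
    """Return the grid point closest to x (L1 distance) that f classifies differently."""
    original = f(x)
    best, best_dist = None, float("inf")
    for cand in product(*grids):
        if f(cand) == original:
            continue
        dist = sum(abs(c - v) for c, v in zip(cand, x))
        if dist < best_dist:
            best, best_dist = list(cand), dist
    return best, best_dist


x = [1.0, 1.0, 1.0]
grid = [i * 0.5 for i in range(7)]  # candidate values 0.0, 0.5, ..., 3.0
cf, dist = counterfactual(classify, x, [grid, grid, grid])
print(cf, dist)  # -> [2.0, 1.0, 1.0] 1.0
```

Here the nearest counterfactual changes a single feature, so the explanation reads as "had the first feature been 2.0 rather than 1.0, the prediction would have been different".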

Adversarial examples are typically generated by adding small, intentional perturbations to the input features to cause an ML model to make an incorrect prediction [szegedy2013intriguing]. There are many techniques to create adversarial examples, e.g. by minimising the distance between the adversarial example and the input instance, which is similar to counterfactual examples. However, adversarial examples are intended to deceive the ML model rather than to interpret it. Therefore, the changes in the inputs are often imperceptible to a human observer, which makes them more popular for use in object classification [xie2020adversarial] [jia2017adversarial] [sato2018interpretable]. For example, adversarial images have been added to the training dataset to improve model robustness [ayers2020parot].
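One well-known way of constructing such a perturbation is the fast gradient sign method, shown here on a hypothetical logistic model. The cited works may use other techniques; the weights, input and step size are invented for illustration.

```python
import math

# Hypothetical logistic model; weights and bias are illustrative assumptions.
W = [2.0, -3.0, 1.0]
B = 0.5


def predict_proba(x):
    """Probability of class 1 under the logistic model."""
    z = B + sum(w * v for w, v in zip(W, x))
    return 1.0 / (1.0 + math.exp(-z))


def sign(g):
    """Return -1, 0 or 1 according to the sign of g."""
    return (g > 0) - (g < 0)


def fgsm(x, y_true, epsilon):
    """Move every feature by epsilon in the direction that increases the loss.

    For a logistic model with cross-entropy loss, the gradient of the loss
    with respect to the input is (p - y) * w.
    """
    p = predict_proba(x)
    grad = [(p - y_true) * w for w in W]
    return [v + epsilon * sign(g) for v, g in zip(x, grad)]


x = [0.2, 0.1, 0.1]
x_adv = fgsm(x, y_true=1, epsilon=0.15)
print(predict_proba(x), predict_proba(x_adv))
```

No feature moves by more than epsilon, yet the predicted class changes, which is the behaviour that adversarial training tries to remove.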

Influential instances are intended to identify which input instances have a strong effect on the trained model by treating the model as a function of the training data rather than fixed. Two approaches for identifying influential instances are often used: deletion diagnostics and influence functions. Deletion diagnostics is not practical for big training datasets as it requires removing each training instance in turn and retraining the model to observe the effect of that instance, until the effect of all of the training data has been observed. Rather than deleting the training instance, influence functions up-weight the instance in the loss function by a very small amount in order to measure the effects of this instance on the model parameters or predictions. It is an approximation method, but more computationally efficient, which is especially important when the training dataset is very large (see Section 4.2 for more details on the use of influence functions).
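Deletion diagnostics can be sketched on a toy problem where retraining is cheap. The data are invented, and a one-dimensional least-squares line stands in for the ML model.

```python
def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    slope = sxy / sxx
    return slope, my - slope * mx


def deletion_influence(points, x_query):
    """For each training point, the change in the prediction at x_query
    caused by including that point (full-model prediction minus the
    prediction of the model refitted without it)."""
    slope, intercept = fit_line(points)
    full_pred = slope * x_query + intercept
    influences = []
    for i in range(len(points)):
        reduced = points[:i] + points[i + 1:]
        s, c = fit_line(reduced)  # one refit per training instance
        influences.append(full_pred - (s * x_query + c))
    return influences


# Hypothetical training data; the last point is a deliberate outlier.
data = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.9), (5, 5.1), (6, 12.0)]
influence = deletion_influence(data, x_query=7)
most_influential = max(range(len(data)), key=lambda i: abs(influence[i]))
print(influence, most_influential)
```

The loop makes the cost visible: one full refit per training instance, which is what influence functions are designed to avoid.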

Type of explanation    Scope    Model specific or agnostic    Examples of explainable AI methods
Interpretable models   Global   Specific                      A model by itself interpretable for the user, e.g. linear/logistic regression, decision tree
Feature importance     Local    Agnostic                      LIME
Feature importance     Local    Agnostic                      KernelSHAP
Feature importance     Local    Specific                      TreeSHAP
Feature importance     Local    Specific                      Gradient * Input
Feature importance     Local    Specific                      Integrated Gradients
Feature importance     Local    Specific                      DeepLIFT
Example-based          Local    Agnostic                      Influential instances
Example-based          Local    Agnostic                      Counterfactual explanations
Example-based          Local    Agnostic                      Adversarial examples
Table 1: Categorisation of Explainable AI Methods with Examples

3 Explainability in the ML life-cycle

In order to explore the role of explainability in assuring safety of ML in healthcare, it is important to consider it in the context of the ML life-cycle. The development process for ML typically includes data management, model selection, model learning and model V&V, as shown in Figure 1. For our purposes, the role of explainability is therefore two-fold: explaining the data and explaining the ML model itself. Explainable AI methods, such as those described in Section 2.2, can be employed to explain ML models.

Figure 1 also makes explicit the need for a deployment decision prior to operation (which may be supported by a safety case). It also shows the relevant stakeholders who might be interested in the explanations in the different phases. Our focus here is on the development activities, but we briefly consider the potential role of explainability in operation, see Section 3.5; for a discussion of the wider role of explainability including incident and accident investigation see [mcdermid2021philtransa].

In the rest of this section we discuss the role of explainability against each stage of the development process shown in Figure 1.

3.1 Data Management

The first phase of the ML development process is data management. The Royal Society’s Policy Briefing on Explainable AI emphasises that data quality and provenance are part of the explainability pipeline, specifically saying that “Understanding the quality and provenance of the data used in AI systems is therefore an important part of ensuring that a system is explainable” [policybriefing]. This includes showing that the data comes from appropriate sources to address the problem that the ML model targets. A widely accepted, harmonised framework for assessment of electronic health record (EHR) data quality highlights conformance, completeness and accuracy [kahn2016harmonized]; we prefer accuracy to the original term plausibility because plausibility means that the values are in the possible range, whereas accuracy means that the data is not only possible but correct. These criteria would be applicable to any ML systems developed using EHR data. In addition to these three criteria we also identify data relevance and data balance as being particularly important to the development of ML models [hawkins2021guidance] [paterson2021assuring]. As real world data may contain biases, contain errors, and be incomplete, explaining how these five criteria are met can be at least as important as explaining the ML model itself.

For safety assurance, a safety case would need to address all five of the criteria. The evidence to ensure data quality is essentially technical, for example data conformance would include showing that data observes defined formats, e.g. correct units for weight [kahn2016harmonized]. However, demonstrating data relevance and data balance would include a judgement that the training data contain clinically relevant factors and are balanced for the problem being addressed; to do this requires clinical expertise. We acknowledge that it is often not possible to choose data that gives both feature balance and class balance. Instead, it might be useful to explain that some important features are reasonably balanced, e.g., gender, if the model is intended to be used for both male and female patients. Class balance has long been an active research area in the ML community [batista2004study]. It should be noted that data management is both crucial and labour intensive. Indeed, it may consume more effort than the rest of the ML life-cycle. Thus, arguments about data management will be an important part of the ML safety case.
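The technical part of this evidence can be produced with simple checks, sketched below on invented records. The field names, the plausible weight range and the use of a range check as a proxy for unit conformance are all assumptions; judging relevance and acceptable balance still requires clinical expertise.

```python
from collections import Counter

# Hypothetical records; field names and values are invented for illustration.
RECORDS = [
    {"weight_kg": 72.0, "sex": "F", "ready": 1},
    {"weight_kg": 81.5, "sex": "M", "ready": 0},
    {"weight_kg": None, "sex": "M", "ready": 0},
    {"weight_kg": 6800.0, "sex": "F", "ready": 1},  # out of range, e.g. a unit error
]


def completeness(records, field):
    """Fraction of records in which the field is present."""
    return sum(r[field] is not None for r in records) / len(records)


def in_expected_range(records, field, low, high):
    """Fraction of present values inside [low, high]; a rough proxy for unit errors."""
    present = [r[field] for r in records if r[field] is not None]
    return sum(low <= v <= high for v in present) / len(present)


def balance(records, field):
    """Proportion of records taking each value of a feature or class label."""
    counts = Counter(r[field] for r in records)
    return {k: c / len(records) for k, c in counts.items()}


print(completeness(RECORDS, "weight_kg"))
print(in_expected_range(RECORDS, "weight_kg", 1.0, 400.0))
print(balance(RECORDS, "sex"), balance(RECORDS, "ready"))
```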

Figure 1: Process for development and use of an ML System

3.2 Model Selection

The second phase in the development process is model selection. It is important to understand what kind of problem is being addressed and what kind of ML methods are suitable for the problem at hand. For example, if the problem is to identify optimal treatments in healthcare, then reinforcement learning (RL) might be more appropriate than others, as RL is widely used in complex decision making tasks to find an optimal policy [sutton1998introduction]. On the other hand, if the problem is image classification then NNs might be more appropriate. In addition, another important aspect to consider at this stage is the explainability of the ML model. In Section 2.2 we identified that some ML models are intrinsically interpretable whereas others need to be supplemented with post-hoc explainable AI methods. Guidelines on model selection, balancing model performance against explainability, have been proposed [markus2020role].

When it comes to model selection, safety requirements are often implicitly transformed into explainability, robustness and performance requirements. Note that sometimes people make statements such as “use of deep NNs is not safe”. When they make this kind of statement, they are implicitly making the judgement that deep NNs are opaque, i.e., not explainable. This is why we argue that safety requirements are partially transformed into explainability requirements but there is a necessary trade-off between explainability and performance [markus2020role]. These requirements can only be met by considering the ability to produce effective explanations either in later phases of development or in operation. The rationale for the model choice, including the performance-explainability trade-offs, needs to be documented in the safety case.

3.3 Model Learning

The third phase in the development process is model learning. For model learning, hyperparameter selection, loss function definition and class balance need to be considered in order to meet safety requirements. In addition, explainable AI methods can help in terms of failure class understanding and robustness. At this stage two particular explainable AI methods are relevant. One is adversarial examples and the other is influential instances, see Section


  • Adversarial examples are often added to training data to improve model robustness in object classification tasks. This is often referred to as adversarial training or robustness training [huang2015learning] [ayers2020parot]. This is becoming widespread in domains such as autonomous vehicles, for example in improving performance at reading road signs under adverse conditions [eykholt2018robust], but we believe it has wider applicability, e.g. for image classification in radiology.

  • Influential instances are useful for “model debugging” as they can help to understand model behaviour and predictions by treating the model as a function of the training dataset, rather than being fixed [molnar2020interpretable]. Because of the computational cost, influential instances were not widely used until more efficient algorithms, such as influence functions, made it possible to apply the approach to large datasets [koh2017understanding]. With these algorithmic improvements, the use of influential instances is likely to increase, helping developers to decide what data to include in or exclude from model training in order to improve model predictions and to debug the model.

Note that these forms of explainable AI methods are of particular interest to ML model developers, but they help to ensure the soundness of the learning process and thus contribute to safety assurance.

3.4 Model Verification and Validation

The final phase in the development process is model V&V. We believe that explainability has a general role in validation but could also have a role in verification if there are specific explainability requirements to verify. However, such explainability requirements need to be defined in a specific situation, therefore our focus here is on validation. We derived three distinct objectives, reconciling approaches proposed by the FDA [us2019proposed] and the IMDRF [imdrf2017software], which reflect key criteria for use of ML models in healthcare, although we note that explanations cannot guarantee that all these criteria are met [markus2020role].

First is performance, which can be measured using standard ML practices, e.g. evaluation of the proportion of false positives and false negatives, or the AUC-ROC. This is necessary but not sufficient to assure safety of ML, see [mcdermid2021philtransa].
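These standard measures can be computed as in the following sketch. The labels, scores and the 0.5 decision threshold are invented for illustration, and the AUC-ROC is obtained from its rank interpretation: the probability that a randomly chosen positive case scores higher than a randomly chosen negative one.

```python
def error_rates(y_true, y_pred):
    """Return (false positive rate, false negative rate) for binary labels."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return fp / y_true.count(0), fn / y_true.count(1)


def auc_roc(y_true, scores):
    """AUC-ROC as the fraction of positive/negative pairs ranked correctly,
    counting ties as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical ground truth and model scores.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

fpr, fnr = error_rates(y_true, y_pred)
auc = auc_roc(y_true, scores)
print(fpr, fnr, auc)  # -> 0.25 0.25 0.9375
```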

The second objective is analytical or technical validation, showing that the software for the ML models is correctly constructed, and that it is accurate and reliable. Further, the ML models produce repeatable results, giving the same predictions from the same inputs. This objective can be met by employing established safety-critical software development practices including formal specifications, traceability from specification to implementation, use of test coverage criteria and static code analysis methods [McDermid2020SafetyOA]. We do not see a role for explainable AI methods for this aspect of validation.

Third is clinical validation which measures the ability of the system to generate a clinically meaningful output associated with the intended use of the system in its operational environment. Here we define two specific sub-objectives where we believe explainable AI methods have a role in supporting clinical validation:

  • Clinical association – demonstrate that the association between the system output and the targeted clinical condition in the intended population is supported by evidence;

  • Robustness – demonstrate the ability to distinguish the different classes of intended condition or recommended treatment without over-reliance on a specific input feature.

Feature importance explanations can help to demonstrate clinical association by showing that the output predictions are based on clinically meaningful and relevant factors of the input. This involves ranking input features based on their importance score or contribution score and making the rankings visible to clinicians so that they can exercise clinical judgement. In addition, this goal links back to data relevance and data balance, as data balance is directly related to the intended user population for the ML system, such as gender balance as we mentioned above. Thus clinical association is addressed from two perspectives: input features are relevant (data relevance) and outputs are based on relevant inputs (feature importance).

Example-based methods, especially counterfactual explanations, can help to assess model robustness. As we mentioned in Section 2.2, counterfactuals are generated by minimising the distance from the original input while producing a different prediction. Therefore, the greater the distance from an initial input to a counterfactual, the more robust the ML model is, i.e. the model is “harder to fool”. Thus the distance measure between the initial input and its corresponding counterfactuals can be used to define a robustness score for the ML model, see for example [sharma2019certifai]. In an extreme case, if only one feature changed in the counterfactual examples from the original instance, this is analogous to a “single point of failure”, which is a situation that needs to be avoided (the concept has origins in nuclear safety engineering [nuclear] but is now quite widely used in critical industries). Thus, counterfactuals can also help show that this standard safety criterion is met if multiple input features have to change to produce a different classification.
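The two checks described above (distance to the counterfactuals, and the number of features that must change) can be sketched as follows. This is a minimal illustration only: the feature names, values and ranges are hypothetical, the range-normalised L1 distance is an assumed choice of metric, and the summary is in the spirit of a distance-based robustness score rather than the exact definition used in [sharma2019certifai].

```python
def changed_features(original, counterfactual):
    """Names of the features whose value differs between the two instances."""
    return sorted(k for k in original if original[k] != counterfactual[k])


def normalised_l1(original, counterfactual, ranges):
    """Mean absolute change per numeric feature, each scaled by its value range."""
    total = 0.0
    for name, span in ranges.items():
        total += abs(original[name] - counterfactual[name]) / span
    return total / len(ranges)


def robustness_summary(original, counterfactuals, ranges):
    """Summarise how far, and across how many features, the counterfactuals lie."""
    counts = [len(changed_features(original, cf)) for cf in counterfactuals]
    distances = [normalised_l1(original, cf, ranges) for cf in counterfactuals]
    return {
        "min_features_changed": min(counts),
        "mean_distance": sum(distances) / len(distances),
        # One changed feature is enough to flip the prediction: flag for review.
        "single_point_of_failure": min(counts) <= 1,
    }


# Hypothetical instance and counterfactuals (not taken from the case study).
original = {"peep": 10.0, "fio2": 100.0, "resp_rate": 24.0}
ranges = {"peep": 20.0, "fio2": 80.0, "resp_rate": 40.0}
counterfactuals = [
    {"peep": 5.0, "fio2": 40.0, "resp_rate": 24.0},
    {"peep": 0.0, "fio2": 100.0, "resp_rate": 24.0},
]
print(robustness_summary(original, counterfactuals, ranges))
```

In this toy data the second counterfactual changes a single feature, so the summary flags a potential single point of failure even though the first counterfactual needs two changes.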

The use of explainable AI methods in support of ML model V&V will contribute evidence to the safety case, complementing other activities including performance assessment and safety-critical software engineering. It should be noted that explanations should be re-generated when the ML models are updated so that they accurately reflect the state of the models.

3.5 Operation

As discussed in Section 2.1, assurance should be considered to be a “through life” activity. This would include, for example, a clinician seeking assurance about a particular prediction, especially if acting on it can have a profound impact on patient safety. Explainable AI methods can play a role here. Local feature importance explanation is relevant but counterfactual examples also have a role, for example, helping a clinician to decide whether or not a proposed change in treatment is likely to bring about the desired effect for a particular patient. Again, the role and significance of explainability in operation is examined in more detail in [mcdermid2021philtransa].

4 Case Study

In this section we present a concrete healthcare case study to illustrate the role of explainable AI methods, as presented in Section 3. Note that the case study does not consider explanations for data management, but see our previous work for an illustration of the rationale for data inclusion for this case study [jia2021]. The case study illustrates many, but not all, of the explainable AI methods for safety assurance in ML development, and also briefly indicates the potential role of explainable AI methods during operation. The case study focuses on use of mechanical ventilation in Intensive Care Units (ICUs). Provision of mechanical ventilation is complex and consumes a significant proportion of ICU resources [wunsch2013icu] [ambrosino2010difficult]. It is of critical importance to determine the right time to wean the patient from mechanical ventilation. However, assessment of a patient’s readiness for weaning is a complex clinical task and it is potentially beneficial to employ ML to assist clinicians [kuo2015improvement]. The case study explores the use of ML to support clinicians in making weaning decisions.

Invasive mechanical ventilation is used when patients cannot breathe unaided, and requires the insertion of a tube into the trachea of the patient. The term intubation is used for insertion of the tube and extubation for removal of the tube. From a patient safety point of view, both early and late weaning are problematic. Late extubation exposes a patient to discomfort and continued risk of complications such as pneumonia from prolonged intubation. Early extubation can lead to the need for re-intubation; this is referred to as “extubation failure”. The case study is particularly concerned with predicting patient readiness for extubation so as to avoid the negative side effects of mis-timed extubation. Put simply, the safety requirement is “prediction of readiness for extubation is timely”. The case study is based on the MIMIC-III data set [johnson2016mimic] and used a convolutional NN (CNN) to predict readiness for extubation in the next hour.

4.1 Model Selection

Model selection is strongly influenced by performance, as previously indicated. There is a range of performance metrics, including false positives which, in this case study, would mean indicating that a patient is ready for extubation when it was actually premature. Here we use the AUC-ROC performance measure, i.e. the area under the curve that plots the true positive rate against the false positive rate. For a “random” model the AUC-ROC would be 0.5 and for a “perfect” model it would be 1.
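As a small illustration of these metrics, the sketch below computes the false positive and false negative rates at a threshold, and the AUC-ROC through its rank interpretation (the probability that a randomly chosen positive case scores higher than a randomly chosen negative one, with ties counted as one half). The labels and scores are made up; label 1 is assumed to mean "ready for extubation".

```python
def confusion_rates(y_true, y_score, threshold=0.5):
    """Return (false_positive_rate, false_negative_rate) at the given threshold."""
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, y_score):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1  # e.g. "ready for extubation" predicted prematurely
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return fpr, fnr


def auc_roc(y_true, y_score):
    """AUC-ROC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC-ROC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc_roc(labels, scores))          # 0.75
print(confusion_rates(labels, scores))  # (0.0, 0.5)
```

A model that gives every patient the same score yields 0.5 under this definition, and a model that ranks every positive above every negative yields 1, matching the "random" and "perfect" reference points.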

For the case study, the performance of a number of ML models, including CNNs, was evaluated on the same dataset to support model selection, see Figure 2. CNNs have the best performance and, more importantly, achieve better performance than intrinsically interpretable ML models such as decision trees. Logistic regression is the closest of the intrinsically interpretable ML models, but the performance difference is still considerable. As mentioned in Section 3.2, there is a trade-off between performance and explainability. If performance overrides the need for explainability, then a CNN should be chosen. On the other hand, if intrinsic interpretability is more important, then logistic regression should be chosen. In this case study, CNNs have been chosen, and post-hoc explainable AI methods can be used to help to explain the model, see the rest of the section for details.

Figure 2: Performance of ML models

Abbreviation | Model
CNN | Convolutional Neural Network
ANN | Artificial Neural Network
LR | Logistic Regression
SVM | Support Vector Machine
DT | Decision Tree
RF | Random Forest
Table 2: Legend for Figure 2

4.2 Model Learning

As we indicated in Section 3.3 there are two explainable AI methods that can be useful at this stage: adversarial examples and influential instances. Because adversarial examples are difficult to generate for tabular data, here we focus on the use of influential instances in their role for “debugging” ML models. This shows how they provide assurance about the appropriateness of the ML model learning process, in the context of the safety requirement.

When preparing the dataset for the case study, one issue that came up was whether or not to include the extubation failure patients. To be consistent with previous studies we defined extubation failure as the need for re-intubation within 48 hours [boles2007weaning] [esteban1995comparison]. Some of the literature suggests that premature extubation could cause extubation failure [ambrosino2010difficult]. Therefore, the label in the dataset for extubation failure patients might not be optimal, so it might negatively influence the prediction. We can view this as a failure class as explained in Section 3.3. To explore this issue further, we trained two CNN models to predict the readiness for extubation in the next hour in order to observe the effect of extubation failure patients. In the first model, we excluded all of the extubation failure patients from the training dataset. In the second model, we included all of the extubation failure patients in the training dataset. The accuracy of the second model is slightly reduced by comparison with the first model. We selected one of the test instances that was “interesting” in that the two models produced different predictions. For this instance, the first model predicted the patient should continue to be intubated, which is also the true (correct) label. However, the second model predicted that the patient was ready for extubation in the next hour. We used influence functions to identify the influential training instances for this test instance.

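The key idea behind influence functions is to up-weight the loss of a training instance $z$ by an infinitesimally small step $\epsilon$, which results in new model parameters:

$$\hat{\theta}_{\epsilon,z} = \arg\min_{\theta \in \Theta} \frac{1}{n}\sum_{i=1}^{n} L(z_i, \theta) + \epsilon L(z, \theta)$$

where $\theta$ is the model parameter vector and $\hat{\theta}_{\epsilon,z}$ is the model parameter after upweighting $z$ by $\epsilon$. $L$ is the loss function used for training the model. The influence of upweighting $z$ on the parameters was solved by Cook and Weisberg [cook1982residuals] as follows:

$$I_{\text{up,params}}(z) = \left.\frac{d\hat{\theta}_{\epsilon,z}}{d\epsilon}\right|_{\epsilon=0} = -H_{\hat{\theta}}^{-1}\nabla_{\theta}L(z,\hat{\theta})$$

where $H_{\hat{\theta}}$ is the Hessian matrix and $\nabla_{\theta}L(z,\hat{\theta})$ is the loss gradient with respect to the parameters for the training instance $z$.

Therefore, we can approximate the effect of removing training instance $z$ by upweighting it by $\epsilon = -\frac{1}{n}$ without retraining the model, i.e. $\hat{\theta}_{-z} - \hat{\theta} \approx -\frac{1}{n} I_{\text{up,params}}(z)$. In this work, we use the influence functions algorithm developed by Koh and Liang [koh2017understanding].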
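To make the removal approximation concrete, the sketch below applies it to a deliberately tiny model: a one-parameter least-squares fit y ≈ θx with per-instance loss 0.5(θx − y)², for which the Hessian and gradient are scalars and exact retraining is available in closed form. This is not the CNN or the algorithm of [koh2017understanding]; the data values are invented and the model is an assumption chosen so that the estimate can be checked against true leave-one-out retraining.

```python
def fit_theta(xs, ys):
    """Closed-form least-squares slope for y ~ theta * x."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def influence_on_params(xs, ys, index):
    """I_up,params(z) = -H^{-1} * grad L(z, theta_hat), with L = 0.5 * (theta * x - y) ** 2."""
    n = len(xs)
    theta_hat = fit_theta(xs, ys)
    hessian = sum(x * x for x in xs) / n        # average second derivative of the loss
    x, y = xs[index], ys[index]
    gradient = (theta_hat * x - y) * x          # dL/dtheta at theta_hat for instance z
    return -gradient / hessian


def estimated_theta_without(xs, ys, index):
    """Removal estimate: theta_hat - (1/n) * I_up,params(z), with no retraining."""
    n = len(xs)
    return fit_theta(xs, ys) - influence_on_params(xs, ys, index) / n


def retrained_theta_without(xs, ys, index):
    """Ground truth: refit after actually dropping the instance."""
    return fit_theta(xs[:index] + xs[index + 1:], ys[:index] + ys[index + 1:])


xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 3.9]
for i in range(len(xs)):
    print(i, estimated_theta_without(xs, ys, i), retrained_theta_without(xs, ys, i))
```

With only four training instances the step of size 1/n is far from infinitesimal, so the estimate moves in the same direction as true retraining but undershoots it; the gap shrinks as the training set grows.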

Figure 3: Top 30 most influential training instances
Figure 4: Distribution of influential instances: negative influences outweigh positive influences

Figure 3 shows the top 30 most influential training instances for this test instance. From the figure, it can be seen that among the training instances with a negative influence on the prediction, there are three instances of patients who had extubation failure. This shows that including the extubation failure patients made the predictions for the test instance worse. This suggests that the extubation might have been premature for the extubation failure patients, so their inclusion could lead the model to predict that patients are ready for extubation when they are not. Figure 4 shows some of the most influential data points from the extubation failure patients and that more of them have a negative influence than a positive influence. This also explains why the prediction for this test instance is wrong in the second model. Thus, we decided to exclude the extubation failure patients from the training dataset and the first CNN model was taken forward to the V&V stage. It is recognised, however, that this is a difficult clinical judgement – but that is exactly the reason for generating the explanations, since this gives the clinicians a better understanding of model behaviour as a basis for making the judgement.
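The kind of tally behind this analysis can be sketched as below: rank training instances by the magnitude of their influence score for one test instance, then count how many fall into each patient group and sign. The record layout, the group names, the scores and the convention that a negative score means "makes the test prediction worse" are all assumptions for illustration; they are not the case study's data.

```python
def top_influential(records, k):
    """Return the k records with the largest absolute influence score."""
    return sorted(records, key=lambda r: abs(r["score"]), reverse=True)[:k]


def tally_by_group(records):
    """Count records per (group, sign of influence)."""
    counts = {}
    for record in records:
        sign = "negative" if record["score"] < 0 else "positive"
        key = (record["group"], sign)
        counts[key] = counts.get(key, 0) + 1
    return counts


# Hypothetical influence scores of training instances for one test instance.
records = [
    {"id": 1, "group": "extubation_failure", "score": -0.9},
    {"id": 2, "group": "other", "score": 0.7},
    {"id": 3, "group": "extubation_failure", "score": -0.6},
    {"id": 4, "group": "other", "score": -0.2},
    {"id": 5, "group": "extubation_failure", "score": 0.1},
]
print(tally_by_group(top_influential(records, 3)))
```

A tally dominated by negatively influential instances from one group is a prompt for clinical review of that group's labels, not a conclusion in itself.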

In summary, the use of influential instances has helped to show an appropriate process for meeting safety requirements and it explicitly contributes to the safety requirement “to extubate in a timely manner”.

4.3 Model Verification and Validation

In this section, we focus on clinical validation, as set out in Section 3.4, and illustrate the use of explainable AI methods for demonstrating clinical association and robustness. We do not consider analytical validation here.

4.3.1 Feature importance explanations

Here we illustrate the role of feature importance in satisfying the clinical association safety assurance objective. This is done using DeepLIFT [shrikumar2017learning] which is a model-specific explainable AI method for deep NNs. An overview of the results of using DeepLIFT is shown in Figure 5; these values are averaged over the whole dataset, so this can be viewed as global feature importance.

Figure 5: Feature Importance for the CNN Model

The feature ranking correlates well with clinical expectations, helping to give confidence in the model. Those features that score near zero in Figure 5, e.g. ethnicity, gender and age, have little influence on the weaning decision. Patients who are undergoing invasive mechanical ventilation are often sedated to maintain physiological stability and to control pain levels. Sedation is reflected in the Richmond Agitation-Sedation Scale (RAS), with negative values representing sedation and 0 meaning that the patient is alert and calm, and thus more likely to be suitable for extubation. This is of particular practical importance as sedation is under the clinician’s direct control. Also, a positive Spontaneous Breathing Trial (SBT) result contributes to confidence that the patient is ready to start extubation. Other factors with a strong weight, e.g. inspired O2 fraction, ventilator category, peak inspiratory pressure, and positive end-expiratory pressure (PEEP) set, are also important clinical factors that a doctor will consider in the weaning decision-making process. Thus the feature importance ranking is indicative of a valid clinical association. Note that it is a clinical judgement whether or not this ranking is appropriate. Nonetheless, the benefit of the feature importance results is that they enable clinical judgement to be applied despite the opacity of the CNN model, and hence this contributes to safety assurance. Also, feature importance is of most value in making the behaviour of the ML model visible to clinicians, rather than directly to patients.

4.3.2 Counterfactual explanations

Features Original instance Counterfactual Examples 1 2 3 4
Admit Type Emergency
Ethnicity White
Gender Female
Age 78.2
Admission Weight 86.5
Heart Rate 119 110
Respiratory Rate 24 26 21
SpO2 98 96
Inspired O2 Fraction 100% 40%
PEEP set 10 5 5 5 0
Mean Airway Pressure 14 10 10
Tidal Volume (observed) 541 560
PH (Arterial) 7.46
Respiratory Rate(Spont) 0 24 21
Richmond-RAS Scale -1 0 1
Peak Insp. Pressure 21
O2 Flow 5 10
Plateau Pressure 19
Arterial O2 pressure 124 108 118
Arterial CO2 Pressure 33
Blood Pressure (systolic) 101
Blood Pressure (diastolic) 65
Blood Pressure (mean) 76
Spontaneous breathing trials No result
Ventilator Mode
Predicted outcome 0.93 0.44 0.17 0.36 0.46
Table 3: Counterfactual examples for a given original instance

The final concern in model V&V is robustness of the ML model and here we show how to use counterfactuals to demonstrate robustness. Table 3 shows a set of counterfactual examples for a particular patient, identifying which features need to change in order to “flip” the prediction from continued intubation to extubation. The left hand column shows the 25 features used by the model and the prediction of the ML model is included in the bottom row. The original instance is shown first, with the four rightmost columns showing counterfactual examples. These counterfactual examples have been generated using Diverse Counterfactual Examples (DiCE) [mothilal2020explaining]. Certain features cannot be varied, e.g. age and gender; the dashes in the rightmost four columns indicate no change from the original input. The change in prediction is shown in the bottom row.

Identifying counterfactual examples is undertaken by minimising the distance from the original instance to a counterfactual that produces a different prediction. Thus, given the way these counterfactual examples are generated, they can help to gain confidence in model robustness and the absence of single points of failure. In this case, as shown in Table 3, a number of features have to change to “flip” the prediction. However, one instance is not sufficient to show ML model robustness. There would be a substantial computational cost in doing this for more of the input instances in the dataset in order to generate a robustness score as defined in [sharma2019certifai].

Another use of counterfactual examples is to inform clinician judgement which is considered in Section 4.4 below.

4.4 Operational use of the ML Model

The operation of ML models is often uncertain. Thus there is merit in extending the notion of assurance to operation, providing support to a clinician to give confidence to act on the particular model prediction. One way of approaching this is to use local explanations.

Figure 6 visualises the feature importance values for a single patient, i.e. local feature importance. Here, a positive feature importance score contributes to moving the output towards intubation being continued. In contrast, a negative feature importance score contributes to moving the output towards extubation. The sum of the positive feature contributions is far greater than the sum of the negative feature contributions, thus the prediction for this patient, for the next hour, is to remain intubated.

Figure 6: Feature Importance for a single patient
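Reading a local explanation of this kind amounts to separating the positive and negative contributions and comparing their totals. The sketch below does this for a made-up set of contribution scores; the feature names and values are hypothetical, and the sign convention (positive towards continued intubation, negative towards extubation) follows the description above. The net direction is only a summary of the explanation, not a substitute for the model output itself.

```python
def summarise_contributions(contributions):
    """Split local feature contributions by sign and rank features by magnitude."""
    positive = sum(v for v in contributions.values() if v > 0)
    negative = sum(v for v in contributions.values() if v < 0)
    ranked = sorted(contributions, key=lambda name: abs(contributions[name]), reverse=True)
    net = positive + negative
    return {
        "positive_total": positive,
        "negative_total": negative,
        "net": net,
        # Assumed convention: positive pushes towards continued intubation.
        "net_direction": "continue intubation" if net > 0 else "extubation",
        "ranked_features": ranked,
    }


# Hypothetical contribution scores for one patient.
local_scores = {
    "inspired_o2_fraction": 0.5,
    "peep_set": 0.25,
    "ras_scale": -0.125,
    "age": 0.0,
}
print(summarise_contributions(local_scores))
```

Listing the features by magnitude alongside the totals lets a clinician see which inputs dominate the prediction for this patient.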

However, clinicians might want to find out when the patient would be ready to extubate. This brings us back to counterfactuals. The counterfactual examples shown in Table 3 are for the same patient shown in Figure 6, and could potentially help the clinician to identify actions to take so that the patient becomes ready to extubate. As shown in the table, it is beneficial to generate multiple counterfactual examples, so that the clinicians can choose one that is most practical to implement. In the counterfactual examples shown, changes in the ventilator mode and SBT successfully completed would both indicate progress towards extubation. Note our model has not been used in operation yet, so the material presented here just illustrates the possibilities.

4.5 Safety arguments

As explained in Section 2.1, it is common practice to present the arguments and evidence that provide assurance that a system is acceptably safe to deploy in a safety case. In this case study, the safety argument is presented using GSN. Before we describe the safety argument we have developed, we briefly introduce the notation.

Figure 7: Goal Structuring Notation
Figure 8: Partial Safety Argument for Weaning ML Model emphasising Explainability

A legend showing the key elements of GSN is presented in Figure 7; a detailed description of the notation can be found in [GSNCommu77:online]. The goals – claims that we wish to make and support – are shown as rectangles and they can be decomposed into sub-goals, thus forming a tree. Goals are understood in a context – for example, the operating environment for the system or the safety requirements. Where the decomposition of goals is not obvious this is explained through a strategy, represented as a parallelogram. The leaf-level goals are supported by solutions, represented as circles; the solutions provide references to evidence that supports the argument. Incomplete parts of the argument are shown with a diamond, meaning that part of the argument is to be developed.

Figure 8 presents a partial safety argument for the weaning case study, emphasising the role of explainability. The top goal (G0), which states that the ML model meets its safety requirement, is set out in the context of the definition of the ML model and the associated safety requirement – that “prediction of readiness for extubation is timely”.

The top-level argument strategy is decomposition across the stages of the development process. As the paper does not consider data management and analytical/technical validation in detail, the corresponding goal (G1) and goal (G5) are left undeveloped. The amber solutions (S3, S5 and S6) reflect explainable AI methods.

G2: Model selection reflects explainability – this is supported by the analysis in Section 4.1 (S2) which shows that the CNN outperforms other available ML methods, and suitable post-hoc explainable AI methods are available.

G3: Model learning reflects safety requirement – this is directly, but partially, supported by the use of influential instances (S3) which show the rationale for excluding extubation failure patients. Note that other evidence is needed (hence the goal is shown as needing development), e.g. to show appropriateness of parameter selection for training the model.

G4: Model V & V shows safety requirements met – this is broken down into G5: analytical/technical validation, which is undeveloped, G6: performance demonstrated supported by S4 and G7: clinical validation which is decomposed into two sub-goals covering the V&V criteria introduced in Section 3.4.

G6: Performance demonstrated – this is directly supported by the AUC-ROC performance in Figure 2 (S4) which shows the superiority of the CNN performance to other ML models.

G8: Valid clinical association demonstrated – this is supported by the feature importance (S5) although it should be noted that clinical judgement is needed to assess the appropriateness of the feature ranking.

G9: Robustness demonstrated – this is partially supported by the counterfactuals in Table 3 (S6) (this is a partial solution to G9 as the explanations only relate to a single prediction, and there are also other ways to demonstrate robustness).

As noted above, this is incomplete and the evidence presented in this paper should not be taken as sufficient to justify deployment of the CNN model described here in a clinical context. However, it is a valuable part of an overall safety case.

5 Discussion

Safety assurance of ML models in healthcare is an active area of research. Although explainability is often said to help in safety assurance of ML, no work so far has explored the possibilities systematically and identified precisely how explainability can help safety assurance. This paper seeks to fill this gap. We illustrated how explainability can help in safety assurance in the context of the ML development process. Explainability as used here includes explaining the data and the use of explainable AI methods, reflecting the Policy Briefing on explainable AI from the Royal Society [policybriefing]. The role of the different explainable AI methods in the development and operation phases is summarised in Table 4 along with the interested stakeholders. We first extrapolated the safety objectives at the different stages of the ML development process. Then we used a concrete healthcare case study to demonstrate how explainable AI methods can help to meet these safety objectives, particularly in model learning and model V&V. Specifically, we have shown the value of influential instances for model learning, which is of particular interest to ML developers. Further, we have shown the value of feature importance and counterfactuals in model V&V, which is of particular interest to ML developers, regulators and others involved in deployment decisions, see Figure 1.

Phases | Activity | Explainable AI methods | Stakeholders
Development | Data Management | N/A | ML developers; Hospital managers
Development | Model Selection | Trade-off performance & | ML developers
Development | Model Learning | Adversarial examples; Influential instances | ML developers
Development | Model V & V | Global feature importance; Counterfactual explanations | ML developers; Hospital managers
Operation | Decision Support | Local feature importance; Counterfactual explanations | Expert users – clinicians; Decision recipients – patients
Table 4: Role of Explainable AI Methods in the development and operation phases

As indicated earlier, although the use of explainable AI methods can contribute to safety assurance, it is not enough to assure safety by itself. In the rest of this section, we will highlight some relevant complementary methods that also contribute to safety assurance.

First, any safety critical software should be developed in a quality management framework, see for example [imdrf2015software]. For all software, quality management includes configuration control, traceability from requirements to implementation and test, and change control. For SaMD it should also assure the quality of data used for training ML models [kahn2016harmonized]. From a safety assurance point of view, the aim of quality management is to ensure that the evidence produced to support the safety case properly reflects the build state of the system that is to be deployed. Whilst none of this is new, it may be challenging for AI/ML-based SaMD due to the highly iterative nature of the ML development process.

Second, it is important to apply established methods from safety critical software engineering, adapted as necessary for AI/ML-based SaMD. One such method is static analysis, that is analysing the code without executing it, looking for “bugs”, such as division by zero or using the wrong type of data; see [McDermid2020SafetyOA] for an illustration of applying static analysis to ML code in healthcare. It is also standard practice to measure test coverage of the software when undertaking V&V. For conventional software, it is common to use structural coverage, e.g. ensuring that all branches in the code have been executed at least once. The obvious analogy for NNs is neuron coverage [sun2019structural], although there is some debate about whether or not this is an appropriate criterion [li2019structural]. Nonetheless, coverage is significant when considering safety, as assurance is clearly undermined if there are significant parts of the ML model for which we have no test evidence. Consequently, it seems likely that understanding of what are appropriate coverage criteria will improve as experience of using AI/ML-based SaMD increases. In the context of this paper, static analysis seems most readily applicable.
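As a rough illustration of the neuron coverage idea, the sketch below counts the fraction of hidden neurons in a small fully connected ReLU network that are activated above a threshold by at least one test input. The network weights, the inputs and the threshold are hypothetical, and this simple definition is only one of several coverage criteria discussed in the literature.

```python
def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def dense(weights, bias, inputs):
    """One fully connected layer: each row of weights feeds one neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, bias)]


def hidden_activations(layers, inputs):
    """Concatenated ReLU activations of every layer for a single input."""
    activations = []
    current = inputs
    for weights, bias in layers:
        current = relu(dense(weights, bias, current))
        activations.extend(current)
    return activations


def neuron_coverage(layers, test_inputs, threshold=0.0):
    """Fraction of neurons whose activation exceeds the threshold for some test input."""
    if not test_inputs:
        raise ValueError("at least one test input is required")
    covered = None
    for inputs in test_inputs:
        fired = [a > threshold for a in hidden_activations(layers, inputs)]
        covered = fired if covered is None else [c or f for c, f in zip(covered, fired)]
    return sum(covered) / len(covered)


# Hypothetical one-layer network with three neurons and two input features.
layers = [([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]], [0.0, 0.0, 0.0])]
print(neuron_coverage(layers, [[1.0, 0.0]]))
print(neuron_coverage(layers, [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]))
```

A low value indicates parts of the network that the test set never exercises, which is the situation the coverage argument above is concerned with.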

Third, there are assurance methods that address the specific challenges of V&V for AI/ML-based software. We briefly consider two methods that are relevant to deep NNs. It is possible to apply formal methods (mathematical techniques of verification) to deep NNs. For example [huang2017safety] applies Satisfiability Modulo Theories (SMT) solvers to find adversarial examples for NNs used in image analysis. Further, the ideas of concolic testing, which seeks to maximise code coverage, have been applied to deep NNs [sun2019deepconcolic]. This work addresses structural coverage, including neuron coverage, and other properties such as Lipschitz continuity. It uses symbolic approaches to generate inputs to improve test coverage to generate a test suite for a given deep NN and also assists in finding adversarial examples.

Finally, these methods can support regulatory processes, particularly those focusing on AI/ML-based SaMD. Our aim here was not to propose alternatives to regulatory processes, but to identify where explainability could help to provide safety assurance evidence to support those processes.

6 Conclusions

To our knowledge, this is the first systematic attempt to explore the role of explainability in assuring safety of ML, with a particular focus on pre-deployment decision-making. We believe this will be of particular interest to regulators, as it illustrates how to use explainable AI methods to provide evidence to support relevant safety objectives, e.g. for clinical association, articulated by the FDA and IMDRF.

The case study illustrates the practical use of explainable AI methods in safety assurance. Specifically, it illustrates three different explainable AI methods:

  • Influential instances – for showing how to debug the ML model, including helping to define the most appropriate training dataset;

  • Feature importance – for showing valid clinical association;

  • Counterfactual explanations – for showing ML model robustness and the absence of single-point failures.

From a safety assurance perspective, these uses of explainable AI methods contribute most to model learning and validation. The case study also shows how the use of these explainable AI methods feeds into a safety case, e.g. as required by healthcare standards [DCB]. Future work will include further exploration of explainable AI methods and development of further case studies with the aim of refining and validating the approach.

In addition, we believe it would be valuable to consider the role of explainable AI methods in accident and incident investigation for AI/ML-based SaMD. Being able to explain what happened may be crucial in order to learn from experience and to preserve confidence in a system. For example, it might be that counterfactual examples would help in understanding how an adverse event could have been avoided and thus indicate requirements for ML model retraining. This would help in achieving a total product life-cycle approach to managing risks of AI/ML-based SaMD as proposed by the FDA [us2019proposed].

The code for applying various explainable AI methods is available at: Yanjiayork/mechanical_ventilator.


This work is funded by Bradford Teaching Hospitals NHS Foundation Trust and supported by the Assuring Autonomy International Programme at the University of York. The views expressed in this paper are those of the authors and not necessarily those of the NHS, or the Department of Health and Social Care.