Reliability Assessment and Safety Arguments for Machine Learning Components in Assuring Learning-Enabled Autonomous Systems

The increasing use of Machine Learning (ML) components embedded in autonomous systems, so-called Learning-Enabled Systems (LES), has resulted in the pressing need to assure their functional safety. As for traditional functional safety, the emerging consensus within both industry and academia is to use assurance cases for this purpose. Typically, assurance cases support claims of reliability in support of safety, and can be viewed as a structured way of organising arguments and evidence generated from safety analysis and reliability modelling activities. While such assurance activities are traditionally guided by consensus-based standards developed from vast engineering experience, LES pose new challenges in safety-critical applications due to the characteristics and design of ML models. In this article, we first present an overall assurance framework for LES with an emphasis on quantitative aspects, e.g., breaking down system-level safety targets to component-level requirements and supporting claims stated in reliability metrics. We then introduce a novel model-agnostic Reliability Assessment Model (RAM) for ML classifiers that utilises the operational profile and robustness verification evidence. We discuss the model assumptions and the inherent challenges of assessing ML reliability uncovered by our RAM and propose practical solutions. Probabilistic safety arguments at the lower ML component-level are also developed based on the RAM. Finally, to evaluate and demonstrate our methods, we not only conduct experiments on synthetic/benchmark datasets but also demonstrate the scope of our methods with a comprehensive case study on Autonomous Underwater Vehicles in simulation.



1 Introduction

Industry is increasingly adopting AI/ML algorithms to enhance the operational performance, dependability, and lifespan of products and services, i.e., systems with embedded ML-based software components. For such LES, high reliability is essential in safety-related applications to ensure successful operations and regulatory compliance. For instance, several fatalities were caused by the failures of LES built into Uber and Tesla’s cars. IBM’s Watson, the decision-making engine behind the Jeopardy AI success, has been deemed a costly and potentially deadly failure when extended to medical applications like cancer diagnosis. Key industrial foresight reviews have identified that the biggest obstacle to reaping the benefits of ML-powered RAS is the assurance and regulation of their safety and reliability lane_new_2016. Thus, there is an urgent need to develop methods that enable the dependable use of AI/ML in critical applications and, just as importantly, to assess and demonstrate the dependability for certification and regulation.

For traditional systems, safety regulation is guided by well-established standards/policies, and supported by mature development processes and VnV tools/techniques. The situation is different for LES: they are disruptively novel and often treated as a black box with a lack of validated standards/policies BKCF2019, while they require new and advanced analysis for the complex requirements of their safe and reliable function. Such analysis needs to be tailored to fully evaluate the new character of ML alves_considerations_2018; burton_mind_2020; KKB2019, despite some progress made recently huang_survey_2020. This reinforces the need for not only an overall methodology/framework in assuring the whole LES, but also innovations in safety analysis and reliability modelling for ML components, which motivates our work.

In this article, we first propose an overall assurance framework for LES, presented in Claims-Arguments-Evidence (CAE) assurance cases bloomfield_safety_2010. While inspired by bloomfield2021safety, ours places greater emphasis on arguing for quantitative safety requirements. This is because the unique characteristics of ML increase apparent non-determinism johnson_increasing_2018 that explicitly requires probabilistic claims to capture the uncertainties in its assurance zhao_safety_2020; asaadi_quantifying_2020; bloomfield2020assurance. To demonstrate the overall assurance framework as an end-to-end methodology, we also consider important questions on how to derive and validate (quantitative) safety requirements and how to break them down to functionalities of ML components for a given LES. Indeed, there should not be any generic, definitive, or fixed answers to those hard questions for now, since AI/ML is an emerging technology that is still heavily in flux. That said, we propose a promising solution that we believe is the most practical for the moment: we exercise HAZOP (a systematic hazard identification method) swann_twenty_five_1995 and quantitative FTA (a common probabilistic root-cause analysis) lee_fault_1985, and leverage existing regulation principles to validate the acceptable and tolerable safety risk, e.g., GALE to non-AI/ML systems or human performance.

Upon establishing safety/reliability requirements on low-level functionalities of ML components, we build a dedicated RAM. In this article, we mainly focus on assessing the reliability of the classification function of the ML component, extending our initial RAM in zhao_assessing_2021 with more practical considerations for scalability. Our RAM explicitly takes the OP information and robustness evidence into account, because: (i) reliability, as a user-centred property, depends on the end-users’ behaviours littlewood_software_2000, and the OP (quantifying how the software will be operated musa_operational_1993) should therefore be explicitly modelled in the assessment; (ii) ML classifiers are subject to robustness concerns, so a RAM that does not consider robustness evidence is not convincing. To the best of our knowledge, our RAM is the first to consider both the OP and robustness evidence. It is inspired by partition-based testing hamlet_partition_1990; pietrantuono_reliability_2020, operational/statistical testing strigini_guidelines_1997; zhao_assessing_2020 and ML robustness evaluation carlini_towards_2017; webb_statistical_2019. Our RAM is model-agnostic and designed for pre-trained ML models, yielding estimates of, e.g., expected values or confidence bounds on the pmi [1].

[1] This reliability measure is similar to the conventional probability of failure on demand (pfd), but retrofitted for classifiers.

Then, we present a set of safety case templates to support reliability claims [2] stated in pmi based on our new RAM, the “backbone” of the probabilistic safety arguments for ML components. Essentially, the key argument is over the rigour of the four main steps of the RAM: all perspectives of the RAM, including modelling assumptions, hyper-parameter selections, intermediate calculations and final testing results, should be presented, justified and organised in a structured way.

[2] We deal with probabilistic claims in this part, so “reliability” claims are about probabilities of occurrence of failures, and “safety” claims are about failures that are safety-relevant. The two kinds do not require different statistical reasoning, thus we may use the two terms safety and reliability interchangeably when referring to the probabilities of safety-relevant failures.

Finally, we conduct a comprehensive case study based on a simulated AUV that carries out survey and asset inspection missions. The case study in our simulator is both efficient and effective as a first step to demonstrate and validate our methods which, we believe, can be easily transferred to real-world case studies. All simulators, ML models, datasets and experimental results used in this work are publicly available at our project repository, with a video demo at

Summary of Contributions

The key contributions of this work include:

  • An assurance case framework for LES that: (i) emphasises the arguments for quantitative claims on safety and reliability; and (ii) provides an “end-to-end” chain of safety analysis and reliability modelling methods for arguments ranging from the very top safety claim of the whole LES to low-level VnV evidence of ML components.

  • A first RAM evaluating reliability for ML software, leveraging both the OP information and robustness evidence. Moreover, based on the RAM, templates of probabilistic arguments for reliability claims on ML software are developed.

  • Identification of open challenges in building safety arguments for LES and highlighting the inherent difficulties of assessing ML reliability, uncovered by our overall assurance framework and the proposed RAM, respectively. Potential solutions are discussed and mapped onto on-going studies to advance in this research direction.

  • A prototype tool of our RAM and a simulator platform of AUV for underwater missions that are reusable and extendable as a starting point for future research.

Organisation of this Article

After presenting preliminaries in Section 2, we outline our overall assurance framework in Section 3. After that, the RAM is described in detail with a running example in Section 4, followed by its probabilistic safety arguments for ML classification reliability in Section 5. We then present our case study on AUV in Section 6. Related work is summarised in Section 7, while in-depth discussions are provided in Section 8. Finally, we conclude in Section 9 and outline plans for future work.

2 Preliminaries

2.1 Assurance Cases, CAE Notations and CAE Blocks

Assurance cases are developed to support claims in areas such as safety, reliability and security. They are often called by more specific names like security cases knight_importance_2015 and safety cases bishop_methodology_2000. A safety case is a compelling, comprehensive, defensible, and valid justification of the system safety for a given application in a defined operating environment; it is therefore a means to provide the grounds for confidence and to assist decision making in certification bloomfield_safety_2010. For decades, safety cases have been widely used in the European safety community to assure system safety. Moreover, they are mandatory in the regulation for systems used in safety-critical industries in some countries, e.g., in the UK for nuclear energy uk_office_for_nuclear_regulation_purpose_2019. Early research in safety cases has mainly focused on their formulation in terms of claims, arguments, and evidence elements based on fundamental argumentation theories like the Toulmin model s_toulmin_uses_1958. The two most popular notations are CAE bloomfield_safety_2010 and GSN kelly_arguing_1999. In this article, we choose the former to present our assurance case templates.

A summary of the CAE notations is provided in Figure 1. The CAE safety case starts with a top claim, which is then supported through an argument by sub-claims. Sub-claims can be further decomposed until being supported by evidence. A claim may be subject to some context, represented by general purpose other nodes, while assumptions (or warranties) of arguments that need to be explicitly justified form new side-claims. A sub-case repeats a claim presented in another argument module. Notably, the basic concepts of CAE are supported by safety standards like ISO/IEC 15026-2. Readers are referred to bloomfield2020assurance; bloomfield2021safety for more details on all CAE elements.

The CAE framework additionally consists of CAE blocks that provide five common argument fragments and a mechanism for separating inductive and deductive aspects of the argumentation [3]. These were identified by empirical analysis of real-world safety cases bloomfield_building_2014. The five CAE blocks representing the restrictive set of arguments are:

[3] The argument strategy can be either inductive or deductive alves_considerations_2018. For an inductive strategy, additional analysis is required to ensure that residual risks are mitigated.

  • Decomposition: partition some aspect of the claim—“divide and conquer”.

  • Substitution: transform a claim about an object into a claim about an equivalent object.

  • Evidence Incorporation: evidence supports the claim, with emphasis on direct support.

  • Concretion: some aspect of the claim is given a more precise definition.

  • Calculation (or Proof): some value of the claim can be computed or proven.

An illustrative use of CAE blocks is shown in Figure 1, while more detailed descriptions can be found in bloomfield_building_2014; bloomfield2021safety.

Figure 1: Summary of the CAE notations (lhs) and an example of CAE block use (rhs), cited from bloomfield2021safety.

2.2 HAZOP and FTA

HAZOP is a structured and systematic safety analysis technique for risk management, which is used to identify potential hazards for the system in the given operating environment. HAZOP is based on a theory that assumes risk events are caused by deviations from design or operating intentions. Identification of such deviations is facilitated by using sets of “guide words” (e.g., too much, too little and no) as a systematic list of deviation perspectives. It is commonly performed by a multidisciplinary team of experts during brainstorming sessions. HAZOP is a technique originally developed and used in chemical industries. There are studies that successfully apply it to software-based systems swann_twenty_five_1995. Readers will see an illustrative example in later sections, while we refer to crawley2015hazop for more details.

FTA is a quantitative safety analysis technique on how failures propagate through the system, i.e., how component failures lead to system failures. The fundamental concept in FTA is the distillation of system component faults that can lead to a top-level event into a structured diagram (fault tree) using logic gates (e.g., AND, OR, Exclusive-OR and Priority-AND). We show a concrete example of FTA in our case study section, while a full tutorial of developing FTA is out of the scope of this article, and readers are referred to ruijters_fault_2015 for more details.

2.3 OP Based Software Reliability Assessment

The delivered reliability, as a user-centred and probabilistic property, requires modelling the end-users’ behaviours (in the operating environments) and must be formally defined by a quantitative metric littlewood_software_2000. Without loss of generality, we focus on pmi as a generic metric for ML classifiers, where inputs can, e.g., be images acquired by a robot for object recognition.

Definition 1 (pmi).

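We denote the unknown pmi by a variable $\lambda$, which is formally defined as

$$\lambda := \int_{x \in \mathcal{X}} I_{\{x \text{ causes a misclassification}\}}(x)\, Op(x)\, \mathrm{d}x \qquad (1)$$

where $x$ is an input in the input domain $\mathcal{X}$ [4], and $I_{S}(x)$ is an indicator function: it is equal to $1$ when $S$ is true and equal to $0$ otherwise. The function $Op(x)$ returns the probability that $x$ is the next random input.

[4] We assume continuous $\mathcal{X}$ in this article. For discrete $\mathcal{X}$, the integral in Eqn. (1) reduces to a sum and $Op(\cdot)$ becomes a probability mass function.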

Remark 1 (OP).

The OP musa_operational_1993 is a notion used in software engineering to quantify how the software will be operated. Mathematically, the OP is a PDF defined over the whole input domain $\mathcal{X}$.

We highlight this Remark 1, because we will use probability density estimators to approximate the OP from the collected operational dataset in the RAM we develop in Section 4.
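As a minimal sketch of how a probability density estimator can approximate the OP from operational data, the snippet below implements a plain Gaussian kernel density estimate using only the Python standard library. The function name `gaussian_kde`, the isotropic kernel, the fixed `bandwidth` and the sample points are illustrative assumptions, not the estimator used in the RAM.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a density function approximating the OP from operational samples.

    Assumption: an isotropic Gaussian kernel with one fixed bandwidth.
    """
    if not samples or bandwidth <= 0:
        raise ValueError("need at least one sample and a positive bandwidth")
    dim = len(samples[0])
    # Normalising constant of a d-dimensional isotropic Gaussian, times n.
    norm = len(samples) * (bandwidth * math.sqrt(2.0 * math.pi)) ** dim

    def density(x):
        total = 0.0
        for s in samples:
            sq_dist = sum((xi - si) ** 2 for xi, si in zip(x, s))
            total += math.exp(-0.5 * sq_dist / bandwidth ** 2)
        return total / norm

    return density

# Hypothetical operational data: 2D-points clustered mostly in one region.
operational_data = [(0.20, 0.30), (0.25, 0.35), (0.22, 0.28), (0.80, 0.70)]
op = gaussian_kde(operational_data, bandwidth=0.1)
print(op((0.22, 0.30)), op((0.50, 0.95)))
```

The estimated density is higher near the cluster of observed inputs than in regions with no operational data, which is the behaviour an OP approximation needs. In practice the bandwidth would have to be chosen and justified, since it directly affects the resulting reliability estimate.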

By the definition of pmi, successive inputs are assumed to be independent. It is therefore common to use a Bernoulli process as the mathematical abstraction of the failure process, which implies a Binomial likelihood. For traditional software, once the likelihood is established, RAMs for estimating $\lambda$ vary case by case: from the basic MLE to Bayesian estimators tailored for certain scenarios when, e.g., seeing no failures miller_estimating_1992; bishop_toward_2011, inferring ultra-high reliability zhao_assessing_2020, with certain forms of prior knowledge like perfection strigini_software_2013, with vague prior knowledge expressed in imprecise probabilities walter_imprecision_2009; zhao_probabilistic_2019, with uncertain OPs bishop_deriving_2017; pietrantuono_reliability_2020, etc.
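To make the Binomial likelihood concrete, the following sketch contrasts the basic MLE with a textbook conjugate Bayesian bound for failure-free testing. It assumes a uniform Beta(1, 1) prior, which is our illustrative choice and not necessarily the prior used in the cited estimators; the function names are hypothetical.

```python
def mle_pmi(failures, trials):
    """Maximum likelihood estimate of pmi under a Binomial likelihood."""
    if trials <= 0 or not 0 <= failures <= trials:
        raise ValueError("need trials > 0 and 0 <= failures <= trials")
    return failures / trials

def bayes_upper_bound_no_failures(trials, confidence=0.95):
    """Upper credible bound on pmi after `trials` failure-free inputs.

    Assumption: uniform Beta(1, 1) prior. With zero failures the posterior is
    Beta(1, trials + 1), whose CDF is 1 - (1 - p)^(trials + 1); solving
    CDF(p) = confidence for p gives the bound below.
    """
    if trials < 0 or not 0.0 < confidence < 1.0:
        raise ValueError("need trials >= 0 and 0 < confidence < 1")
    return 1.0 - (1.0 - confidence) ** (1.0 / (trials + 1))

print(mle_pmi(3, 1000))
print(bayes_upper_bound_no_failures(2999, confidence=0.95))
```

With zero observed failures the MLE is exactly 0, which is why a Bayesian or confidence-bound treatment is needed before any non-trivial reliability claim can be made. Under these assumptions, roughly 3000 failure-free inputs are required to claim a pmi below $10^{-3}$ with 95% confidence.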

OP based RAMs designed for traditional software fail to consider new characteristics of ML, e.g., a potential lack of robustness and a high-dimensional input space. Specifically, it is quite hard to gather the required prior knowledge when taking into account the new ML characteristics in those Bayesian RAMs. At the same time, frequentist RAMs would require a large sample size to gain enough confidence in the estimates due to the extremely large population size (e.g., the high-dimensional pixel space for images). As an example, the usual accuracy testing of ML classifiers is essentially an MLE estimate against the test set, which has the following problems: (i) it assumes the test set statistically represents the OP, which is rarely the case; (ii) the test set is a very small fraction of the whole input space, thus limited confidence can be claimed in reliability; and (iii) without explicitly considering robustness evidence, the reliability claim for ML is not trustworthy.

2.4 ML Robustness and the $r$-Separation Property

ML is known not to be robust. Robustness requires that the decision of the ML model $M$ is invariant against small perturbations on inputs. That is, all inputs in a region $\eta$ have the same prediction label, where usually the region is a small norm ball (in an $L_p$-norm distance [5]) of radius $\epsilon$ around an input $x$. Inside $\eta$, if an input $x'$ is classified differently to $x$ by $M$, then $x'$ is an AE. Robustness can be defined either as a binary metric (if there exists any AE in $\eta$) or as a probabilistic metric (how likely the event of seeing an AE in $\eta$ is). The former aligns with formal verification, e.g. huang_safety_2017, while the latter is normally used in statistical approaches, e.g. webb_statistical_2019. The former “verification approach” is the binary version of the latter “stochastic approach” [6].

[5] Distance mentioned in this article is defined in $L_\infty$ unless otherwise stated.

[6] Thus, we use the more general term robustness “evaluation” rather than robustness “verification” throughout the article.

Definition 2 (robustness).

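Similar to webb_statistical_2019, we adopt the more general probabilistic definition on the robustness of the model $M$ (in a region $\eta$ and to a target label $y$):

$$R_M(\eta, y) := \int_{x \in \eta} I_{\{M(x) \text{ predicts label } y\}}(x)\, Op(x \mid x \in \eta)\, \mathrm{d}x \qquad (2)$$

where $Op(x \mid x \in \eta)$ is the conditional OP of region $\eta$ (precisely the “input model” used by both webb_statistical_2019 and weng_proven_2019).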

We highlight the following two remarks regarding robustness:

Remark 2 (astuteness).

Reliability assessment only concerns the robustness to the ground truth label, rather than an arbitrary label $y$ in $R_M(\eta, y)$. When $y$ is such a ground truth, robustness becomes astuteness yang_closer_2020, which is also the conditional reliability in the region $\eta$.

Astuteness is a special case of robustness [7]. An extreme example showing why we introduce the concept of astuteness is that a perfectly robust classifier that always outputs “dog” for any given input is unreliable. Thus, robustness evidence cannot directly support reliability claims unless the ground truth label is used in estimating $R_M(\eta, y)$.

[7] Thus, later in this article, we may refer to robustness as astuteness for brevity when it is clear from the context.
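The probabilistic robustness of Definition 2 and the distinction drawn in Remark 2 can be illustrated with a simple Monte Carlo sketch. It assumes that the conditional OP inside the norm ball is uniform and that the ball is an $L_\infty$ ball; both are simplifying assumptions for illustration, and `estimate_robustness` is a hypothetical helper rather than the estimator used by the RAM.

```python
import random

def estimate_robustness(model, center, radius, label, n_samples=1000, seed=0):
    """Monte Carlo estimate of the probability that `model` predicts `label`
    inside the L-infinity ball of the given radius around `center`.

    Assumption: inputs in the ball are drawn uniformly, i.e. the conditional
    OP of the region is taken to be uniform.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        x = [c + rng.uniform(-radius, radius) for c in center]
        if model(x) == label:
            hits += 1
    return hits / n_samples

def always_dog(x):
    """A perfectly robust but useless classifier."""
    return "dog"

center = [0.5, 0.5]
# Robustness to the label the model happens to output is perfect ...
print(estimate_robustness(always_dog, center, 0.1, "dog"))
# ... but astuteness (robustness to the ground truth "cat") is zero.
print(estimate_robustness(always_dog, center, 0.1, "cat"))
```

The same function yields robustness or astuteness depending only on which label is passed in, which is why the ground truth label of the region must be known before robustness evidence can support a reliability claim.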

Remark 3 ($r$-separation).

For real-world image datasets, any data-points with different ground truth are at least distance $2r$ apart in the input space (pixel space), and $r$ is bigger than a norm ball radius commonly used in robustness studies.

The $r$-separation property was first observed by yang_closer_2020: the real-world image datasets studied by the authors imply that $r$ is normally several times bigger than the radius (denoted as $\epsilon$) of norm balls commonly used in robustness studies. Intuitively it says that, although the classification boundary is highly non-linear, there is a minimal distance between two real-world objects of different classes (cf. Figure 2 for a conceptual illustration). Moreover, such a minimal distance is bigger than the usual norm ball size in robustness studies.
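As a rough sketch of how the separation could be checked empirically on a labelled dataset, the code below computes half of the smallest $L_\infty$ distance between any two points with different labels. The brute-force pairwise search, the function names and the toy data are illustrative assumptions; an estimate obtained this way is only as good as the dataset's coverage of the input space.

```python
def linf_distance(a, b):
    """L-infinity distance between two points of equal dimension."""
    return max(abs(x - y) for x, y in zip(a, b))

def estimate_separation(points, labels):
    """Return half of the minimum distance between differently labelled points.

    Under the remark above, points of different classes are at least 2r apart,
    so half of the observed minimum is an empirical estimate of r.
    """
    best = float("inf")
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if labels[i] != labels[j]:
                best = min(best, linf_distance(points[i], points[j]))
    return best / 2.0

# Hypothetical toy dataset with two classes.
points = [(0.1, 0.1), (0.2, 0.1), (0.9, 0.9)]
labels = [0, 0, 1]
r_hat = estimate_separation(points, labels)
print(r_hat)
```

A norm ball radius $\epsilon$ chosen below the value returned here cannot contain two dataset points with different labels, although unseen data may still violate the estimate.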

Figure 2: Illustration of the $r$-separation property.

3 The Overall Assurance Framework

In this section, we present an overall assurance framework for LES (e.g., AUV). The assurance framework is presented as a CAE assurance case template bloomfield_safety_2010, in which we highlight both the main focus of this work—a RAM for the ML component with its probabilistic safety arguments—and all its required supporting analysis to derive the reliability requirements on the low-level ML functionalities.

3.1 Overview of an Assurance Case for LES

Figure 3: Overview of the proposed safety case template for LES, highlighting the main focus and supporting content of the work.

To argue for TLSC1, we refer to the template proposed by (bloomfield2021safety, Chap. 5) as our sub-case SubC1. Essentially, in SubC1, we argue R is: (i) well-defined (e.g., verifiable, consistent, unambiguous, traceable, etc.); (ii) complete, in that it covers all sources (e.g., from hazard analysis and domain-specific safety standards/legislation); and (iii) valid, according to some common risk acceptance criteria/principles in safety regulations of different countries/domains, e.g., ALARP (As Low As Reasonably Practicable). Without repeating the content of bloomfield2021safety, we only highlight the parts directly supporting the main focus of this work (via the procedure in Figure 4), which are hazard identification (SubC2) and derivation of quantitative safety target (SubC3).

Similar to bloomfield2021safety, we use a decomposition CAE-block/argument to support TLC2. However, in addition to the time-split, we also split the claim by its qualitative and quantitative nature, since the main focus of this work, SubC7, concerns the probabilistic reliability modelling of ML components. Further decomposition of the whole system’s quantitative requirements into functionalities of individual components (TLC3) is non-trivial, for which we utilise quantitative FTA. The decomposition requires a side-claim on the sufficiency of the FTA study, TLSC2. A comprehensive development of SubC8 for TLSC2 is out of the scope of this work, while we illustrate the gist and an example of the method in later sections. Finally, we reach the main focus of this work, SubC7, and will develop the full sub-case for it in Section 5.

3.2 Deriving Quantitative Requirements for ML Components

As mentioned, in this work we are mainly developing low-level probabilistic safety arguments, based on the dedicated RAM for ML components developed in Section 4. An inevitable question is how to quantitatively determine the tolerable and acceptable risk levels of the ML components. Normally the answer involves a series of well-established safety analysis methods that systematically break down the whole-system level risk to low-level components, considering the system architecture littlewood_reasoning_2012; zhao_safety_2020. The whole-system level risk, in turn, is determined on a case by case basis through the application of principles and criteria required by the safety regulations extant in the different countries/domains. To align with this best practice, we propose the procedure articulated in Figure 4, whose major steps correspond to the supporting sub-cases SubC2, SubC3 and SubC8.

Figure 4: The workflow of combining HAZOP and quantitative FTA to derive probabilities of basic-events of components.

In Figure 4, for the given LES, we first identify a set of safety properties that are desirable to stakeholders. Then, a HAZOP analysis is conducted, based on deviations of those properties, to systematically identify hazards (and their causes, consequences, and mitigations). New safety properties may be introduced by the mitigations identified by HAZOP, thus HAZOP is conducted in an iterative manner that forms the first loop in Figure 4.

Then, we leverage the HAZOP results to do hazard scenario modelling, inspired by guo_extended_2015, so that we may combine HAZOP and FTA later on. Usually, as also noted in guo_extended_2015, a property deviation can have several causes and different consequences in HAZOP analysis. It is complicated and difficult to directly convert HAZOP results into fault trees. Thus, hazard scenario modelling is needed to explicitly link the initial events (causes) to the final events (consequences) with a chain of intermediate events. Such event-chains facilitate the construction of fault trees, specifically in three steps:

  • The initial events (causes) may or may not be further decomposed at even lower-level sub-functionalities of components to determine the root causes, which are used as basic events (BE) in FTA. Thus, BEs are typically failure events of software/hardware components, e.g., different types of misclassifications, failures in different modes of a propeller.

  • Adding a specific logic gate among all intermediate events (IE) on the same level, which models how failures are propagated, tolerated and/or compounded throughout the system architecture.

  • Final events (consequences) are used as top events (TE) of the FTA. In other words, TEs are violations of system-level safety properties.

Upon establishing the fault trees, conventional quantitative FTA can be performed to propagate probabilities of BEs to the TE probability, or, reversely, to allocate/break-down TE probability to BEs. What-if calculations and sensitivity analysis are expected to find the most practical solution of BE probabilities that makes the required TE risk tolerable. Then the practical solution for the BE associated with the ML component of our interest becomes our target reliability claims for which we develop probabilistic safety arguments. Notably, the ML component may need several rounds of retraining/fine-tuning to achieve the required level of reliability. This forms part [8] of the second iterative loop in Figure 4. We refer readers to zhao_detecting_2021 for a more detailed description on this debug-retrain-assess loop for ML software.

[8] Other non-ML components may be updated as well to jointly make the whole-system risk tolerable.
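The forward propagation of BE probabilities to the TE can be sketched for the two most common gates. The example assumes statistically independent basic events, each appearing only once in the tree; the tree structure, event names and probabilities are hypothetical and are not taken from the case study.

```python
from math import prod

def top_event_probability(node, basic_events):
    """Recursively propagate basic-event probabilities through a fault tree.

    A node is either a basic-event name (str) or a pair (gate, children)
    with gate in {"AND", "OR"}. Assumption: all basic events are independent
    and no basic event is shared between branches.
    """
    if isinstance(node, str):
        return basic_events[node]
    gate, children = node
    probs = [top_event_probability(child, basic_events) for child in children]
    if gate == "AND":
        # All children must fail.
        return prod(probs)
    if gate == "OR":
        # At least one child fails.
        return 1.0 - prod(1.0 - p for p in probs)
    raise ValueError(f"unsupported gate: {gate}")

# Hypothetical fault tree for illustration only.
tree = ("OR", [("AND", ["misclassification", "backup_sensor_failure"]),
               "propeller_failure"])
basic_events = {
    "misclassification": 0.01,
    "backup_sensor_failure": 0.1,
    "propeller_failure": 0.001,
}
print(top_event_probability(tree, basic_events))
```

A what-if analysis then amounts to re-running the propagation with different BE probabilities (for instance, a lower value for the hypothetical `misclassification` event) until the TE probability meets the system-level target. If a basic event is shared between branches, the independence assumption fails and a cut-set based calculation is needed instead.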

Finally, the problem boils down to (i) how to derive the system-level quantitative safety target, i.e., assigning probabilities for those TEs of the fault trees; and (ii) how to demonstrate that the component-level reliability is satisfied, i.e., assessing the BE probabilities for components based on evidence. We address the second question in the next section, while the first question is essentially “how safe is safe enough?”, for which the general answer depends on the existing regulation/certification principles/standards of different countries and industry domains. Unfortunately, existing safety standards cannot be applied to LES, and revisions are still ongoing. Therefore, we currently do not have a commonly acknowledged practice that can be easily applied to certify or regulate LES BKCF2019; klas2021using. That said, emerging studies on assuring/assessing the safety and reliability of AI and autonomous systems have borrowed ideas from existing regulation principles on risk acceptability and tolerability, e.g.:

  • ALARP (As Low As Reasonably Practicable): ALARP states that the residual risk after the application of safety measures should be as low as reasonably practicable. The notion of being reasonably practicable relates to the cost and level of effort to reduce risk further. It originally arises from UK legislation and is now applied in many domains like nuclear energy.

  • GALE (Globally At Least Equivalent): GALE is a principle required by French law for railway safety, which indicates that a new technical system shall be at least as safe as comparable existing ones.

  • SE (Substantially Equivalent): similar to GALE; new medical devices in the US must be demonstrated to be substantially equivalent to a device already on the market. This is required by the U.S. Food & Drug Administration (FDA).

  • MEM (Minimum Endogenous Mortality): MEM states that a new system should not lead to a significant increase in the risk exposure for a population with the lowest endogenous mortality. For instance, the rate of natural deaths is a reference point for acceptability.

While a complete list of all principles and comparisons between them are beyond the scope of this work, we believe that the common trend is that, for many LES, a promising way of determining the system-level quantitative safety target is to argue the acceptable/tolerable risk relative to average human performance. For instance, self-driving cars’ targets of being as safe as, or two orders of magnitude safer than, human drivers (in terms of metrics like fatalities per mile) are studied in kalra_driving_2016; zhao_assessing_2020; liu_how_2019. In picardi_pattern_2019, human doctors’ performance is used as the benchmark in arguing the safety of ML-based medical diagnosis systems.

In summary, we are only presenting the essential steps of combining HAZOP and quantitative FTA via hazard scenario modelling to derive component-level reliability requirements from whole system-level safety targets, while each of those steps with concrete examples can be found in Section 6 as part of the AUV case study.

4 Modelling the Reliability of ML Classifiers

4.1 A Running Example of a Synthetic Dataset

To better demonstrate our RAM, we take the Challenge of AI Dependability Assessment raised by Siemens Mobility as a running example. The challenge is first to train an ML model to classify a dataset generated on the unit square according to some unknown distribution (essentially the unknown OP). The collected data-points (training set) are shown in Figure 5 (lhs), in which each point is a tuple of two numbers between 0 and 1 (thus called a “2D-point”). We then need to build a RAM to claim an upper bound on the probability that the next random point is misclassified, i.e., the pmi. If the 2D-points represent traffic lights, then we have 2 types of misclassifications: safety-critical ones, when a red data-point is labelled green, and performance-related ones otherwise. For brevity, we only focus on misclassifications in general here, while our RAM can cope with sub-types of misclassifications.

Figure 5: The 2D-point dataset (lhs), and its approximated OP (rhs).

4.2 The Proposed RAM

Principles and Main Steps of the RAM

Inspired by pietrantuono_reliability_2020, our RAM first partitions the input domain into $m$ small cells [10], subject to the $r$-separation property. Then, for each cell $c_i$ (and its ground truth label $y_i$), we estimate:

$$\lambda_i := 1 - R_M(c_i, y_i) \quad \text{and} \quad Op_i := \int_{x \in c_i} Op(x)\, \mathrm{d}x \qquad (3)$$

which are the unastuteness and pooled OP of the cell $c_i$ respectively; we introduce estimators for both later. Eqn. (1) can then be written as the weighted sum of the cell-wise unastuteness (i.e., the conditional pmi of each cell [11]), where the weights are the pooled OP of the cells:

$$\lambda = \sum_{i=1}^{m} Op_i\, \lambda_i \qquad (4)$$

Eqn. (4) captures the essence of our RAM: it shows clearly how we incorporate the OP information and the robustness evidence to claim reliability. The problem is thus reduced to: (i) how to obtain the estimates of the $\lambda_i$s and $Op_i$s and (ii) how to measure and propagate the trust in the estimates. These two questions are challenging. To name a few of the challenges for the first question: estimating $\lambda_i$ requires determining the ground truth label of cell $c_i$; and estimating the $Op_i$s may require a large amount of operational data. For the second question, the fact that all estimators are imperfect entails that they need a measure of trust (e.g., the variance of a point estimate), which may not be easy to derive.

[10] We use the term “cell” to highlight the partition that yields exhaustive and mutually exclusive regions of the input space, which is essentially a norm ball in $L_\infty$. Thus, we use the terms “cell” and “norm ball” interchangeably in this article when the emphasis is clear from the context.

[11] We use “cell unastuteness” and “cell pmi” interchangeably later.

In what follows, by referring to the running example, we proceed in four main steps: (i) partition the input space into cells; (ii) approximate the OP of cells (the $Op_i$s); (iii) evaluate the unastuteness of these cells (the $\lambda_i$s); and (iv) “assemble” all cell-wise estimates for the pmi $\lambda$ in a way that is informed by the uncertainty.

Step 1: Partition of the Input Domain

As per Remark 2, the astuteness evaluation of a cell requires its ground truth label. To leverage the $r$-separation property and Assumption 3, we partition the input space by choosing a cell radius $\epsilon$ so that $\epsilon < r$. Although we concur with Remark 3 (first observed by yang_closer_2020) and believe that there should exist an $r$-stable ground truth (which means that the ground truth is stable in such a cell) for any real-world ML classification application, it is hard to estimate such an $r$ (we denote the estimate by $\hat{r}$) and the best we can do is to assume:

Assumption 1.

There is an $r$-stable ground truth (as a corollary of Remark 3) for any real-world classification problem, and the parameter $r$ can be sufficiently estimated from the existing dataset.

That said, in the running example, we get $\hat{r} = 0.004013$ by iteratively calculating the minimum distance between data-points with different labels. Then we choose a cell radius $\epsilon$ (footnote 12: we use the term “radius” for cell size defined in $L_\infty$, which happens to be the side length of the square cell of the 2D running example) that is smaller than $\hat{r}$; we choose $\epsilon = 0.004$. With this value, we partition the unit square into $250 \times 250$ cells.
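As a minimal sketch of Step 1 (an illustration, not the implementation used in the experiments), the following Python estimates the separation distance as the smallest distance between any two differently labelled points and maps a 2D-point to its cell. The function names, the brute-force pairwise search, the use of the $L_\infty$ distance and the toy data are all assumptions made for this example.

```python
from itertools import combinations

def min_separation(points, labels):
    """Smallest L-infinity distance between two points with different labels."""
    best = float("inf")
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        if lp != lq:
            dist = max(abs(a - b) for a, b in zip(p, q))
            best = min(best, dist)
    return best

def cell_index(point, radius):
    """Index of the square cell (side length `radius`) that contains `point`
    in the unit square; points on the upper edge go to the last cell."""
    n_side = round(1.0 / radius)
    return tuple(min(int(coord / radius), n_side - 1) for coord in point)

# Toy data (hypothetical): two red and two green 2D-points.
points = [(0.10, 0.10), (0.12, 0.11), (0.50, 0.52), (0.51, 0.50)]
labels = ["red", "red", "green", "green"]
r_hat = min_separation(points, labels)  # estimated separation distance
radius = 0.004                          # cell size, must be smaller than r_hat
cells_per_side = round(1.0 / radius)    # number of cells along each axis
```

The pairwise search is quadratic in the number of points, which is acceptable for a small 2D dataset; a spatial index would be a natural replacement for larger ones.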

Step 2: Cell OP Approximation

Given a dataset $(X, Y)$, we estimate the pooled OP of cell $c_i$ to get $\mathbb{E}[Op_i]$ and $\mathbb{V}[Op_i]$. We use the well-established KDE to fit a density $\widehat{Op}(x)$ that approximates the OP.

Assumption 2.

The existing dataset is randomly sampled from the OP, thus statistically represents the OP.

This assumption may not hold in practice: training data is normally collected in a balanced way, since the ML model is expected to perform well in all categories of inputs, especially when the OP is unknown at the time of training and/or expected to change in future. Although our model can easily relax this assumption (cf. Section 8), we adopt it for brevity in demonstrating the running example.

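Given a set of (unlabelled) data-points $(X_1, \dots, X_n)$ from the existing dataset $(X, Y)$, KDE then yields

$$\widehat{Op}(x) = \frac{1}{n h^{d}} \sum_{j=1}^{n} K\!\left(\frac{x - X_j}{h}\right), \tag{5}$$

where $K$ is the kernel function (e.g. Gaussian or exponential kernels), $d$ is the input dimension ($d = 2$ in the running example), and $h > 0$ is a smoothing parameter, called the bandwidth, cf. silverman1986density for guidelines on tuning $h$. The approximated OP (footnote 13: in this case, the KDE uses a Gaussian kernel and a bandwidth $h$ optimised by cross-validated grid-search bergstra_random_2012) is shown in Figure 5-rhs.

Since our cells are small and all of equal size, instead of calculating $\int_{x \in c_i} \widehat{Op}(x)\,\mathrm{d}x$, we may approximate $Op_i$ as

$$\widehat{Op}_i = \widehat{Op}(x_{c_i})\, v_c, \tag{6}$$

where $\widehat{Op}(x_{c_i})$ is the probability density at the cell’s central point $x_{c_i}$, and $v_c$ is the constant cell volume ($0.004^2 = 0.000016$ in the running example).

Now if we introduce new variables $W_j = \frac{1}{h^{d}} K\!\left(\frac{x_{c_i} - X_j}{h}\right)$, the KDE evaluated at $x_{c_i}$ is actually the sample mean of $W_1, \dots, W_n$. Then by invoking the CLT, we have $\widehat{Op}(x_{c_i}) \sim \mathcal{N}(\mu_W, \sigma_W^2 / n)$, where the mean is exactly the value from Eqn. (5), while the variance of $\widehat{Op}(x_{c_i})$ is a known result of:

$$\mathbb{V}\big[\widehat{Op}(x_{c_i})\big] = \frac{Op(x_{c_i}) \int K^2(u)\,\mathrm{d}u}{n h^{d}} + o\!\left(\frac{1}{n h^{d}}\right) \simeq \hat{\sigma}_B^2(x_{c_i}), \tag{7}$$

where the last step of Eqn. (7) says that the variance can be approximated using a bootstrap variance $\hat{\sigma}_B^2(x_{c_i})$ chen2017tutorial (cf. Appendix A for details).

Upon establishing Eqn.s (5) and (7), together with Eqn. (6), we know for a given cell $c_i$ (and its central point $x_{c_i}$):

$$\mathbb{E}[Op_i] = v_c\, \widehat{Op}(x_{c_i}), \qquad \mathbb{V}[Op_i] = v_c^2\, \hat{\sigma}_B^2(x_{c_i}), \tag{8}$$

which are the OP estimates of this cell.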
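The cell OP approximation of Step 2 can be sketched in plain Python as follows. This is an illustrative sketch only: it assumes an isotropic Gaussian kernel, a hand-picked bandwidth `h = 0.2`, a simple non-parametric bootstrap with 200 resamples, and random toy data; none of these values are taken from the experiments.

```python
import math
import random

def kde_density(x, data, h):
    """Gaussian-kernel density estimate at point x (any dimension d)."""
    d = len(x)
    norm = (2.0 * math.pi) ** (d / 2.0) * h ** d
    total = 0.0
    for p in data:
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, p))
        total += math.exp(-sq_dist / (2.0 * h * h))
    return total / (len(data) * norm)

def cell_op_estimate(centre, data, h, cell_volume, n_boot=200, seed=0):
    """Return (mean, variance) of the pooled OP of the cell centred at `centre`.

    The mean is density-at-centre times cell volume; the variance is the
    bootstrap variance of the density at the centre times the volume squared.
    """
    rng = random.Random(seed)
    density = kde_density(centre, data, h)
    boots = []
    for _ in range(n_boot):
        resample = [rng.choice(data) for _ in data]  # resample with replacement
        boots.append(kde_density(centre, resample, h))
    mean_b = sum(boots) / n_boot
    var_b = sum((b - mean_b) ** 2 for b in boots) / (n_boot - 1)
    return cell_volume * density, cell_volume ** 2 * var_b

# Toy usage (hypothetical data): 100 random 2D-points, one cell at the centre.
rng = random.Random(1)
data = [(rng.random(), rng.random()) for _ in range(100)]
op_mean, op_var = cell_op_estimate((0.5, 0.5), data, h=0.2, cell_volume=0.004 ** 2)
```

In practice a library KDE with a cross-validated bandwidth would replace `kde_density`; the hand-written version is used here only to keep the sketch self-contained.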

Step 3: Cell Astuteness Evaluation

As a corollary of Remark 3 and Assumption 1, we may confidently assume:

Assumption 3.

If the radius of $c_i$ is smaller than $\hat{r}$, all data-points in the cell $c_i$ share a single ground truth label.

Now, to determine such a ground truth label of a cell $c_i$, we can classify our cells into three types:

  • Normal cells: a normal cell contains data-points from the existing dataset. The data-points in a single cell share the same ground truth label, which is then taken as the ground truth label of the cell.

  • Empty cells: a cell is “empty” in the sense that it contains no data-points from the existing dataset of observed points. Some of the empty cells will eventually become non-empty as more operational data is collected in future, while most of them will remain empty forever: once cells are sufficiently small, only a small share of cells will refer to physically plausible images, and even fewer are possible in a given application. For simplicity, we do not further distinguish these two types of empty cells in this paper.

    Due to the lack of data, it is hard to determine an empty cell’s ground truth. For now, we do voting based on labels predicted (by the ML model) for random samples from the cell, making the following assumption.

    Assumption 4.

    The accuracy of the ML model is better than a classifier doing random classifications in any given cell.

    This assumption essentially relates to the oracle problem of ML testing, for which we believe that recent efforts (e.g. guerriero_reliability_2020) and future research may relax it.

  • Cross-boundary cells: our estimate of $r$ based on the existing dataset is normally imperfect, e.g. due to noise in the dataset or because the dataset is not large enough. Thus, we may still observe data-points with different labels in a single cell (especially when new operational data with labels is collected). Such cells cross the classification boundary. If our estimate of $r$ is sufficiently accurate, they should be very rare. Without the need to determine the ground truth label of a cross-boundary cell, we simply and conservatively set the cell unastuteness to 1.
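The three cell types above suggest a simple decision rule for a cell's ground truth label. The sketch below is one possible reading of that rule; the function name, the return convention and the plain majority vote over model predictions are assumptions made for illustration.

```python
from collections import Counter

def cell_ground_truth(cell_labels, model_votes):
    """Return (cell_type, label) for one cell.

    cell_labels: labels of the dataset points that fall inside the cell.
    model_votes: labels predicted by the ML model for random samples drawn
                 from the cell (only used when the cell is empty, and must
                 then be non-empty).
    """
    distinct = set(cell_labels)
    if len(distinct) > 1:
        # Cross-boundary cell: no label is needed because the cell
        # unastuteness is conservatively set to 1.
        return "cross-boundary", None
    if len(distinct) == 1:
        # Normal cell: all data-points agree on the label.
        return "normal", next(iter(distinct))
    # Empty cell: majority vote over the model's own predictions (Assumption 4).
    label, _count = Counter(model_votes).most_common(1)[0]
    return "empty", label
```

For example, `cell_ground_truth([], ["green", "green", "red"])` treats the cell as empty and returns the majority label `"green"`.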

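So far, the problem is reduced to: given a normal or empty cell $c_i$ with the known ground truth label $y_i$, evaluate the misclassification probability upon a random input $x \in c_i$, $\mathbb{E}[\lambda_i]$, and its variance $\mathbb{V}[\lambda_i]$. This is essentially a statistical problem that has been studied in webb_statistical_2019 using Multilevel Splitting Sampling, while we use the SMC method for brevity in the running example:

$$\lambda_i = \int_{x \in c_i} I_{\{\mathcal{M}(x) \neq y_i\}}(x)\, Op(x \mid x \in c_i)\,\mathrm{d}x \approx \frac{1}{n} \sum_{j=1}^{n} I_{\{\mathcal{M}(x_j) \neq y_i\}}, \tag{9}$$

where $\mathcal{M}$ denotes the ML model, $I$ is the indicator function, and $x_1, \dots, x_n$ are $n$ inputs sampled from the cell. The CLT tells us $\frac{1}{n} \sum_{j=1}^{n} I_{\{\mathcal{M}(x_j) \neq y_i\}} \sim \mathcal{N}(\mu, \sigma^2 / n)$ when $n$ is large, where $\mu$ and $\sigma^2$ are the population mean and variance of $I_{\{\mathcal{M}(x_j) \neq y_i\}}$. They can be approximated with the sample mean $\hat{\mu}_n$ and sample variance $\hat{\sigma}_n^2$, respectively. Finally, we get

$$\mathbb{E}[\lambda_i] = \hat{\mu}_n = \frac{1}{n} \sum_{j=1}^{n} I_{\{\mathcal{M}(x_j) \neq y_i\}}, \qquad \mathbb{V}[\lambda_i] = \frac{\hat{\sigma}_n^2}{n} = \frac{1}{(n-1)\,n} \sum_{j=1}^{n} \left( I_{\{\mathcal{M}(x_j) \neq y_i\}} - \hat{\mu}_n \right)^2. \tag{10}$$

Notably, to solve the above statistical problem with sampling methods, we need to assume how the inputs in the cell are distributed, i.e., a distribution for the conditional OP $Op(x \mid x \in c_i)$. Without loss of generality, we assume: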

Assumption 5.

The inputs in a small region like a cell are uniformly distributed.

This assumption is not uncommon (e.g., it is made in webb_statistical_2019; weng_proven_2019) and can be easily replaced by other distributions, provided there is supporting evidence for such a change.
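Under the uniform-input assumption, the Monte Carlo evaluation of a cell can be sketched as below. This is a hedged illustration: the function name, the sample size, the toy classifier and the treatment of the cell as a square whose side length equals the "radius" are assumptions for this example.

```python
import random

def cell_unastuteness(model, centre, radius, true_label, n_samples=1000, seed=0):
    """Simple Monte Carlo estimate of a cell's misclassification probability.

    Inputs are drawn uniformly from the square cell of side length `radius`
    centred at `centre` (Assumption 5). Returns the sample mean and the
    variance of that mean (sample variance divided by n_samples).
    """
    rng = random.Random(seed)
    half = radius / 2.0
    errors = []
    for _ in range(n_samples):
        x = tuple(c + rng.uniform(-half, half) for c in centre)
        errors.append(1.0 if model(x) != true_label else 0.0)
    mean = sum(errors) / n_samples
    sample_var = sum((e - mean) ** 2 for e in errors) / (n_samples - 1)
    return mean, sample_var / n_samples

# Hypothetical classifier: "green" to the right of x = 0.5, "red" otherwise.
def toy_model(x):
    return "green" if x[0] > 0.5 else "red"

lam_mean, lam_var = cell_unastuteness(toy_model, (0.3, 0.3), 0.004, "red")
```

For a cell far from the decision boundary, as in the usage line, every sample is classified correctly and both returned values are zero; this is exactly the rare-event regime in which plain Monte Carlo becomes uninformative for robust models.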

Step 4: Assembling of the Cell-Wise Estimates

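Eqn. (4) represents an ideal case in which we know those $\lambda_i$s and $Op_i$s with certainty. In practice, we can only estimate them with imperfect estimators yielding, e.g., a point estimate with variance capturing the measure of trust (footnote 14: this aligns with the traditional idea of using FTA, and hence the assurance arguments around it, for future reliability assessment). To assemble the estimates of $\lambda_i$s and $Op_i$s to get the estimates on $\lambda$, and also to propagate the confidence in those estimates, we assume:

Assumption 6.

All $\lambda_i$s and $Op_i$s are independent unknown variables under estimation.

Then, the estimate of $\lambda$ and its variance are:

$$\mathbb{E}[\lambda] = \sum_{i=1}^{m} \mathbb{E}[\lambda_i]\, \mathbb{E}[Op_i], \tag{11}$$

$$\mathbb{V}[\lambda] = \sum_{i=1}^{m} \Big( \mathbb{E}[\lambda_i]^2\, \mathbb{V}[Op_i] + \mathbb{E}[Op_i]^2\, \mathbb{V}[\lambda_i] + \mathbb{V}[\lambda_i]\, \mathbb{V}[Op_i] \Big). \tag{12}$$

Note that, for the variance, the covariance terms are dropped due to the independence assumption.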

Depending on the specific estimators adopted, certain parametric families of the distribution of $\lambda$ can be assumed, from which any quantile of interest (e.g., 95%) can be derived as our confidence bound in reliability. For the running example, we might assume $\lambda \sim \mathcal{N}(\mathbb{E}[\lambda], \mathbb{V}[\lambda])$ as an approximation by invoking the (generalised) CLT (footnote 15: assuming $\lambda_i$s and $Op_i$s are all normally and independently but not identically distributed, the product of two normal variables is approximately normal while the sum of normal variables is exactly normal, thus the variable $\lambda$ is also approximated as being normally distributed, especially when the number of sum terms is large). Then, an upper bound with $1 - \alpha$ confidence is

$$Ub_{1-\alpha} = \mathbb{E}[\lambda] + z_{1-\alpha} \sqrt{\mathbb{V}[\lambda]}, \tag{13}$$

where $\Pr(Z \leq z_{1-\alpha}) = 1 - \alpha$, and $Z \sim \mathcal{N}(0, 1)$ is a standard normal random variable.
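The assembling step can be sketched as follows, assuming each cell contributes a mean and a variance for both its unastuteness and its OP, that all estimates are independent (Assumption 6), and that the assembled estimate is approximately normal. The function name and the tuple layout are illustrative assumptions.

```python
from statistics import NormalDist

def assemble(cells, confidence=0.975):
    """cells: iterable of (lam_mean, lam_var, op_mean, op_var), one per cell.

    Returns (mean, variance, upper bound) of the pmi, using the variance of a
    product of two independent random variables for each cell.
    """
    mean = 0.0
    var = 0.0
    for lam_m, lam_v, op_m, op_v in cells:
        mean += lam_m * op_m
        var += lam_m ** 2 * op_v + op_m ** 2 * lam_v + lam_v * op_v
    z = NormalDist().inv_cdf(confidence)  # standard normal quantile
    return mean, var, mean + z * var ** 0.5

# Toy usage (hypothetical numbers): two cells with equal OP and no uncertainty.
pmi_mean, pmi_var, pmi_ub = assemble([(0.1, 0.0, 0.5, 0.0), (0.3, 0.0, 0.5, 0.0)])
```

With zero variances the upper bound collapses onto the mean; any non-zero cell-wise variance widens the bound by the corresponding normal quantile.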

4.3 Extension to High-Dimensional Dataset

In order to better convey the principles and main steps of our proposed RAM, we have demonstrated a “low-dimensional” version of our RAM, which is tailored for the running example (a synthetic 2D-dataset). However, real-world applications normally involve high-dimensional data like images, exposing the presented “low-dimensional” RAM to scalability challenges. In this section, we investigate how to extend our RAM for high-dimensional data, and take a few practical solutions to tackle the scalability issues raised by “the curse of dimensionality”.

Approximating the OP in the Latent Feature Space Instead of the Input Pixel Space

The number of cells yielded by the previously discussed way of partitioning the input domain (pixel space) is exponential in the dimensionality of data. Thus, it is hard to accurately approximate the OP due to the relatively sparse data collected: the number of cells is usually significantly larger than the number of observations made. However, for real-world data (say an image), what really determines the label is its features rather than the pixels. Thus, we envisage some latent space, e.g. compressed by VAE, that captures only the feature-wise information; this latent space can be explored for high-dimensional data. That is, instead of approximating the OP in the input pixel space, we (i) first encode/project each collected data-point into the compressed latent space, reducing its dimensionality, (ii) then fit a “latent space OP” with KDE based on the compressed dataset, and (iii) finally “map” data-points (paired with the learnt OP) in the latent space back to the input space.

Remark 4 (mapping between feature and pixel spaces).

Depending on which data compression technique we use and how the “decoder” works, the “map” action may vary case by case. For the VAE adopted in our work, we decode one point from the latent space as a “clean” image (with only feature-wise information), and then add perturbations to generate a norm ball (with a size determined by the $r$-separation distance, cf. Remark 3) in the input pixel space.
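The second half of this mapping, generating a norm ball around a decoded “clean” image, can be illustrated with the sketch below. It is a simplified stand-in: the VAE decoder is not modelled (the clean image is simply given as a flat list of pixel intensities), pixel values are assumed to lie in [0, 1], and perturbations are drawn uniformly per pixel and then clipped.

```python
import random

def sample_norm_ball(clean_image, radius, n_samples, seed=0):
    """Draw perturbed copies of `clean_image` inside an L-infinity ball.

    clean_image: flat list of pixel intensities in [0, 1] (assumed to be the
                 decoder output for one latent point).
    radius:      maximum per-pixel perturbation.
    Perturbed pixels are clipped back to [0, 1], so every sample stays within
    `radius` of the clean image in the L-infinity norm.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        perturbed = [min(1.0, max(0.0, p + rng.uniform(-radius, radius)))
                     for p in clean_image]
        samples.append(perturbed)
    return samples

# Toy usage (hypothetical 3-pixel "image").
ball_samples = sample_norm_ball([0.0, 0.5, 1.0], radius=0.1, n_samples=5)
```

Samples produced this way can then be fed to a robustness estimator in place of the cell samples used in the low-dimensional setting.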

Applying Efficient Multivariate KDE for Cell OP Approximation

We may encounter technical challenges when fitting the PDF from high-dimensional datasets. There are two known major challenges when applying multivariate KDE to high-dimensional data: i) the choice of the bandwidth, which in the multivariate case is a covariance-like matrix and is the factor that most affects the estimation accuracy; and ii) scalability issues in terms of storing intermediate data structures (e.g., data-points in hash-tables) and the query time needed to estimate the density at a given input. For the first challenge, the bandwidth matrix can be chosen using rules of thumb silverman1986density; scott2015multivariate or cross-validation bergstra_random_2012. For the second, there is dedicated research on improving the efficiency of multivariate KDE, e.g., backurs2019space presents a framework for multivariate KDE with provably sub-linear query time, linear space, and pre-processing time that is linear in the dimension.

Applying Efficient Estimators for Cell Robustness

We have demonstrated the use of SMC to evaluate cell robustness in our running example. It is known that SMC is not computationally efficient for estimating rare events, such as AEs in the high-dimensional space of a robust ML model. We therefore need more advanced and efficient sampling approaches that are designed for rare events. We notice that the Adaptive Multi-level Splitting method has been retrofitted in webb_statistical_2019 to statistically estimate the model’s local robustness, which can be (and indeed has been) applied in our later experiments for image datasets. In addition to statistical approaches, formal-method-based verification techniques might also be applied to assess a cell’s pmi, e.g., huang_safety_2017. They provide formal guarantees on whether or not the ML model will misclassify any input inside a small region. Such a “robust region” proved by formal methods is normally smaller than our cells, in which case the $\lambda_i$ can be conservatively set as one minus the proportion of cell $c_i$ covered by the robust region (under Assumption 5).

Assembling a Limited Number of Cell-Wise Estimates with Informed Uncertainty

The number of cells yielded by the current way of partitioning the input domain is exponential in the dimensionality of data, thus it is impossible to explore all cells for high-dimensional data as we did for the running example. We may have to limit the number of cells under robustness evaluation due to the limited budget in practice. Consequently, in the final “assembling” step of our RAM, we can only assemble a limited number of cells, say $k$, instead of all $m$ cells. In this case, we refer to the estimator designed for weighted average based on samples bevington_data_1993. Specifically, we proceed as follows:

  • Based on the collected dataset with $n$ data-points, the OP is approximated in a latent space, which is compressed by VAE. Then we may obtain a set of $n$ norm balls (paired with their OP) after mapping the compressed dataset to the input space (cf. Remark 4) as the sample frame (footnote 16: the population, in contrast, is the set of non-overlapping norm balls covering the whole input space, i.e. the cells mentioned in the “low-dimensional” version of the RAM).

  • We define a weight $w_i$ for each of the $n$ norm balls according to its approximated OP, $w_i := \mathbb{E}[Op_i]$.

  • Given a budget under which we can only evaluate the robustness of $k$ norm balls, $k$ samples are randomly selected (with replacement) and fed into the robustness estimator to get $\mathbb{E}[\lambda_i]$.

  • We may invoke the unbiased estimator for weighted average (bevington_data_1993, Chapter 4) as

    $$\mathbb{E}[\lambda] = \frac{\sum_{i=1}^{k} w_i\, \mathbb{E}[\lambda_i]}{\sum_{i=1}^{k} w_i} \tag{14}$$

    and

    $$\mathbb{V}[\lambda] = \frac{1}{k-1} \left( \frac{\sum_{i=1}^{k} w_i\, (\mathbb{E}[\lambda_i])^2}{\sum_{i=1}^{k} w_i} - (\mathbb{E}[\lambda])^2 \right). \tag{15}$$

    Moreover, a confidence upper bound of interest can be derived from Eqn. (13).

Note that there are no variance terms of $\lambda_i$ and $Op_i$ in Eqn.s (14) and (15), implying the following assumption:

Assumption 7.

The uncertainty informed by Eqn. (15) is sourced from the sampling of $k$ norm balls, which is assumed to be the major source of uncertainty. This makes the uncertainties contributed by the robustness and OP estimators (i.e. the variance terms of $\lambda_i$ and $Op_i$) negligible.
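A minimal sketch of a weighted-average estimator over a limited number of sampled norm balls is given below. It assumes the common textbook form in which the mean is the weight-normalised sum and the variance of the mean is the weighted second moment minus the squared mean, divided by the number of samples minus one; the exact form of Eqn.s (14) and (15) should be checked against the cited reference. The function name and the toy numbers are hypothetical.

```python
def weighted_pmi(weights, lam_means):
    """Weighted mean and variance of the mean over k sampled norm balls.

    weights:   OP-based weight of each sampled norm ball.
    lam_means: estimated unastuteness of each sampled norm ball.
    Requires at least two sampled norm balls.
    """
    k = len(weights)
    total_w = sum(weights)
    mean = sum(w * lam for w, lam in zip(weights, lam_means)) / total_w
    second_moment = sum(w * lam ** 2 for w, lam in zip(weights, lam_means)) / total_w
    var = (second_moment - mean ** 2) / (k - 1)
    return mean, var

# Toy usage (hypothetical): four equally weighted norm balls.
wp_mean, wp_var = weighted_pmi([1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 1.0, 1.0])
```

Only the sampling of norm balls contributes to this variance, which mirrors Assumption 7: the per-ball estimator variances do not appear.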

4.4 Evaluation on the Proposed RAM

In addition to the running example, we conduct experiments on two more synthetic 2D-datasets, as shown in Figure 6. They represent scenarios with relatively sparse and dense training data, respectively. Moreover, to gain insights on how to extend our RAM for high-dimensional datasets, we also conduct experiments on the popular MNIST and CIFAR10 datasets. For these two image datasets, instead of implementing the steps in Section 4.2, we adopt the solutions to the scalability issues raised by “the curse of dimensionality”, as articulated in Section 4.3. Finally, all modelling details and results after applying our RAM on those datasets are summarised in Table 1, where we compare the testing error, ACU defined by Definition 3, and our RAM results (the mean $\mathbb{E}[\lambda]$, variance $\mathbb{V}[\lambda]$ and a 97.5% confidence upper bound $Ub_{97.5\%}$).

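Definition 3 (ACU).

Stemming from Definition 2 and Remark 2, the unastuteness of a region $\eta$ is consequently one minus its astuteness with respect to $y_\eta$, where $y_\eta$ is the ground truth label of $\eta$ (cf. Eqn. 3). Writing $\lambda_i$ for the unastuteness of the $i$-th region, we then define the ACU of the ML model as:

$$ACU := \frac{1}{m} \sum_{i=1}^{m} \lambda_i,$$

where $m$ is the total number of regions.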

Figure 6: Synthetic datasets DS-1 (lhs) and DS-2 (rhs) representing relatively sparse and dense training data respectively.

| Dataset | train/test error | $r$-separation | radius $\epsilon$ | # of cells | ACU | mean $\mathbb{E}[\lambda]$ | variance $\mathbb{V}[\lambda]$ | upper bound $Ub_{97.5\%}$ | time |
|---|---|---|---|---|---|---|---|---|---|
| The run. exp. | 0.0005/0.0180 | 0.004013 | 0.004 | … | 0.002982 | 0.004891 | 0.000004 | 0.004899 | 0.04 |
| Synth. DS-1 | 0.0037/0.0800 | 0.004392 | 0.004 | … | 0.008025 | 0.008290 | 0.000014 | 0.008319 | 0.03 |
| Synth. DS-2 | 0.0004/0.0079 | 0.002001 | 0.002 | … | 0.004739 | 0.005249 | 0.000002 | 0.005252 | 0.04 |
| MNIST | 0.0051/0.0235 | 0.369 | 0.300 | … | Fig. 7(b) | Fig. 7(a) | Fig. 7(a) | Fig. 7(a) | 0.43 |
| CIFAR10 | 0.0199/0.0853 | 0.106 | 0.100 | … | Fig. 7(d) | Fig. 7(c) | Fig. 7(c) | Fig. 7(c) | 6.74 |

Table 1: Modelling details and results of applying the RAM on five datasets. Time is in seconds per cell.

In the running example, we first observe that the ACU is much lower than the testing error, which means that the underlying ML model is a robust one. Since our RAM is largely based on the robustness evidence, its results are close to ACU, but not exactly the same because of the nonuniform OP, cf. Figure 5-rhs.

Remark 5 (ACU is a special case of pmi).

When the OP is “flat” (uniformly distributed), ACU and our RAM result regarding pmi are equal, which can be seen from Eqn. 4 by setting all $Op_i$s equally to $1/m$, where $m$ is the number of cells.
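To spell out the step behind Remark 5, the substitution can be written as a one-line derivation. The notation is assumed here: $\lambda$ is the pmi, $\lambda_i$ and $Op_i$ are the unastuteness and pooled OP of cell $i$, and $m$ is the number of cells.

```latex
\lambda \;=\; \sum_{i=1}^{m} Op_i\,\lambda_i
        \;=\; \sum_{i=1}^{m} \frac{1}{m}\,\lambda_i
        \;=\; \frac{1}{m}\sum_{i=1}^{m}\lambda_i
        \;=\; \mathrm{ACU}
\qquad \text{when } Op_i = \tfrac{1}{m} \text{ for all } i .
```

Any departure of the OP from uniform therefore shows up as a gap between the pmi estimate and the ACU, with the sign of the gap depending on whether the high-density cells are the more or the less robust ones.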

Moreover, from Figure 5-lhs, we know that the classification boundary is near the middle of the unit square input space where misclassifications tend to happen (say, a “buggy area”), which is also the high density area on the OP. Thus, the contribution to unreliability from the “buggy area” is weighted higher by the OP, explaining why our RAM results are worse than the ACU. In contrast, because of the relatively “flat” OP for the DS-1 (cf. Figure 6-lhs), our RAM result is very close to the ACU (cf. Remark 5). With denser data in DS-2, the $r$-distance is much smaller, which leads to a smaller cell radius and more cells. Thanks to the rich data in this case, all three results (testing error, ACU, and the RAM) are more consistent than in the other two cases. We note that, given the nature of the three 2D-point datasets, ML models trained on them are much more robust than those trained on image datasets. This is why all ACUs are better than test errors, and our RAM finds a middle point representing reliability according to the OP. Later we apply the RAM to two unrobust ML models trained on image datasets, where the ACUs are worse than the test errors, which confirms our aforementioned observations.

Figure 7: The mean, variance and 97.5% confidence upper bound of pmi and ACU as functions of sampled norm balls.

Regarding the MNIST and CIFAR10 datasets, we first train VAEs on them and compress the datasets into the low-dimensional latent spaces of the VAEs with 8 and 16 dimensions, respectively. We then fit the compressed dataset with KDE to approximate the OP. Each compressed data-point is now associated with a weight representing its OP. Consequently, each norm ball in the pixel space that corresponds to the compressed data-point in the latent space (after the mapping, cf. Remark 4) is also weighted by the OP. Taking the computational cost into account (say only the astuteness evaluation on a limited number of norm balls is affordable), we do random sampling, invoke the estimator for weighted average in Eqn.s (14) and (15), and plot our RAM results as functions of the number of sampled norm balls $k$ in Figure 7(a) and 7(c). For comparison, we also plot the ACU results (footnote 17: as per Remark 5, ACU is a special case of pmi with equal weights; thus, ACU results in Figure 7 are also obtained by Eqn.s (14) and (15)) in Figure 7(b) and 7(d).

In Figure 7, we first observe that the ACU results (after converging) of both MNIST and CIFAR10 are worse than their test errors (in Table 1), unveiling again the robustness issues of ML models when dealing with image datasets (the ACU of CIFAR10 is even worse, given that CIFAR10 is indeed a generally harder dataset than MNIST). For MNIST, the mean pmi estimates are much lower than ACU, implying a very “unbalanced” distribution of weights (i.e. OP). Such unevenly distributed weights are also reflected in both the oscillation of the variance and the relatively loose 97.5% confidence upper bound. On the other hand, the OP of CIFAR10 is flatter, resulting in closer estimates of pmi and ACU (Remark 5). In summary, for real-world image datasets, our RAM may effectively assess the robustness of the ML model and its generalisability based on the shape of its approximated OP, which is much more informative than either the test error or ACU alone.

5 Probabilistic Safety Arguments for ML Components

At this lower level of ML components, cf. the SubC7 in Figure 3, we further decompose and organise our safety arguments in two levels: decomposing sub-functionalities of ML components doing object detection, and claiming the reliability of the classification function. In the following sections, we discuss both of them in detail, while focusing more on the latter.

5.1 Arguments for Top Claims on Object Detection at the ML Component-Level

In Figure 8, we present an argument template, again in the CAE blocks at the ML component-level. It aims at breaking down the claim “The object detection is safe enough” LLC1 to a reliability claim stated in the specified measure. The first argument is over all safety related properties, and presented by a CAE block of substitution. The list of all properties of interest for the given application can be obtained by utilising the Property Based Requirements (PBR) approach Micouin2008, forming the side-claim LLSC1, which is supported by the sub-case SubC10. The PBR analysis, recommended in alves_considerations_2018 as a method for safety arguments of autonomous systems, is a way to specify requirements as a set of properties of system objects either in a structured language or formal notations. In this work, we focus on the main quantitative property—reliability—while other properties like security and interpretability are omitted and remain an undeveloped sub-case SubC9 in the CAE template.

Figure 8: ML component-level arguments breaking down the claim “The object detection component is safe enough” LLC1 to reliability claims of the classification function stated in specific reliability measures SubC11.

Starting from LLC2, we then argue over the decomposition by four sub-functionalities of object detection. At the “birth” of an object in the system’s vision (e.g., when the total number of its pixels is greater than a threshold), the ML component should accurately classify it, localise it (normally measured by the Intersection over Union (IoU) of bounding boxes), and do so in good time (e.g., no later than some frames after its birth). Once an object is initially detected at its birth time, the tracking function on that object should be reliable enough to make decision making by other control components safe. The four sub-functionalities of object detection form the claims LLC3-LLC5.
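For reference, the IoU measure mentioned for localisation can be computed as in the sketch below. The `(x1, y1, x2, y2)` corner convention for axis-aligned boxes is an assumption of this example, not something fixed by the argument template.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (if any).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

For instance, `iou((0, 0, 2, 2), (1, 1, 3, 3))` has an intersection of area 1 and a union of area 7; a localisation requirement could then be stated as a threshold on this value.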

To support the reliability of classification at the birth time of the objects LLC3, we concretise the reliability requirements in terms of specific reliability measures, in our case pmi. The “misclassification” and “per input” in pmi need to be clearly defined: (i) we only consider safety-related misclassification events; (ii) an input refers to the image frame capturing the “birth” of an object in the camera’s vision (so that images can be treated as independent, conforming to the definition of pmi). We are then interested in the claim of a bound on pmi with $1 - \alpha$ confidence, where the bound is the requirement derived from higher-level safety analysis.

While the reliability of the other three sub-functionalities can be similarly concretised by some quantitative measures, e.g. IoU for localisation, they remain undeveloped in this article and form important future work.

5.2 Low-Level Arguments for Classification Based on the RAM

In this section, we present SubC11 and show how to support a reliability claim stated in pmi based on our RAM developed in Section 4, which is the “backbone” of the probabilistic arguments at this lower level. Essentially, we argue over the four main steps of our RAM as shown in Figure 9. Note that, depending on the data dimensionality of the specific application, we may either use the “low-dimensional” version of our RAM, where the whole input space is partitioned into cells, or apply the “high-dimensional” version, in which norm balls (of relatively sparse data) are determined instead (cf. Remark 4) based on the collected data to form the sample frame (representing the population of all norm balls partitioning the whole input space). Indeed, the method of exhaustively partitioning cells is also applicable to high-dimensional data, but it would yield an extremely large number of cells that are not only infeasible to examine exhaustively but also quite difficult to index for sampling. That said, for high-dimensional datasets, we determine norm balls from the data instead, forming a smaller and more practical sampling frame. However, the price paid is the introduction of two more noise factors in the assurance: the bias/error from the construction of the sampling frame and the relatively low sampling rate. The former can be mitigated by conventional ways of checking (and rebuilding if necessary) the sampling frame, while the latter has been captured and quantified by the variance of the point estimate (cf. Eqn. (15) and Assumption 7).

Figure 9: Arguments over the four main steps in the proposed RAM.

Figures 10 to 13 show the arguments based on steps 1 to 4 of our RAM, respectively. While the arguments presented in CAE are self-explanatory together with the technical details articulated in Section 4, we note that i) all modelling assumptions are presented as side-claims of arguments that need more application-specific development and justification; and ii) the development of some claims is omitted for brevity, because they are generic claims for which we refer to other works, e.g. ashmore2021assuring for SubC11-C3.2 and SubC11-C3.3 when we treat the OP estimator as a common data-driven learning model.

Figure 10: Arguments based on the step 1 of the RAM.
Figure 11: Arguments based on the step 2 of the RAM.
Figure 12: Arguments based on the step 3 of the RAM.
Figure 13: Arguments based on the step 4 of the RAM.

6 A Case Study of AUV Missions

In this section, a case study based on a simulated AUV that performs survey and asset inspection missions is conducted. We first describe the scenario in which the mission is performed, details of the AUV under test, and how the simulator is implemented. Then, corresponding to Section 3, we exercise the proposed assurance activities for this AUV application, i.e., HAZOP, hazards scenarios modelling, FTA, and discussions on deriving the system-level quantitative safety target for this scenario. Finally, we apply our RAM on the image dataset collected from a large amount of statistical testing. All source code, simulators, ML models, datasets and experiment results are publicly available on our project website with a video demo at

6.1 Scenario Design

AUVs are increasingly adopted for marine science, offshore energy, and other industrial applications in order to increase productivity and effectiveness as well as to reduce human risks and the offshore operation of crewed surface support vessels lane_new_2016. However, the fact that AUVs frequently operate in close proximity to safety-critical assets (e.g., offshore oil rigs and wind turbines) for inspection, repair and maintenance tasks leads to challenges in assuring their reliability and safety, which motivates the choice of an AUV as the object of our case study.

6.1.1 Mission Description and Identification of Mission Properties

Based on industrial use cases of autonomous underwater inspection, we define a test scenario for AUVs that need to operate autonomously and carry out a survey and asset inspection mission, in which an AUV follows several way-points and terminates with autonomous docking. During the mission, it needs to detect and recognise a set of underwater objects (such as oil pipelines and wind farm power cables) and inspect assets (i.e., objects) of interest, while avoiding obstacles and keeping the required safe distances to the assets.

Given the safety/business-critical mission, different stakeholders have their own interests in a specific set of hazards and safety elements. For instance, asset owners (e.g., wind farm operators) focus more on the safety and health of the assets that are scheduled to be inspected, whereas inspection service providers tend to have additional concerns regarding the safety and reliability of their inspection service and vehicles. In contrast, regulators and policy makers may be more interested in environmental and societal impacts that may arise when a failure unfortunately happens. By keeping these different safety concerns in mind, we identify a set of desirable mission properties, whose violation may lead to unsuccessful inspection missions, compromise the integrity of critical assets, or damage the vehicle itself.

While numerous high-level mission properties are identified based on our engineering experience, references to publications (e.g., hereau_testing_2020) and iterations of hazard analysis, we focus on a few that are instructive for the ML classification function in this article (cf. the project website for a complete list):

  • No miss of key assets: the total number of correctly recognised assets/objects should be equal to the total number of assets that are required to be inspected during the mission.

  • No collision: during the full mission, the AUV should avoid all obstacles perceived without collision.

  • Safe distancing: once an asset is detected and recognised, the Euclidean distance between the AUV and the asset must be kept to be at least the defined minimal safe operating distance.

  • Autonomous docking: safe and reliable docking to the docking cage.

Notably, such an initial set of desirable mission properties forms the starting point of our assurance activities, cf. Figure 4 and Section 6.2.
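As an illustration of how one of these mission properties could be checked at run time or in simulation logs, the following sketch monitors the safe-distancing property. The function name, the position tuples and the threshold value are hypothetical; the actual minimal safe operating distance is mission-specific.

```python
import math

def violates_safe_distance(auv_position, asset_position, min_safe_distance):
    """True if the AUV is closer to a recognised asset than the minimal safe
    operating distance (Euclidean distance between the two positions)."""
    return math.dist(auv_position, asset_position) < min_safe_distance

# Hypothetical positions in metres: the AUV is exactly 5 m from the asset.
too_close = violates_safe_distance((0.0, 0.0, 0.0), (3.0, 4.0, 0.0), 6.0)
```

A violation count over a mission trace gives a simple pass/fail signal for this property, which can then be linked back to the hazards identified for it.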

6.1.2 The AUV Under Test


Although we are only conducting experiments in simulators at this stage, our trained ML model can be easily deployed to real robots and the experiments are expected to be reproducible in real water tanks. Thus, we simulate the AUV in our laboratory: a customised BlueROV2, which has 4 vertical and 4 horizontal thrusters for 6 degrees of freedom motion. As shown in Figure 14-lhs, it is equipped with a custom underwater stereo camera designed for underwater inspection. A Water Linked A50 Doppler Velocity Log (DVL) is installed for velocity estimation and control. The AUV also carries an Inertial Measurement Unit (IMU), a depth sensor and a Tritech Micron sonar. The AUV is extended with an on-board Nvidia Jetson Xavier GPU computer and a Raspberry Pi 4 embedded computer. An external PC can also be used for data communication, remote control, mission monitoring, and data visualisation of the AUV via its tether.

Figure 14: Hardware and software architecture and key modules for autonomous survey and inspection missions.

Software Architecture

With the hardware platform, we develop a software stack for underwater autonomy based on the ROS. The software modules that are relevant to the aforementioned AUV missions are (cf. Figure 14):

  • Sensor drivers. All sensors are connected to on-board computers via cables, and their software drivers are deployed to capture real-time sensing data.

  • Stereo vision and depth estimation. This module processes stereo images by removing their distortion and enhancing their quality for inspection. After rectification, the stereo images are used to estimate depth maps for 3D mapping and obstacle avoidance.

  • Localisation and mapping algorithm. In order to navigate autonomously and carry out a mission, we need to localise the vehicle and build a map for navigation. We develop a graph optimisation based underwater simultaneous localisation and mapping system by fusing stereo vision, DVL, and IMU. It also builds a dense 3D reconstruction model of structures for geometric inspection.

  • Detection and recognition model. This is one of the core modules for underwater inspection based on ML models. It is designed to detect and recognise objects of interest in real-time. Based on the properties of detected objects— in particular the underwater assets to inspect—the AUV makes decisions on visual data collection and inspection.

  • Obstacle avoidance and path planning. The built 3D map and its depth estimation are used for path planning, considering obstacles perceived by the stereo vision. Specifically, a local trajectory path and its way-points are generated in the 3D operating space based on the 3D map built from the localisation and mapping algorithm. Next the computed way-point is passed to the control driver for trajectory and way-point following.

  • Control driver. We have a back seat driver for autonomous operations, enabling the robot to operate as an AUV. Once the planned path and/or a way-point is received, a proportional–integral–derivative (PID) based controller drives the thrusters to follow the path and approach the way-point. The controller can also be replaced by a learning-based adaptive controller. While the robot moves in the environment, it continues perceiving the surrounding scene and processing the data using the previous software modules.
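The PID control law mentioned in the last item can be sketched as follows. This is a minimal, textbook discrete-time form in Python and not the controller deployed on the AUV; the class name `PID`, the gains, and the time step are hypothetical and chosen only for illustration.

```python
class PID:
    """Textbook discrete-time PID controller (illustrative sketch only)."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0      # running integral of the error
        self.prev_error = None   # error at the previous step, if any

    def step(self, error, dt):
        """Return the control output for the current way-point error."""
        self.integral += error * dt
        if self.prev_error is None:
            derivative = 0.0     # no derivative information on the first step
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


# Hypothetical gains; one controller per controlled degree of freedom.
surge_pid = PID(kp=2.0, ki=0.5, kd=0.1)
thrust_command = surge_pid.step(error=1.0, dt=0.1)
```

In practice one such controller would typically be instantiated per controlled axis, with the error computed from the current pose estimate and the next way-point.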

ML Model Doing Object Detection

In this work, the state-of-the-art Yolo-v3 DL architecture redmon2018yolov3 is used for object detection. Its computational efficiency and real-time performance are both critical for its application for underwater robots, as they mostly have limited on-board computing resources and power. The inference of Yolo can be up to 100 frames per second. Yolo models are also open source and built using the C language and the library is officially supported by OpenCV, which makes its integration with other AUV systems not covered in this work straightforward. Most DL-based object detection methods are extensions of a simple classification network. The object detection network usually generates a set of proposal bounding boxes; they might contain an object of interest and are then fed to a classification network. The Yolo-v3 network is similar in operation to, and is based on, the darknet53 classification network.

The process of training the Yolo networks using the Darknet framework is similar to the training of most ML models, and includes data collection, model architecture implementation, and training. The framework consists of configuration files that can be set to match the number of object classes and other network parameters. Examples of training and testing data for the simulated version of the model are described in Section 6.1.3. The model training can be summarised by the following steps: i) define the number of object categories; ii) collect sufficient data samples for each category; iii) split the data into training and validation sets; and iv) use the Darknet software framework to train the model.
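Steps i) to iii) above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names, the 80/20 split, and all file paths are hypothetical, and the `.data` layout follows the common Darknet convention (`classes`, `train`, `valid`, `names`, `backup`) rather than the exact configuration used in this case study.

```python
import random

# Step i): the six simulated object classes named in Section 6.1.3.
CLASSES = ["pipe", "gas tank", "gas canister",
           "oil barrel", "floating ball", "docking cage"]


def split_dataset(image_paths, val_fraction=0.2, seed=0):
    """Step iii): shuffle reproducibly and split into (train, valid) lists."""
    paths = sorted(image_paths)
    random.Random(seed).shuffle(paths)
    n_val = int(round(len(paths) * val_fraction))
    return paths[n_val:], paths[:n_val]


def darknet_data_file(n_classes, train_list, valid_list, names_file, backup_dir):
    """Render the text of a Darknet-style '.data' file (all paths hypothetical)."""
    return (f"classes = {n_classes}\n"
            f"train = {train_list}\n"
            f"valid = {valid_list}\n"
            f"names = {names_file}\n"
            f"backup = {backup_dir}\n")


# Step ii) is assumed done: a hypothetical list of collected image paths.
images = [f"data/auv/img_{i:04d}.jpg" for i in range(1000)]
train, valid = split_dataset(images, val_fraction=0.2, seed=42)
config_text = darknet_data_file(len(CLASSES), "data/train.txt",
                                "data/valid.txt", "data/auv.names", "backup/")
```

Fixing the shuffle seed keeps the split reproducible, which matters when the same validation set is later reused as evidence in an assurance case.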

6.1.3 The Simulator

The simulator uses the popular Gazebo robotics simulator in combination with a simulator for underwater dynamics. The scenario models can be created/edited using Blender 3D software. We have designed the Ocean Systems Lab’s wave tank model (cf. Figure 15-lhs) for the indoor simulated demo, using BlueROV2 within the simulation to test the scenarios. The wave tank model has the same dimension as our real tank.

Figure 15: A wave-tank for simulated testing and a simulated pool for collecting the training data.

To ensure that the model does not overfit the data, we have designed another scenario with a bigger pool for collecting the training data. The larger size allows for more distance between multiple objects, which both broadens the set of training scenarios and makes them more realistic. The simulated training environment is presented in Figure 15-rhs.

Our simulator creates configuration files to define an automated path using Cartesian way-points for the vehicle to follow autonomously, which can be visualised using Rviz. The pink trajectory is the desired path and the red arrows represent the vehicle poses following the path, cf. Figure 16-lhs. There are six simulated objects in the water tank: a pipe, a gas tank, a gas canister, an oil barrel, a floating ball, and the docking cage, as shown in Figure 16-rhs. The underwater vehicle needs to detect them accurately and in a timely manner during the mission. Notably, the mission is also subject to random noise factors, so that repeated missions will generate different data to be processed by the learning-enabled components.

Figure 16: Simulated AUV missions following way-points and the six simulated objects.

6.2 Assurance Activities for the AUV

Hazard Analysis via HAZOP

Given the AUV system architecture (cf. Figure 14) and control/data flow among the nodes, we may conduct a HAZOP analysis that yields a complete version of Table 2. For this work, we only present partial HAZOP results and highlight a few hazards that are due to misclassification.

| HAZOP item | Attribute | Guide-word | Cause | Consequence | Mitigation |
|---|---|---|---|---|---|
| flow from object detection to obstacle avoidance and path planning | data flow | too late | … | … | … |
| | data value | wrong value | misclassification | erratic navigation; unsafe distance to assets; collision with assets; failed inspection | acoustic guidance; minimum DL-classifier reliability for critical objects; maximum safe distance maintained if uncertain |
| | | no value | … | … | … |

Table 2: Partial HAZOP results, highlighting the cause of misclassification (NB, entries of “…” are intentionally left blank).
Hazard Scenarios Modelling

Inspired by guo_extended_2015, we develop the hazard scenarios as chains of events that link the causes to the consequences identified by HAZOP. Again, for illustration, a single event-chain is shown in Figure 17, which propagates the event of misclassification on assets via the system architecture to the violation of the mission property of keeping a safe distance to assets. As shown later, this event-chain forms one path of a fault tree in the FTA in Figure 18.

Figure 17: A single event-chain based on the hazard scenario modelling, linking causes to consequences.
Quantitative FTA

We first construct fault trees for each hazard (as TE) identified by HAZOP, by extending and combining (via logic gates) the IEs modelled by hazard scenario analysis. Each event-chain yielded by the hazard scenario analysis then forms one path in a fault tree. For instance, the event-chain of Figure 17 eventually becomes the path BE-0-1 → IE-1-1 → IE-2-2 → IE-3-2 → TE in Figure 18. Finally, knowing the probabilities of the BEs and the logic gates allows for the calculation of the TE probability. As shown by the second iteration loop in Figure 4, several rounds of what-if calculations, sensitivity analysis and updates of the components are expected to yield the most practical set of BE probabilities associated with a given tolerable risk of the TE.
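The TE calculation and a simple what-if query can be sketched as below. This is a minimal illustration that assumes statistically independent input events at every gate; the tree shape, event names and probabilities are hypothetical and do not reproduce Figure 18.

```python
from math import prod


def and_gate(probs):
    """Output occurs only if all (independent) inputs occur."""
    return prod(probs)


def or_gate(probs):
    """Output occurs if at least one (independent) input occurs."""
    return 1.0 - prod(1.0 - p for p in probs)


def evaluate(node, be_probs):
    """Evaluate a tree given as a BE name or a (gate, [children]) tuple."""
    if isinstance(node, str):
        return be_probs[node]
    gate, children = node
    child_probs = [evaluate(child, be_probs) for child in children]
    if gate == "AND":
        return and_gate(child_probs)
    if gate == "OR":
        return or_gate(child_probs)
    raise ValueError(f"unknown gate: {gate}")


def max_be_prob(tree, be_probs, be_name, te_target, iters=60):
    """What-if query: largest probability of one BE that keeps TE <= target.

    Uses bisection, relying on the TE probability being non-decreasing
    in each BE probability for trees built from AND/OR gates only.
    """
    trial = dict(be_probs)
    trial[be_name] = 0.0
    if evaluate(tree, trial) > te_target:
        return None  # target unreachable by changing this BE alone
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        trial[be_name] = mid
        if evaluate(tree, trial) <= te_target:
            lo = mid
        else:
            hi = mid
    return lo


# Hypothetical tree: TE = (misclassification AND mitigation fails) OR other causes.
TREE = ("OR", [("AND", ["misclassification", "mitigation_fails"]), "other_causes"])
BE_PROBS = {"misclassification": 1e-3, "mitigation_fails": 0.1, "other_causes": 1e-5}

te_prob = evaluate(TREE, BE_PROBS)
required = max_be_prob(TREE, BE_PROBS, "misclassification", te_target=2e-4)
```

The value returned by `max_be_prob` plays the role of a component-level reliability requirement derived from a system-level target, which is the kind of break-down the iteration loop is meant to produce.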

Figure 18: A partial fault tree for the TE of loss of a safe distance to assets. NB, the “cloud” notation represents omitted sub-trees.
Deriving Quantitative System Safety Target

Based on the experience of relatively more developed safety-critical domains of AI, such as self-driving cars and medical devices (cf. Section 3.2 for some examples), we believe that referring to the average performance of human divers and/or human remote control operators is a promising way of determining the high-level quantitative safety target for our AUV case. The presumption is that, prior to the use of AUVs for asset inspection, human divers and remotely controlled robots conducted the task regularly. This is similar to how safety targets were developed in the civil aircraft sector, where acceptable historical accident rates serve as the benchmark. In our case, using the performance of human divers/operators as the target for an AUV’s safety risk may be impeded by the lack of historical/statistical data on such performance. Given that ML for AUVs is a relatively novel technique that is still maturing towards practical use, an urgent lesson from this work for all AUV stakeholders (especially manufacturers, operators and end users) is to collect and summarise such data.

6.3 Reliability Modelling of the AUV’s Classification Function

Details of the Yolo-v3 model trained in this case study are presented in Table 3, Appendix B. We adopt the practical solutions discussed in Section 4.3 to deal with the high dimensionality of the collected operational dataset (256×256×3) by first training a VAE model and compressing the dataset into a new space with a much lower dimensionality of 8. Training details of the VAE model are summarised in Table 4, and four sets of examples are shown in Figure 20, from which we can see that the reconstructed images preserve the essential features of the objects (while blurring the less important background). We then choose a norm ball radius according to the $r$-separation distance (footnote 18: because more than one object may appear in a single image, the label of the “dominating” object, e.g., the object with the largest bounding box and/or with higher priority, can be used in the calculation of $r$; for simplicity, we first preprocess the dataset by filtering out images with multiple labels, and then determine the radius based on an estimated $r$) and invoke the KDE and robustness estimator webb_statistical_2019 for randomly selected norm balls. Individual estimates of the norm balls are then fed into the estimator for weighted average, Eqn.s (14) and (15). For comparison, we also calculate the ACU by assuming equal weights (i.e., a flat OP) in Eqn.s (14) and (15). Finally, the reliability claims on pmi and ACU are plotted as functions of the number of sampled norm balls in Figure 19. Interpretation of the results is similar to that for CIFAR10, where the OP is also relatively flat.
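The contrast between the OP-weighted average and its flat-OP counterpart can be illustrated with the sketch below. It is an assumption-laden simplification: a self-normalised weighted mean stands in for Eqn.s (14) and (15), whose exact form is not reproduced here, and the function names and numbers are hypothetical.

```python
def op_weighted_estimate(op_densities, cell_estimates):
    """Self-normalised weighted mean of per-norm-ball robustness estimates.

    op_densities:   OP density (e.g., from a KDE) at each sampled norm ball.
    cell_estimates: estimated misclassification probability in each norm ball.
    """
    if len(op_densities) != len(cell_estimates):
        raise ValueError("one weight per norm ball is required")
    total = sum(op_densities)
    return sum(w * e for w, e in zip(op_densities, cell_estimates)) / total


def flat_op_estimate(cell_estimates):
    """Same estimator under equal weights, i.e., assuming a flat OP."""
    return op_weighted_estimate([1.0] * len(cell_estimates), cell_estimates)


# Hypothetical numbers: three sampled norm balls.
densities = [0.7, 0.2, 0.1]   # the first region is used most often in operation
estimates = [0.0, 0.01, 0.1]  # but it is also the most robust one
weighted = op_weighted_estimate(densities, estimates)
flat = flat_op_estimate(estimates)
```

With these made-up inputs the weighted figure is lower than the flat one because the frequently used region is also the robust one; when the OP is nearly flat, as reported for this case study, the two figures would be expected to be close.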

Figure 19: The mean, variance and 97.5% confidence upper bound of AUV’s pmi and ACU as functions of sampled norm balls.

7 Related Work

Assurance Cases for AI/ML-powered Autonomous Systems

Work on safety arguments and assurance cases for AI/ML models and autonomous systems has emerged in recent years. Burton et al. burton_mind_2020 draw a broad picture of assuring AI and identify/categorise the “gap” that arises across the development process. Alves et al. alves_considerations_2018 present a comprehensive discussion on the aspects that need to be considered when developing a safety case for increasingly autonomous systems that contain ML components. Similarly, in BKCF2019, an initial safety case framework is proposed with discussions on specific challenges for ML, which is later implemented with more details in bloomfield2021safety. A recent work javed_towards_2021 also explicitly suggests the combination of HAZOP and FTA in safety cases for identifying/mitigating hazards and deriving safety requirements (and safety contracts) when studying Industry 4.0 systems. In KKB2019, safety arguments that are widely used for conventional systems (including conformance to standards, proven in use, field testing, simulation, and formal proofs) are recapped for autonomous systems with discussions on the potential pitfalls. Both matsuno_tackling_2019 and ishikawa_continuous_2018 propose utilising continuously updated arguments to monitor the weak points and the effectiveness of their countermeasures. A similar mechanism is also suggested in our assurance case, e.g., continuously monitoring/estimating key parameters of our RAM. All of these essentially align with the idea of dynamic assurance cases calinescu_engineering_2018; asaadi_dynamic_2020.

Although the aforementioned works have inspired this article, our assurance framework places greater emphasis on, and thus complements them in, the quantitative aspects, e.g., reasoning for reliability claims stated in bespoke measures and breaking down system-level safety targets to component-level quantitative requirements. Also exploring quantitative assurance, Asaadi et al. asaadi_quantifying_2020 identify dedicated assurance measures that are tailored for properties of aviation systems.

OP-based Software Testing

OP-based software testing, also known as statistical/operational testing strigini_guidelines_1997, is an established practice, which is supported by industry standards for conventional systems. There is a huge body of literature in the traditional software reliability community on OP-based testing and reliability modelling techniques, e.g., bertolino_adaptive_2021; bishop_deriving_2017; pietrantuono_reliability_2020; zhao_assessing_2020. In contrast to this, OP-based software testing for ML components is still in its infancy: to the best of our knowledge, there are only two recent works that explicitly consider the OP in testing. Li et al. li_boosting_2019 propose novel stratified sampling based on ML specific information to improve the testing efficiency. Similarly, Guerriero et al. guerriero_operation_2021 develop a test case sampling method that leverages “auxiliary information for misclassification” and provides unbiased reliability estimators. However, neither of them considers robustness evidence in their assessment like our RAM does.

At the whole LES level, there are reliability studies based on operational and statistical data, e.g., kalra_driving_2016; zhao_assessing_2019 for self-driving cars, hereau_testing_2020; zhao_probabilistic_2019 for AUV, and robert_virtual_2020 for agriculture robots doing autonomous weeding. However, knowledge from low-level ML components is usually ignored. In zhao_assessing_2020, we improved kalra_driving_2016 by providing a Bayesian mechanism to combine such knowledge, but did not discuss where to obtain the knowledge. In that sense, this article also contains follow-up work of zhao_assessing_2020, providing the prior knowledge required based on the OP and robustness evidence.

Given that the OP is essentially a distribution defined over the whole input space, a related topic is the distribution-aware testing for DL developed recently. For instance, in DBLP:conf/icse/Berend21, distribution-guided coverage criteria are developed to guide the generation of new unseen test cases while identifying the validity of errors in DL system tasks. In DBLP:conf/icse/DolaDS21, a generative model is utilised to guide the generation of valid test cases. However, their notion of “distribution” normally refers to realistic perturbations on inputs such as Gaussian noise, blur, haze, contrast variation zhao2018generating, or even human imperceptible noise. Thus, it is a different notion compared to the OP that models the end-users’ behaviours.

8 Discussion

8.1 Discussions on the Proposed RAM

In this section, we summarise the model assumptions made in our RAM, and discuss if/how they can be validated and which new assumptions and compromises in the solutions are needed to cope with real-world applications with high-dimensional data. Finally, we list the inherent difficulties of assessing ML reliability uncovered by our RAM.

$r$-Separation and its Estimation

Assumption 1 derives from Remark 3. We concur with yang_closer_2020 and believe that, for any real-world ML classification application where the inputs are data-points with “physical meanings”, there should always exist an $r$-stable ground truth. Such an $r$-stable ground truth varies between applications, and the smaller $r$ is, the harder the classification problem inherently becomes. The value of $r$ is therefore a difficulty indicator for the given classification problem. Indeed, it is hard to estimate $r$ (in either the input pixel space or the latent feature space); the best we can do is to estimate it from the existing dataset. One way of addressing the problem is to keep monitoring the estimates of $r$ as more labelled data is collected, e.g. during operation, and to redo the cell partition when the estimated $r$ has changed significantly. Such a dynamic way of estimating $r$ can be supported by the concept of dynamic assurance cases asaadi_dynamic_2020.
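Estimating the separation distance from an existing labelled dataset can be sketched as follows. The sketch assumes the convention that two differently labelled points are at least twice the separation distance apart, so the estimate is half of the smallest inter-class distance; this convention, the function names and the brute-force pairwise search are illustrative assumptions rather than the procedure used in our experiments.

```python
import math
from itertools import combinations


def l2(a, b):
    """Euclidean distance between two points."""
    return math.dist(a, b)


def linf(a, b):
    """Chebyshev (L-infinity) distance between two points."""
    return max(abs(x - y) for x, y in zip(a, b))


def estimate_separation(points, labels, distance=l2):
    """Half of the smallest distance between any two differently labelled points."""
    gaps = [distance(p, q)
            for (p, lp), (q, lq) in combinations(zip(points, labels), 2)
            if lp != lq]
    if not gaps:
        raise ValueError("need at least two points with different labels")
    return min(gaps) / 2.0


# Hypothetical 2-D latent codes with two classes.
codes = [(0.0, 0.0), (0.0, 1.0), (3.0, 0.0)]
labels = [0, 0, 1]
r_hat = estimate_separation(codes, labels)
```

Because the estimate can only shrink as more labelled data arrives, re-running it on the growing dataset gives a simple trigger for redoing the cell partition. The quadratic cost of the pairwise search would need a nearest-neighbour index for large datasets.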

Approximation of the OP from Data

Assumption 2 says that the collected dataset statistically represents the OP, which may not hold for many practical reasons, e.g., when the future OP is uncertain at the training stage and data is therefore collected in a balanced way to perform well in all categories of inputs. Although we demonstrate our RAM under this assumption for simplicity, it can be easily relaxed. Essentially, we try to fit a PDF over the input space from an “operational dataset” (representing the OP). Data-points in this set can be unlabelled raw data generated from historical data of previous applications and simulations, which can then be scaled based on domain expert knowledge (e.g., by DL generative models that we are currently investigating). Obtaining such an operational dataset is an application-specific engineering problem, and is manageable because it does not require labelled data. Notably, the OP may also be approximated at runtime based on the stream of operational data, for which efficient KDE for data streams qahtan_kde_track_2017 can be used. If the OP is subject to sudden changes, change-point detectors like zhao_interval_2020 should also be paired with the runtime estimator to robustly approximate the OP. Again, such a dynamic way of estimating the OP can also be supported by dynamic assurance cases asaadi_dynamic_2020.

Determination of the Ground Truth of a Cell

Assumptions 3 and 4 are essentially about how to determine the ground truth label for a given cell, which relates to the oracle problem of testing ML software. While this remains challenging, we partially solve it by leveraging the $r$-separation property. Thanks to $r$-separation, it is easy to determine a cell’s ground truth when it contains labelled data-points. For an empty cell, however, this is non-trivial. We assume that the overall performance of the ML model is fairly good (e.g., better than a classifier doing random classifications), so that misclassifications within an empty cell are relatively rare events. We can then determine the ground truth label of the cell by majority voting of predictions. Indeed, this is a strong assumption when there are some “failure regions” in the input space, within which the ML model performs really badly (even worse than random labelling). In this case, we need a new mechanism to detect such “really bad failure regions”, or to spend more budget on, for example, asking humans to do the labelling.
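The majority-voting rule for an empty cell can be sketched as below. The cell is modelled as an axis-aligned box around a centre point and inputs are drawn uniformly from it; the function name, the toy classifier and all parameters are hypothetical.

```python
import random
from collections import Counter


def majority_vote_label(classifier, centre, radius, n_samples=200, seed=0):
    """Label a cell by the most frequent prediction over uniform samples.

    Returns (label, agreement), where agreement is the fraction of samples
    that received the winning label. A low agreement hints that the
    'fairly good classifier' assumption may not hold in this cell.
    """
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_samples):
        x = [c + rng.uniform(-radius, radius) for c in centre]
        votes[classifier(x)] += 1
    label, count = votes.most_common(1)[0]
    return label, count / n_samples


def toy_classifier(x):
    """Hypothetical stand-in for the trained model: splits on the first feature."""
    return 1 if x[0] > 0.0 else 0


label, agreement = majority_vote_label(toy_classifier, centre=[0.5, 0.0], radius=0.1)
```

Reporting the agreement alongside the label is one cheap way to flag candidate “failure regions”, which could then be routed to human labelling.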

Efficiency of Cell Robustness Evaluation

Although we only applied the two methods of SMC and webb_statistical_2019 in our experiments to evaluate the local robustness, we believe that other statistical sampling methods designed for estimating the probability of rare-events could be used as well. Moreover, the cell robustness estimator in our RAM works in a “hot-swappable” manner: any new and more efficient estimator can easily be incorporated. Thus, despite being an important question, how to improve the efficiency of the robustness estimation for cells is beyond the scope of our RAM.
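As an illustration of the simplest option, a plain Monte Carlo estimate of the misclassification probability within one cell can be sketched as follows. It assumes uniform sampling inside an axis-aligned norm ball and a known ground truth label for the cell; the function name and the toy classifier are hypothetical, and this is not the implementation used in our experiments.

```python
import math
import random


def smc_cell_estimate(classifier, centre, radius, true_label,
                      n_samples=10000, seed=0):
    """Estimate P(misclassification) in a cell, with its standard error."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(n_samples):
        x = [c + rng.uniform(-radius, radius) for c in centre]
        if classifier(x) != true_label:
            failures += 1
    p_hat = failures / n_samples
    std_err = math.sqrt(p_hat * (1.0 - p_hat) / n_samples)
    return p_hat, std_err


def toy_classifier(x):
    """Hypothetical model whose decision boundary cuts through the cell."""
    return 1 if x[0] > 0.55 else 0


p_hat, std_err = smc_cell_estimate(toy_classifier, centre=[0.5, 0.5],
                                   radius=0.1, true_label=0, n_samples=20000)
```

When failures are truly rare, the number of samples needed for a tight estimate grows quickly, which is the motivation for swapping in dedicated rare-event estimators behind the same interface.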

Conditional OP of a Cell

We assume that the distribution of inputs (the conditional OP) within each cell is uniform by Assumption 5. Although we conjecture that this is the common case due to the small size of cells (i.e., those very close/similar inputs within a small region are only subject to noise factors that can be modelled uniformly), the specific situation may vary; this requires justification in safety cases. For a real-world dataset, the conditional OP might follow the distributions of certain “natural variations” zhong_understanding_2021, e.g. lighting conditions. Ideally, the conditional OP of cells should capture the distribution of such natural variations. Recent advances in measuring the naturalness/realisticness of AEs harel_canada_is_2020 are closely related to this assumption and may help to relax it.

Independent Cell OPs and Cell Robustness Measures

As per Assumption 6, we assume that all cell OPs and cell robustness measures are independent when “assembling” their estimates via Eqn. (11) and deriving the variance via Eqn. (12). This assumption is largely for mathematical tractability when propagating the confidence in individual estimates at the cell-level to the pmi. Although this independence assumption is hard to justify in practice, it is not unusual in reliability models that partition the input space, e.g., in pietrantuono_reliability_2020; miller_estimating_1992. We believe that RAMs are still useful under this assumption, while we envisage that Bayesian estimators leveraging joint priors and conjugacy may relax it.
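To see why independence makes the propagation tractable, consider a generic sum of products of cell-level quantities. The notation below is assumed here for illustration only: $Op_i$ denotes the OP of cell $i$, $\lambda_i$ its robustness measure, and $m$ the number of cells; the expressions are the standard moments of sums and products of independent random variables and are not claimed to be the exact form of Eqn.s (11) and (12).

```latex
% Assumed notation: Op_i = OP of cell i, \lambda_i = robustness measure of cell i.
% Under mutual independence of all Op_i and \lambda_i:
\begin{align}
\mathbb{E}\Big[\sum_{i=1}^{m} Op_i \lambda_i\Big]
  &= \sum_{i=1}^{m} \mathbb{E}[Op_i]\,\mathbb{E}[\lambda_i], \\
\mathbb{V}\Big[\sum_{i=1}^{m} Op_i \lambda_i\Big]
  &= \sum_{i=1}^{m} \Big( \mathbb{V}[Op_i]\,\mathbb{V}[\lambda_i]
     + \mathbb{V}[Op_i]\,\mathbb{E}[\lambda_i]^2
     + \mathbb{V}[\lambda_i]\,\mathbb{E}[Op_i]^2 \Big).
\end{align}
```

Without independence, the variance would also contain covariance terms between cells, which are hard to estimate from the available evidence.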

Uncertainties Raised by Individual OP and Robustness Estimates

This relates to how reliable the chosen OP and robustness estimators themselves are. Our RAM is flexible and evolvable in the sense that it does not depend on any specific estimators. New and more reliable estimators can therefore easily be integrated to reduce the estimation uncertainties. Moreover, such uncertainties raised by estimators are propagated and compounded in our overall RAM results, cf. Eqn.s (12) and (15). Although we ignore them as per Assumption 7, doing so is arguably acceptable when the two estimators are fairly reliable and the number of samples is much smaller than the sample frame size.

Inherent Difficulties of Reliability Assessment on ML Software

Finally, based on our RAM and the discussions above, we summarise the inherent difficulties of assessing ML reliability as the following questions:

  • How to accurately learn the OP in a potentially high-dimensional input space with relatively sparse data?

  • How to build an accurate test oracle (to determine the ground truth label) by, e.g., leveraging the existing labels (done by humans) in the training dataset?

  • What is the local distribution (i.e. the conditional OP) over a small input region (which is potentially only subject to subtle natural variations of physical conditions in the environment)?

  • How to efficiently evaluate the robustness of a small region, given that AEs are normally rare events? And how to reduce the risk associated with an AE (e.g., referring to ALARP)?

  • How to efficiently sample small regions from a large population (due to the high-dimensionality) of regions to test the local robustness in an unbiased and uncertainty informed way, given a limited budget?

We provide solutions in our RAM that are practical compromises (cf. Section 4.3), while the questions above are still challenging. At this stage, we doubt the existence of other RAMs for ML software with weaker assumptions that achieve the same level of rigorousness as ours, in which sense our RAM advances in this research direction.

8.2 Discussions on the Overall Assurance Case Framework and Low-Level Probabilistic Safety Arguments

With the emphasis on quantitative aspects of assuring LES (and thus complementing existing assurance frameworks, e.g., bloomfield2021safety), our overall assurance framework and the low-level probabilistic safety arguments together form a “vertically” end-to-end assurance case, in which a chain of safety/reliability techniques is integrated. However, the assurance case presented is still incomplete “horizontally”: some sub-cases and (side-)claims are undeveloped, because they are either generic claims that have been studied elsewhere (and omitted for simplicity), e.g. in bloomfield2021safety; ashmore2021assuring, or are still quite hard to argue in general and thus require specific expert judgement on a case-by-case basis.

The proposed safety analysis activities—HAZOP, hazard scenarios modelling, FTA, our RAM, and the determination of the system-level safety targets based on the performance of human/similar-products—are not exclusive in our assurance framework; rather we concur with KKB2019 that credible safety cases require a heterogeneous approach. A dangerous pitfall is that those activities are not performed sufficiently because of, say, the analyser’s limited engineering knowledge/experience and the lack of empirical data. This is, however, not unique to our assurance framework, but rather generic to any assurance studies.

We only present safety arguments for the classification function of the ML component, based on our new RAM for ML classifiers, leaving claims for the other three functions (localisation, detection timing, and object tracking) undeveloped (footnote 19: certainly, for real safety cases, we also need to develop claims on “non-ML” parts, e.g., capability of the development team and quality of the code, which can be addressed by conventional approaches that we omit in this work). The general idea and principles, however, are applicable to the other three functions, too: we may first define bespoke reliability measures for each (like pmi for classification), and then do probabilistic reliability modelling based on statistical testing evidence. This forms important future work.

8.3 Discussions on the Simulated AUV Case Study

So far, we have conducted a case study in simulators to validate and demonstrate our proposed methods. While defending the role of simulation in certification and regulation is beyond the scope of this work, simulation is arguably necessary for many reasons as long as the simulation satisfies some prerequisites—for example, the fidelity is justifiable, scenario-coverage is sufficiently high, and non-zero real-world testing is conducted to validate the simulation. That said, we plan to conduct a real-world case study in a physical wave tank, in which the conditions may be adjusted to have real-world disturbances, e.g., generating various types of waves in offshore scenarios and changing the lighting conditions.

9 Conclusion and Future Work

This article introduces a RAM designed for ML classifiers, extending its initial version in zhao_assessing_2021 with more practical considerations for real-world applications of high-dimensional data and autonomous systems, e.g., the new estimator of Eqn.s (14) and (15), alternative solutions discussed in Section 4.3, and new experiments on image datasets and an AUV mission. To the best of our knowledge, it is the first ML RAM that explicitly considers both the OP information and robustness evidence. It has also allowed us to uncover some inherent challenges when assessing ML reliability. Based on the RAM, we present probabilistic safety arguments for ML components incorporating low-level VnV evidence. To complete the “big picture”, we also propose an overall assurance framework, in which a set of safety analysis activities are integrated to identify the safety targets at the whole LES level and break them down to component-level reliability requirements of ML functions. Finally, a case study based on a simulated AUV is conducted. The case study is comprehensive in terms of exercising and demonstrating all proposed methods in our assurance framework, and it also identifies key challenges with recommendations for ML models of autonomous systems.

An intuitive way of perceiving our RAM, compared with the usual accuracy testing, is that we enlarge the test set with more test cases around the “seeds” (original data-points in the test set). We determine the oracle of a new test case according to its seed’s label and the $r$-distance. These additional test results form the robustness evidence, and how much they contribute to the overall reliability is proportional to their OP. Consequently, by being exposed to more tests (robustness evaluation) and being more representative of how the model will be used (the OP), our RAM is more informative, and therefore more trustworthy. In line with the gist of our RAM, we believe that the DL reliability should follow the conceptualised equation of:

Generalisability × Robustness = Reliability

In a nutshell, this equation says that, when assessing the reliability of ML software, we should not only consider how the DL model generalises to a new data-point (according to the future OP), but also take the local robustness around that new data-point into account.

Apart from the future work mentioned in the discussion section, we also plan to conduct more real-world case studies to examine the scalability of our methods. We presume a trained ML model for our assessment purpose. A natural follow-up question is how to actually improve the reliability when our RAM results indicate that a system is not good enough. As described in zhao_detecting_2021, we plan to investigate integrating ML debug testing (e.g. huang2021coverage) and retraining methods bai_recent_2021 with the RAM, to form a closed loop of debugging-improving-assessing. Last but not least, we find the idea of dynamic assurance cases asaadi_dynamic_2020 may have a great potential for addressing some challenges we currently face in our framework.

Appendix A KDE Bootstrapping

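Bootstrapping is a statistical approach to estimating any sampling distribution by random resampling. We sample with replacement from the original data points to obtain a new bootstrap dataset and train the KDE on the bootstrap dataset. Assume the bootstrapping process is repeated $B$ times, leading to $B$ bootstrap KDEs, denoted as $\hat{f}_1(x), \dots, \hat{f}_B(x)$. Then we can estimate the variance of the KDE $\hat{f}(x)$ by the sample variance of the bootstrap KDEs chen2017tutorial:

$$\widehat{\mathrm{Var}}\big[\hat{f}(x)\big] = \frac{1}{B-1}\sum_{b=1}^{B}\big(\hat{f}_b(x) - \hat{\mu}_B(x)\big)^2,$$

where the mean $\hat{\mu}_B(x)$ can be approximated by

$$\hat{\mu}_B(x) = \frac{1}{B}\sum_{b=1}^{B}\hat{f}_b(x).$$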
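The bootstrapping procedure can be sketched in pure Python for a one-dimensional Gaussian KDE. This is a minimal illustration: the fixed bandwidth, the function names and the synthetic data are hypothetical, and a practical implementation would work in the multi-dimensional latent space with a library KDE.

```python
import math
import random
import statistics


def gaussian_kde(data, bandwidth):
    """Return a function x -> Gaussian KDE density estimate built from data."""
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data)

    return density


def bootstrap_kde_variance(data, x, bandwidth, n_boot=200, seed=0):
    """Bootstrap mean and sample variance of the KDE evaluated at point x."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        resample = rng.choices(data, k=len(data))   # sampling with replacement
        estimates.append(gaussian_kde(resample, bandwidth)(x))
    mean = statistics.fmean(estimates)
    variance = statistics.variance(estimates, xbar=mean)  # divides by n_boot - 1
    return mean, variance


# Hypothetical operational data: 200 draws from a standard normal.
_rng = random.Random(1)
sample = [_rng.gauss(0.0, 1.0) for _ in range(200)]
boot_mean, boot_var = bootstrap_kde_variance(sample, x=0.0, bandwidth=0.4)
```

The bootstrap mean should stay close to the KDE fitted on the full dataset, while the bootstrap variance quantifies how much the density estimate at that point depends on the particular sample collected.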

Appendix B Details of the Yolo and VAE Models Trained in the AUV Case Study

We present more details of the Yolo and VAE models trained in the AUV case study respectively in Table 3 and 4, while in Figure 20 four images reconstructed from the VAE model are shown as examples.

| Class | Train | | Test | |
|---|---|---|---|---|
| Pipe | 0.98343 | 0.73503 | 0.97131 | 0.72532 |
| Floating Ball | 0.85765 | 0.40094 | 0.90912 | 0.42536 |
| Gas Canister | 0.87230 | 0.62546 | 0.87406 | 0.60331 |
| Gas Tank | 0.98930 | 0.76552 | 0.99346 | 0.76824 |
| Oil Barrel | 0.84578 | 0.61437 | 0.84258 | 0.57856 |
| Docking Cage | 0.88771 | 0.32021 | 0.91076 | 0.33656 |
| mAP | 0.90603 | 0.57692 | 0.91688 | 0.57289 |

Table 3: Average Precision (AP) of YOLOv3 model for object detection.

| VAE model | Train | Test |
|---|---|---|
| Recon. Loss | 0.002601 | 0.003048 |
| KL Div. Loss | 1.732866 | 1.729756 |

Table 4: Reconstruction Loss and KL Divergence Loss of VAE model

Figure 20: Four original images (top row) and the corresponding reconstructed images (bottom row) by the VAE model.

Acknowledgments & Disclaimer

This work is supported by the UK Dstl (through the project of Safety Argument for Learning-enabled Autonomous Underwater Vehicles) and the UK EPSRC (through the Offshore Robotics for Certification of Assets [EP/R026173/1, EP/W001136/1] and End-to-End Conceptual Guarding of Neural Architectures [EP/T026995/1]). Xingyu Zhao and Alec Banks’ contribution to the work is partially supported through Fellowships at the Assuring Autonomy International Programme. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 956123. We thank Philippa Ryan for insightful comments on earlier versions of this work.

This document is an overview of UK MOD (part) sponsored research and is released for informational purposes only. The contents of this document should not be interpreted as representing the views of the UK MOD, nor should it be assumed that they reflect any current or future UK MOD policy. The information contained in this document cannot supersede any statutory or contractual requirements or liabilities and is offered without prejudice or commitment. Content includes material subject to © Crown copyright (2018), Dstl. This material is licensed under the terms of the Open Government Licence except where otherwise stated. To view this licence, visit or write to the Information Policy Team, The National Archives, Kew, London TW9 4DU, or email: