Deep learning (DL) has been applied broadly in industrial sectors including automotive, healthcare, aviation and finance. To fully exploit the potential offered by DL, there is an urgent need to develop approaches to its certification in safety-critical applications. For traditional systems, safety analysis has aided engineers in arguing that a system is sufficiently safe. However, the deployment of DL in critical systems requires a thorough revisit of that analysis to reflect the novel characteristics of Machine Learning (ML) in general [BKCF2019, alves_considerations_2018, KKB2019].
Compared with traditional systems, the behaviour of learning-enabled systems is much harder to predict, due to, inter alia, their “black-box” nature and the lack of traceable functional requirements for their DL components. The “black-box” nature hinders human operators in understanding the DL components and makes it hard to predict the system behaviour when faced with new data. The lack of explicit requirement traceability through to code implementation is only partially offset by learning from a dataset, which at best provides an incomplete description of the problem. These characteristics of DL increase apparent non-determinism [johnson_increasing_2018], which emphasises the role of probabilistic measures in capturing the uncertainty of system behaviour.
Recently, progress has been made on formal verification [HKWW2017] and coverage-guided testing [sun2018concolic] to support the Verification and Validation (V&V) of DL. Whilst these methods are insufficient by themselves to justify overall system safety claims, they may provide evidence to support low-level claims, e.g. the local robustness of a neural network on a given input. In this paper, we present a novel safety argument framework for DL models (which may in turn support higher-level system safety arguments). We focus on deep neural networks (DNNs) that have been widely deployed as, e.g., perception and control units of autonomous systems. Due to the page limit, we also confine the framework to DNNs that are fixed in operation; it can be extended to online learning DNNs in future work.
We consider safety-related properties including reliability, robustness, interpretability, fairness [barocas-hardt-narayanan], and privacy [Abadi_2016]. In particular, we emphasise the assessment of DNN generalisation error (in terms of inaccuracy), as a major reliability measure, throughout our safety case. We build arguments in two steps. The first is to provide initial confidence that the DNN’s generalisation error is bounded, through the assurance activities conducted at each stage of its lifecycle, e.g., formal verification on the DNN robustness. The second step is to adopt proven-in-use/field-testing arguments to boost the confidence and check whether the DNN is indeed sufficiently safe for the risk associated with its use in the system.
The second step above is done in a statistically principled way via Conservative Bayesian Inference (CBI) [bishop_toward_2011, strigini_software_2013, zhao_assessing_2019]. CBI requires only limited and partial prior knowledge of reliability, which differs from normal Bayesian analysis that usually assumes a complete prior distribution on the failure rate. This has a unique advantage: partial prior knowledge is more convincing (i.e. constitutes a more realistic claim) and easier to obtain, while complete prior distributions usually require extra assumptions and introduce optimistic bias. CBI allows many forms of prediction, e.g., the posterior expected failure rate [bishop_toward_2011], the future reliability of passing some demands [strigini_software_2013], or a posterior confidence in a required reliability bound [zhao_assessing_2019]. Importantly, CBI guarantees conservative outcomes: it finds the worst-case prior distribution that satisfies the partial knowledge and yields, say, a maximised posterior expected failure rate. That said, we are aware that there are other extant dangerous pitfalls in safety arguments [KKB2019, johnson_increasing_2018], thus we also identify open challenges in our proposed framework and map them onto on-going research in the ML and software engineering communities.
The key contributions of this work are:
a) A first safety case framework for DNNs that mainly concerns quantitative claims based on structured heterogeneous safety arguments.
b) Identification of open challenges in building safety arguments for quantitative claims, and mapping them onto on-going research of potential solutions.
2.1 Safety cases
A safety case is a comprehensive, defensible, and valid justification of the safety of a system for a given application in a defined operating environment; thus it is a means to provide the grounds for confidence and to assist decision making in certification [bloomfield_safety_2010]. Early research in safety cases mainly focused on their formulation in terms of claims, arguments and evidence elements based on fundamental argumentation theories like the Toulmin model [s_toulmin_uses_1958]. The two most popular notations are CAE [bloomfield_safety_2010] and GSN [kelly_arguing_1999]. In this paper, we choose the latter to present our safety case framework.
Fig. 1 shows the core GSN elements and a quick GSN example. Essentially, a GSN safety case starts with a top goal (claim), which is then decomposed through an argument strategy into sub-goals (sub-claims); sub-goals can be further decomposed until they are supported by solutions (evidence). A claim may be subject to some context or assumption. An away goal repeats a claim presented in another argument module. A description of all GSN elements used here can be found in [kelly_arguing_1999].
2.2 Deep neural networks and lifecycle models
Let $D = \{(x_i, y_i)\}_{i=1}^{n}$ be the training data, where $x_i$ is a vector of inputs and $y_i$ is a vector of outputs such that $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$. Let $\mathcal{X}$ be the input domain and $\mathcal{Y}$ be the set of labels; hence $D \subset \mathcal{X} \times \mathcal{Y}$. We may use $x$ and $y$ to range over $\mathcal{X}$ and $\mathcal{Y}$, respectively. Let $\mathcal{N}$ be a DNN of a given architecture. The network $\mathcal{N}$ can be seen as a function $f : \mathcal{X} \to \mathcal{D}(\mathcal{Y})$ mapping inputs to probability distributions over $\mathcal{Y}$. That is, $f(x)$ is a probability distribution which assigns to each possible label $y \in \mathcal{Y}$ a probability value $f_y(x)$. We let $\hat{f} : \mathcal{X} \to \mathcal{Y}$ be a function such that for any $x \in \mathcal{X}$, $\hat{f}(x) = \arg\max_{y \in \mathcal{Y}} f_y(x)$, i.e. $\hat{f}$ returns the classification label. The network is trained with a parameterised learning algorithm, in which there are (implicit) hyper-parameters representing, e.g., the number of epochs, the loss function, the learning rate, and the optimisation algorithm.
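As a minimal illustration of these definitions, the following sketch uses a toy one-layer softmax model standing in for a real DNN (all names and shapes here are illustrative): the network $f$ maps an input to a probability distribution over labels, and the induced classifier $\hat{f}$ takes the most probable label.

```python
import numpy as np

def f(x, W, b):
    """A toy one-layer 'network': maps input x to a probability
    distribution over the labels via a numerically stable softmax."""
    z = W @ x + b
    e = np.exp(z - z.max())
    return e / e.sum()

def f_hat(x, W, b):
    """The induced classifier: returns the most probable label."""
    return int(np.argmax(f(x, W, b)))

rng = np.random.default_rng(0)
W, b = rng.normal(size=(3, 4)), rng.normal(size=3)  # 4 inputs, 3 labels
x = rng.normal(size=4)
p = f(x, W, b)
assert abs(p.sum() - 1.0) < 1e-9 and (p >= 0).all()  # a valid distribution
print(f_hat(x, W, b))  # a label in {0, 1, 2}
```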
A comprehensive ML Lifecycle Model can be found in [ashmore_assuring_2019], which identifies assurance desiderata for each stage, and reviews existing methods that contribute to achieving these desiderata. In this paper, we refer to a simpler lifecycle model that includes several phases: initiation, data collection, model construction, model training, analysis of the trained model, and run-time enforcement.
2.3 Generalisation error
Generalisability requires that a neural network works well on all possible inputs in $\mathcal{X}$, although it is only trained on the training dataset $D$.

Assume that there is a ground truth function $g : \mathcal{X} \to \mathcal{Y}$ and a probability function $\mathrm{Op} : \mathcal{X} \to [0,1]$ representing the operational profile. A network $\mathcal{N}$ trained on $D$ has a generalisation error:

$$G_{\mathcal{N}} = \sum_{x \in \mathcal{X}} \mathbb{1}\{\hat{f}(x) \neq g(x)\} \cdot \mathrm{Op}(x) \tag{1}$$

where $\mathbb{1}\{S\}$ is an indicator function – it is equal to 1 when $S$ is true and 0 otherwise.
We use the notation $\mathrm{Op}(x)$ to represent the probability of an input $x$ being selected, which aligns with the operational profile notion [musa_operational_1993] in software engineering. Moreover, we use the 0-1 loss function (i.e., loss 0 for a correct classification and 1 for an incorrect one) so that, for a given $\mathcal{N}$, the generalisation error $G_{\mathcal{N}}$ is equivalent to the reliability measure pfd (the expected probability of the system failing on a random demand) defined in the safety standard IEC-61508. A “frequentist” interpretation of pfd is that it is the limiting relative frequency of demands on which the DNN fails in an infinite sequence of independently selected demands [zhao_modeling_2017]. The primary safety measure we study here is pfd, which is equivalent to the generalisation error in (1); thus we may use the two terms interchangeably in our safety case, depending on the context.
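Under the 0-1 loss, the pfd in (1) can be estimated empirically by sampling demands from the operational profile. The sketch below does this on an illustrative one-dimensional toy; the classifier, oracle and profile are hypothetical stand-ins, not part of the framework.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_pfd(classifier, oracle, sample_op, n=100_000):
    """Monte Carlo estimate of the generalisation error in (1):
    the fraction of operational demands that are misclassified."""
    xs = sample_op(n, rng)                    # demands drawn from Op
    failures = classifier(xs) != oracle(xs)   # 0-1 loss indicator
    return failures.mean()

# Toy 1-D stand-ins: the ground truth flips sign at 0, the classifier
# at 0.05, so misclassifications occur exactly on (0, 0.05].
oracle = lambda xs: (xs > 0.0).astype(int)
classifier = lambda xs: (xs > 0.05).astype(int)
sample_op = lambda n, rng: rng.uniform(-1.0, 1.0, n)  # uniform profile

print(empirical_pfd(classifier, oracle, sample_op))   # close to 0.025
```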
3 The Top-level Argument
Fig. 2 gives a top-level safety argument for the top claim G1 – the DNN is sufficiently safe. We first argue S1: that all safety related properties are satisfied. The list of all properties of interest for the given application can be obtained by utilising the Property Based Requirements (PBR) [Micouin2008] approach. The PBR method is a way to specify requirements as a set of properties of system objects in either structured language or formal notations. PBR is recommended in [alves_considerations_2018] as a method for the safety argument of autonomous systems. Without loss of generality, in this paper we focus on the major quantitative property: reliability (G2). Due to space constraints, other properties (interpretability, robustness, etc.) are discussed in Sec. 5 but remain as an undeveloped goal (G3) here.
More properties that have a safety impact can be incorporated in the framework as new requirements emerge from, e.g., ethical aspects of the DNN.
Despite the controversy over the use of probabilistic measures (e.g., pfd) for the safety of conventional software systems [littlewood_validation_2011], we believe probabilistic measures are useful when dealing with ML systems since arguments involving their inherent uncertainty are naturally stated in probabilistic terms.
Setting a reliability goal (G2) for a DNN varies from one application to another. Questions we need to ask include: (i) What is the appropriate reliability measure? (ii) What is the quantitative requirement stated in that reliability measure? (iii) How can confidence be gained in that reliability claim?
Reliability of safety critical systems, as a probabilistic claim, will be about the probabilities/rates of occurrence of failures that have safety impacts, e.g., a dangerous misclassification in a DNN. Generally, systems can be classified as either: continuous-time systems that are being continuously operated in the active control of some process; or on-demand systems, which are only called upon to act on receipt of discrete demands. Normally we study the failure rate (number of failures in one time unit) of the former (e.g., flight control software) and the probability of failure per demand (pfd) of the latter (e.g., the emergency shutdown system of a nuclear plant). In this paper, we focus on pfd which aligns with DNN classifiers for perception, where demands are e.g., images from cameras.
Given that most safety critical systems adopt a defence-in-depth design with safety backup channels [littlewood_reasoning_2012], the required reliability (e.g., in G2) should be derived from the higher level system, e.g., a 1-out-of-2 (1oo2) system in which the other channel could be hardware-only, conventional software-based, or another ML component. The required reliability of the whole 1oo2 system may be obtained from regulators or compared to human-level performance (e.g., a target of 100 times safer than average human drivers, as studied in [zhao_assessing_2019]). We remark that deriving a required reliability for individual channels to meet the whole 1oo2 reliability requirement is still an open challenge due to the dependencies among channels [littlewood_conceptual_1989, littlewood_conservative_2013] (e.g., a “hard” demand is likely to cause both channels to fail). That said, there is ongoing research towards rigorous methods to decompose the reliability of 1oo2 systems into those of individual channels, which may apply and provide insights for future work, e.g., [bishop_conservative_2014] for 1oo2 systems with one hardware-only and one software-based channel, [littlewood_reasoning_2012, zhao_modeling_2017] for a 1oo2 system with one possibly-perfect channel, and [chen_diversity_2016] utilising fault-injection techniques. In particular, for systems with duplicated DL channels, we note that there are similar techniques, e.g., (i) the ensemble method [Ponti2011], where a set of DL models run in parallel and the result is obtained by applying a voting protocol; and (ii) the simplex architecture [Sha2001], where there is a main classifier and a safer classifier, with the latter being simple enough that its safety can be formally verified. Whenever the confidence of the main classifier is low, decision making is taken over by the safer classifier; the safer classifier can be implemented with, e.g., a smaller DNN.
As discussed in [bishop_toward_2011], the reliability measure, pfd, concerns system behaviour subject to aleatory uncertainty (“uncertainty in the world”). On the other hand, epistemic uncertainty concerns the uncertainty in the “beliefs about the world”. In our context, it is about the human assessor’s epistemic uncertainty of the reliability claim obtained through assurance activities. For example, we may not be certain whether a claim – e.g., that the pfd is smaller than some stated bound – is true, due to our imperfect understanding of the assurance activities. All assurance activities in the lifecycle with supportive evidence would increase our confidence in the reliability claim, whose formal quantitative treatment has been proposed in [bloomfield_confidence:_2007, littlewood_use_2007]. Similarly to the idea proposed in [strigini_software_2013], we argue that all “process” evidence generated from the DNN lifecycle activities provides initial confidence in a desired pfd bound. The confidence in a pfd claim is then acquired incrementally through operational data of the trained DNN via CBI – which we describe next.
4 Reliability with Lifecycle Assurance
4.1 CBI utilising operational data
In Bayesian reliability analysis, assessors normally have a prior distribution of pfd (capturing the epistemic uncertainties), and update their beliefs – the prior distribution – using evidence from the observed operational data. Given the safety-critical nature, the systems under study will typically see failure-free operation or very rare failures. Bayesian inference based on such rare or absent failures may introduce dangerously optimistic bias if using a Uniform or Jeffreys prior, which describes not only one’s prior knowledge but adds extra, unjustified assumptions [zhao_assessing_2019]. Alternatively, CBI is a technique, first described in [bishop_toward_2011], which applies Bayesian analysis with only partial prior knowledge; by partial prior knowledge, we mean the following typical forms:
$E[X] \leq m$, where the random variable $X$ denotes the unknown pfd: the prior mean pfd cannot be worse than a stated value $m$;
$\Pr(X \leq \epsilon) \geq \theta$: a prior confidence bound $\theta$ on the pfd being no worse than $\epsilon$;
$\Pr(X = 0) \geq \theta$: a prior confidence in the perfection of the system;
$E[(1-X)^n] \geq \theta$: a prior confidence in the reliability of passing $n$ tests.
These can be used by CBI either solely or in combination (e.g., several confidence bounds). The partial prior knowledge is far from a complete prior distribution, thus it is easier to obtain from DNN lifecycle activities (C4). For instance, there are studies on generalisation error bounds based on how the DNN was constructed, trained and verified [he_control_2019, bagnall_certifying_2019]. We present examples of how to obtain such partial prior knowledge (G6) using evidence, e.g. from formal verification of DNN robustness, in the next section. CBI has also been investigated for various objective functions with a “posterior” flavour:
$E[X \mid \text{past operational data}]$: the posterior expected pfd [bishop_toward_2011];
$\Pr(X \leq p_{\mathrm{req}} \mid \text{past operational data})$: the posterior confidence bound on pfd [zhao_modeling_2017, zhao_assessing_2019]; $p_{\mathrm{req}}$ is normally a small pfd, stipulated at a higher level;
$\Pr(\text{the next } m \text{ demands are failure-free} \mid \text{past operational data})$: the future reliability of passing $m$ demands [strigini_software_2013].
Depending on the objective function of interest (G2 is an example of a posterior confidence bound) and the set of partial prior knowledge obtained (G6), we choose a corresponding CBI model for S2. (CBI is ongoing research proving theorems for particular combinations of objective functions and partial prior knowledge; combinations that have not yet been investigated remain open challenges.) Note, we also need to explicitly assess the impact of CBI model assumptions (G5). Published CBI theorems abstract the stochastic failure process as a sequence of independent and identically distributed (i.i.d.) Bernoulli trials given the unknown pfd, and assume the operational profile is constant [bishop_toward_2011, strigini_software_2013, zhao_assessing_2019]. Although we identify how to justify/relax those assumptions as open challenges, we note some promising ongoing research:
a) The i.i.d. assumption implies a constant pfd (a frozen system in an unchanging environment), which may not hold after a system update or deployment in a new environment. In [littlewood_reliability_2020], CBI is extended to the case of a multivariate prior distribution, which deals with scenarios of a changing pfd. Multivariate CBI may provide the basis of arguments for online learning DNNs.
b) The effect of assuming independence between successive demands has been studied, e.g., [strigini_testing_1996, galves_rare_1998]. It is believed that the effect is negligible given rare or no failures; note this requires further (preferably conservative) study.
c) Changes to the operational profile are a major challenge for all proven-in-use/field-testing safety arguments [KKB2019]. Recent research [bishop_deriving_2017] provides a novel conservative treatment for the problem, which can be retrofitted for CBI.
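The conservative flavour of CBI can be illustrated numerically. Suppose the only partial prior knowledge is a single confidence bound – a prior probability `theta` that the pfd is at most `eps` – and the objective is the posterior confidence that the pfd is at most some stipulated `p_req` after `n` failure-free demands. The sketch below searches for the worst case over two-point priors satisfying the constraint; this is an illustration of the idea only (with made-up numbers), not a substitute for the published CBI theorems, which optimise over all admissible priors.

```python
import numpy as np

def two_point_posterior(a, b, theta, n, p_req):
    """Posterior Pr(pfd <= p_req | n failure-free i.i.d. demands) under
    a two-point prior: Pr(pfd = a) = theta, Pr(pfd = b) = 1 - theta."""
    la, lb = (1.0 - a) ** n, (1.0 - b) ** n   # Bernoulli likelihoods
    post_a = theta * la / (theta * la + (1.0 - theta) * lb)
    return post_a * (a <= p_req) + (1.0 - post_a) * (b <= p_req)

def worst_case_confidence(eps, theta, n, p_req, grid=200):
    """Minimise the posterior confidence over two-point priors that
    satisfy the partial prior knowledge Pr(pfd <= eps) >= theta."""
    worst = 1.0
    for a in np.linspace(0.0, eps, grid):                    # atom within the bound
        for b in np.geomspace(eps * (1 + 1e-9), 1.0, grid):  # atom outside it
            worst = min(worst, two_point_posterior(a, b, theta, n, p_req))
    return worst

# With no operational data the worst case collapses to the prior theta;
# failure-free operation then raises the conservative confidence.
print(worst_case_confidence(eps=1e-4, theta=0.90, n=0, p_req=1e-3))
print(worst_case_confidence(eps=1e-4, theta=0.90, n=10_000, p_req=1e-3))
```

Note the conservative guarantee here: whatever prior the adversarial search picks, it must respect the stated partial knowledge, so the reported confidence is a floor rather than a point estimate.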
The safety argument via CBI is presented in Fig. 3. In summary, we collect a set of partial prior knowledge from various lifecycle activities, then boost our posterior confidence in a reliability claim of interest through operational data, in a conservative Bayesian manner. We believe this aligns with the practice of applying management systems in reality – a system is built with claims of sufficient confidence that it may be deployed; these claims are then independently assessed to confirm that the confidence is justified. Once deployed, the system's safety performance is monitored for continuing validation of the claims. Where there is insufficient evidence, systems can be fielded with the risk held by the operator, but that risk must be minimised through operational restrictions. As confidence grows, these restrictions may be relaxed.
4.2 Partial prior knowledge on the generalisation error
Our novel CBI safety argument for the reliability of DNNs is essentially inspired by the idea proposed in [strigini_software_2013] for conventional software, in which the authors seek prior confidence in the (quasi-)perfection of the software from “process” evidence like formal proofs and effective development activities. In our case, to make clear the connection between lifecycle activities and their contributions to the generalisation error, we decompose the generalisation error into three components:
a) The Bayes error is the lowest and irreducible error rate over all possible classifiers for the given classification problem [fukunaga_introduction_2013]. It is non-zero if the true labels are not deterministic (e.g., an image being labelled as $y_1$ by one person but as $y_2$ by others), thus intuitively it captures the uncertainties in the dataset and the true distribution when aiming to solve a real-world problem with DL. We estimate this error (implicitly) at the initiation and data collection stages, in activities such as necessity consideration and dataset preparation.
b) The Approximation error of $\mathcal{H}$ measures how far the best classifier in $\mathcal{H}$ is from the overall optimal classifier, after isolating the Bayes error. The set $\mathcal{H}$ is determined by the architecture of the DNN (e.g., the number of layers), thus lifecycle activities at the model construction stage are used to minimise this error.
c) The Estimation error of $\mathcal{N}$ measures how far the learned classifier $\mathcal{N}$ is from the best classifier in $\mathcal{H}$. Lifecycle activities at the model training stage essentially aim to reduce this error, i.e., performing optimisation over the set $\mathcal{H}$.
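Writing $\mathcal{H}$ for the set of classifiers realisable by the chosen architecture and $G^{*}$ for the Bayes error (notation introduced here for illustration), the three components above correspond to the identity:

```latex
G_{\mathcal{N}}
  \;=\; \underbrace{G_{\mathcal{N}} - \inf_{f \in \mathcal{H}} G_{f}}_{\text{Estimation error}}
  \;+\; \underbrace{\inf_{f \in \mathcal{H}} G_{f} - G^{*}}_{\text{Approximation error}}
  \;+\; \underbrace{G^{*}}_{\text{Bayes error}}
```

Model construction targets the Approximation term and model training the Estimation term, while the Bayes term can only be estimated, not reduced.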
Both the Approximation and Estimation errors are reducible. We believe the ultimate goal of all lifecycle activities is to reduce these two errors to 0, especially for safety-critical DNNs. This is analogous to the “possible perfection” notion for traditional software pointed to by Rushby and Littlewood [littlewood_reasoning_2012, rushby_software_2009]. That is, assurance activities, e.g., performed in support of DO-178C, can be best understood as developing evidence of possible perfection – a confidence in $\mathit{pfd} = 0$. Similarly, for safety-critical DNNs, we believe ML lifecycle activities should be seen as aiming to train a “possibly perfect” DNN in terms of the reducible Approximation and Estimation errors. Thus, we may have some confidence that the two errors are both 0 (equivalently, a prior confidence that the generalisation error equals the irreducible Bayes error, since the other two are 0, which can be used by CBI); this is indeed supported by on-going research into finding globally optimised DNNs [du_gradient_2018]. Meanwhile, on the trained model, V&V also provides prior knowledge, as shown in Ex. 1 below, and online monitoring continuously validates the assumptions underpinning the prior knowledge.
We present an illustrative example of how to obtain a prior confidence bound on the generalisation error from formal verification of DNN robustness [ruan2018global, HKWW2017]. Robustness requires that the decision making of a neural network cannot be drastically changed by a small perturbation of the input. Formally, given a real number $\epsilon > 0$ and a distance measure $\|\cdot\|$, for any input $x \in \mathcal{X}$, we require that $\hat{f}(x') = \hat{f}(x)$ whenever $\|x' - x\| \leq \epsilon$.
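To make the definition concrete, the sketch below samples perturbations inside the $\epsilon$-ball around an input and searches for a label change. This is an empirical falsification check only – a formal verifier would instead prove the absence of counterexamples – and the toy classifier and all names are illustrative:

```python
import numpy as np

def falsify_robustness(f_hat, x, eps, trials=1000, rng=None):
    """Empirical (NOT formal) check of the robustness definition:
    sample x' with ||x' - x||_inf <= eps and look for a label change.
    Returns a counterexample, or None if none was found (not a proof)."""
    rng = rng or np.random.default_rng(0)
    y = f_hat(x)
    for _ in range(trials):
        x_prime = x + rng.uniform(-eps, eps, size=x.shape)
        if f_hat(x_prime) != y:
            return x_prime          # robustness violated at x
    return None                     # no violation found

# Toy classifier: threshold at 0. An input near the decision boundary
# is fragile; one far from it survives the sampled perturbations.
f_hat_toy = lambda x: int(x[0] > 0.0)
print(falsify_robustness(f_hat_toy, np.array([0.005]), 0.05))  # finds a flip
print(falsify_robustness(f_hat_toy, np.array([0.5]), 0.05))    # None
```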
Fig. 4 shows an example of robustness verification in a one-dimensional space. Each blue triangle represents an input $x$, and the green region around each input $x$ represents all the neighbours $x'$ of $x$ which satisfy $\|x' - x\| \leq \epsilon$ and $\hat{f}(x') = \hat{f}(x)$. Now if we assume $\mathrm{Op}(x)$ is uniformly distributed (an assumption for illustrative purposes which can be relaxed for other given $\mathrm{Op}$ distributions), the generalisation error has an upper bound: the chance that the next randomly selected input does not fall into the green regions. That is, if $\lambda$ denotes the ratio of the length not covered by the green regions to the total length of the black line, then $G_{\mathcal{N}} \leq \lambda$. This said, we cannot be certain about the bound due to assumptions like: (i) the formal verification tool itself is perfect, which may not hold; (ii) any neighbour of $x$ has the same ground truth label as $x$. For a more comprehensive list, cf. [burton_confidence_2019]. Assessors need to capture the doubt (say $\delta$) in those assumptions, which leads to the prior confidence bound:

$$\Pr(G_{\mathcal{N}} \leq \lambda) \geq 1 - \delta$$
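The Fig. 4 construction can be sketched in code: merge the verified $\epsilon$-balls, compute the uncovered ratio (called `lam` below), and discount by the assessor's doubt (`delta`) to obtain a prior confidence bound usable by CBI. The inputs, the verification outcomes and `delta` are made-up illustrative data:

```python
def coverage_bound(inputs, eps, verified, lo=0.0, hi=1.0):
    """1-D illustration of Fig. 4: merge the verified eps-balls around
    each input and return the fraction of [lo, hi] NOT covered. Under
    the stated assumptions, the generalisation error is <= this ratio."""
    # Keep only the balls the formal verifier certified as robust.
    balls = sorted((x - eps, x + eps) for x, ok in zip(inputs, verified) if ok)
    covered, cur_lo, cur_hi = 0.0, None, None
    for a, b in balls:
        a, b = max(a, lo), min(b, hi)         # clip to the input domain
        if cur_hi is None or a > cur_hi:      # disjoint ball: flush previous
            if cur_hi is not None:
                covered += cur_hi - cur_lo
            cur_lo, cur_hi = a, b
        else:                                 # overlapping ball: merge
            cur_hi = max(cur_hi, b)
    if cur_hi is not None:
        covered += cur_hi - cur_lo
    return 1.0 - covered / (hi - lo)

# Five training inputs, eps = 0.05; the ball around 0.5 failed verification.
lam = coverage_bound([0.1, 0.3, 0.5, 0.7, 0.9], 0.05,
                     [True, True, False, True, True])
delta = 0.01   # assessor's doubt in the verification assumptions
# Partial prior knowledge for CBI: Pr(G <= lam) with confidence >= 1 - delta
print(lam, 1.0 - delta)
```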
So far, we have presented an instance of the safety argument template in Fig. 5. The solution So2 is the formal verification result showing $G_{\mathcal{N}} \leq \lambda$, and G9 in Fig. 5 quantifies the confidence in that result. It is indeed an open challenge to rigorously develop G8 further, which may involve scientific ways of eliciting expert judgement [ohagan_uncertain_2006] and systematically collecting process data (e.g., statistics on the reliability of verification tools). However, we believe this challenge – evaluating confidence in claims, either quantitatively or qualitatively (e.g., ranking with low, medium, high), explicitly or implicitly – is a fundamental problem for all safety-case-based decision-making [denney_towards_2011, bloomfield_confidence:_2007, zhao_new_2012, wang_confidence_2017], rather than a specific problem of our framework.
The sub-goal G9 represents the mechanism of online monitoring of the validity of offline activities, e.g., validating the environmental assumptions used by offline formal verification against the real environment at runtime [ferrando_verifying_2018].
5 Other Safety Related Properties
So far we have seen a reliability-centric safety case for DNNs. Recall that, in this paper, reliability is the probability of misclassification (i.e. the generalisation error in (1)) that has safety impacts. However, there are other DNN safety related properties concerning risks not directly caused by a misclassification (like interpretability, fairness, and privacy), which we discuss below.
Interpretability is about an explanation procedure to present an interpretation of a single decision within the overall model in a way that is easy for humans to understand. There are different explanation techniques aiming at different objects; see [Huangsurvey2018] for a survey. Here we take instance explanation as an example: the goal is to find another representation $\mathrm{expl}(\hat{f}, x)$ of an input $x$, with the expectation that $\mathrm{expl}(\hat{f}, x)$ carries simple, yet essential, information that can help the user understand the decision $\hat{f}(x)$. We write $x \models \mathrm{expl}(\hat{f}, x)$ to denote that the explanation is consistent with a human's explanation of $g(x)$. Thus, similarly to (1), we can define a probabilistic measure for instance-wise interpretability:

$$G^{\mathrm{expl}}_{\mathcal{N}} = \sum_{x \in \mathcal{X}} \mathbb{1}\{x \not\models \mathrm{expl}(\hat{f}, x)\} \cdot \mathrm{Op}(x) \tag{2}$$
Then, similarly to the argument for reliability, we can perform statistical inference with this probabilistic measure. For instance, as in Ex. 1, we (i) first define the robustness of explanations in norm balls, measuring the percentage of the input space that has been verified, as a bound on the measure; (ii) then estimate the confidence in the robust-explanation assumption and obtain a prior confidence in interpretability; and (iii) finally apply Bayesian inference with runtime data.
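As a toy numerical illustration of the starting point for step (iii), the instance-wise interpretability measure can be estimated operationally just like the pfd. Here the explanation technique (the most influential input feature) and the human judgement are hypothetical stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy instance explanation: the single most influential input feature
# (an illustrative stand-in for a saliency-style explanation technique).
explainer = lambda xs: np.argmax(np.abs(xs), axis=1)

# Hypothetical human judgement: feature 0 is the one that matters.
human_consistent = lambda expl: expl == 0

def empirical_interp_error(n=50_000):
    """Monte Carlo estimate of the interpretability measure: the
    operational probability that the model's explanation is NOT
    consistent with the human's."""
    xs = rng.uniform(-1, 1, size=(n, 2))   # uniform Op over 2-D inputs
    return (~human_consistent(explainer(xs))).mean()

print(empirical_interp_error())   # near 0.5 for this symmetric toy
```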
Fairness requires that, when using DL to predict an output, the prediction remains unbiased with respect to some protected features. For example, a financial service company may use DL to decide whether or not to provide loans to an applicant, and it is expected that such decision should not rely on sensitive features such as race and gender. Privacy is used to prevent an observer from determining whether or not a sample was in the model’s training dataset, when it is not allowed to observe the dataset directly. Training methods such as [Abadi_2016] have been applied to pursue differential privacy.
The lack of fairness or privacy may cause not only a significant monetary loss but also ethical issues. Ethics has been regarded as a long-term challenge for AI safety. For these properties, we believe the general methodology suggested here still works – we first introduce bespoke probabilistic measures according to their definitions, obtain prior knowledge on the measures from lifecycle activities, then conduct statistical inference during the continuous monitoring of the operation.
6 Related Work
Alves et al. [alves_considerations_2018] present a comprehensive discussion of the aspects that need to be considered when developing a safety case for increasingly autonomous systems that contain ML components. In [BKCF2019], a safety case framework addressing specific challenges for ML is proposed. [SS2020] reviews available certification techniques from the aspects of lifecycle phases, maturity, and applicability to different types of ML systems. In [KKB2019], safety arguments that are widely used for conventional systems – including conformance to standards, proven-in-use, field testing, simulation and formal proofs – are recapped for autonomous systems, with discussions of the potential pitfalls. Similar to our CBI arguments that exploit operational data, [matsuno_tackling_2019, ishikawa_continuous_2018] propose utilising continuously updated arguments to monitor weak points and the effectiveness of their countermeasures. The work [asaadi_towards_2019] is also interested in quantitative claims in safety assurance arguments, identifying the applicable quantitative measures of assurance and characterising the associated uncertainty probabilistically.
Regarding the safety of automated driving, [SS2020b, rudolph_consistent_2018, SC2018] discuss the extension and adaptation of ISO-26262, and [burton_making_2017] considers functional insufficiencies in DL-based perception functions. Additionally, [picardi_pattern_2019, picardi_perspectives_2019] explore safety case patterns that are reusable for DL in the context of medical applications, while [osborne_uas_2019] reviews the safety case approach as useful for assuring the safety of drones.
Formal verification [HKWW2017, katz2017reluplex, xiang2017output, GMDTCV2018, LM2017, wicker2018feature, RHK2018, wu2018game, ruan2018global, LLYCH2018] and coverage-guided testing [sun2018concolic, PCYJ2017, sun2018testing-b, ma2018deepgauge, SHKSHA2019, sun2018concolicb] currently form the two major classes of V&V techniques for DL, from which a collection of evidence may be obtained to support the partial prior knowledge. The reader is referred to a recent survey [Huangsurvey2018] for an introduction to and summary of these techniques.
7 Discussions, Conclusions and Future Work
In this paper, we present a novel safety argument framework for DNNs using probabilistic risk assessment, mainly considering quantitative reliability claims and generalising the idea to other safety related properties. We emphasise the use of probabilistic measures to describe the inherent uncertainties of DNNs in safety arguments, and conduct Bayesian inference to strengthen the top-level claims using safe operational data collected through continuous monitoring after deployment.
Bayesian inference requires prior knowledge, so we propose a novel view by (i) decomposing the DNN generalisation error into distinct components and (ii) mapping each lifecycle activity to the reduction of one of these errors. Although we have shown an example of obtaining priors from robustness verification of DNNs, it is non-trivial (and identified as an open challenge) to establish a quantitative link between other lifecycle activities and the generalisation error. Expert judgement and past experience (e.g., a repository of DNNs developed by similar lifecycle activities) seem inevitable in overcoming such difficulties.
Thanks to the CBI approach – Bayesian inference with limited and partial prior knowledge – even with sparse prior information (e.g., a single confidence bound on the generalisation error obtained from robustness verification), we can still apply probabilistic inference given the operational data. Whenever there are sound arguments to obtain additional partial prior knowledge, CBI can incorporate them as well, reducing the conservatism in the reasoning [bishop_toward_2011]. On the other hand, CBI, as a type of proven-in-use/field-testing argument, suffers from some of the fundamental limitations highlighted in [KKB2019, johnson_increasing_2018], for which we have identified on-going research towards potential solutions.
We concur with [KKB2019] that, despite the dangerous pitfalls of various existing safety arguments, credible safety cases require a heterogeneous approach. Our new quantitative safety case framework provides a novel supplementary approach to existing frameworks rather than replacing them. We plan to conduct concrete case studies and continue to work on the open challenges identified.
This document is an overview of UK MOD (part) sponsored research and is released for informational purposes only. The contents of this document should not be interpreted as representing the views of the UK MOD, nor should it be assumed that they reflect any current or future UK MOD policy. The information contained in this document cannot supersede any statutory or contractual requirements or liabilities and is offered without prejudice or commitment.
Content includes material subject to © Crown copyright (2018), Dstl. This material is licensed under the terms of the Open Government Licence except where otherwise stated. To view this licence, visit http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3 or write to the Information Policy Team, The National Archives, Kew, London TW9 4DU, or email: email@example.com.