Since the dawn of history, humans have designed, implemented and adopted tools to make tasks easier to perform, often improving efficiency, safety, or security. Indeed, recent studies show a direct relationship between increasing technological complexity, cognitive evolution and cultural variation.
When such tools were simple, the person using the tool had full control over the way the tool should be operated, understood why it worked in that way, knew how the tool should be used to comply with existing rules, and when such rules might be broken if the situation demanded an exceptional use of the tool. For example, our early ancestors could use a hammer for building artefacts, knew why the hammer could be used for their purposes, followed the rules of not using it as a weapon against other humans, but might have chosen to break this rule if their families were in danger (Figure 1).
However, as tools became more complex and developed into systems composed of many different parts, users lost their broad view on how the system, or even some of its components, worked and – without that know-how – they lost part of their control over the system. But users still retained the capability of using systems following the rules, and breaking the rules if needed. By delegating the control of some basic tasks to the system itself, users gained in efficiency at the expense of exhaustive control (Figure 2).
Nowadays, the sophisticated systems that we rely on have become so complex that our awareness of what actually happens when we exploit some of their functionality is often close to zero. For example, how many people know how a cloud storage system works? Or the complex link between a vehicle’s brake pedal and the vehicle speed? Even if we are domain experts, we barely know the complete event/data flow initiated by just pressing one button. This is all the more true with the rise of auto-* and self-* systems (auto-pilots, self-driving cars, self-configuring industrial equipment, etc.). We therefore can no longer just delegate the control of basic operations. If we want a car to drive by itself, we must also delegate to it the requirement to follow the road traffic rules (Figure 3).
So far, however, a self-driving car is neither designed nor expected to make decisions in moral-ethical situations. When ethics, or even merely out-of-scope situations, bear upon autonomous operation, the human must still be responsible. As an example, if a self-driving car has a mechanical/software failure in a dangerous situation or if it encounters a safety dilemma, responsibility is transferred to the human.
Nevertheless, due to the delegation of more and more capabilities from humans to machines, the scenario depicted in Figure 4 – where the human is replaced by an autonomous system – is becoming more realistic. This scenario of full autonomy raises many ethical, legal, social, methodological and technical issues. In this article we address the crucial question: “How can the reliability of such an autonomous software system be certified?”
Before exploring this challenging question, we need to define the terminology used in the sequel. By ‘we’ this article means the authors. When we want to indicate some more general class of individuals, such as ‘the scientific community’, or ‘humankind’, we will explicitly use those terms.
We start with reliability. The term ‘reliable’ means ‘suitable or fit to be relied on’. For systems offering a service or function, reliability means that the service or function is available when needed. A software system is reliable to the extent that it meets its requirements consistently, namely that it makes good decisions in all situations. In some situations, a good decision is simply one that follows given rules, for instance, choosing to stop at a red traffic light. However, in other, hopefully rare situations, rules may need to be overridden, for instance, temporarily driving on the wrong side of the road to avoid a crash.
Answering the question of what a ‘good’ decision is lies outside the scope of this paper. Ethical decision making has been widely studied by psychologists and philosophers such as Lawrence Kohlberg, who developed the theory of stages of moral development [153, 154, 155], and different cultures have different attitudes towards the notion of a ‘good’ decision. Our contribution is not on the philosophical challenges raised by the question, but on the technological ones.
Reliability is often associated with the notion of certification, ‘a proof or a document proving that someone is qualified for a particular job, or that something is of good quality’; besides the document, certification also refers to ‘the process of giving official or legal approval to a person, company, product, etc., that has reached a particular standard’. Human professionals can be certified, and the idea is not new: guilds of arts and crafts were born in the 12th century in many European cities, to regulate and protect the activities of those belonging to the same professional category. Being part of a guild was a certification of the craftsman’s or merchant’s professional skills. As soon as machines partially or completely supplemented professionals, the need to certify machines arose – at least in terms of safety, if not functionality; this is also true of software. Certification of software reliability is a lively research area in software engineering, as will be discussed in Section 2.2.
We define a system to be autonomous if it can make its own decisions and act on them, without external (human) supervision and control. For example, a mobile robot can be completely remote-controlled, in which case it is not autonomous, or it can have a built-in control unit that decides on its moves, such that it becomes semi-autonomous. Of course, the boundary separating fully autonomous from non-autonomous systems is not black and white. For example, the robot may be allowed some degree of autonomy, e.g., in path planning, whereas the overall movement goal is imposed by some remote controller.
The levels of autonomy that we will use to classify examples of systems from different domains in Section 4 roughly follow the six-grade scale given for autonomous road vehicles by SAE International, although that standard does not include our ‘low’ layer:
No autonomy: The operator is responsible for all tasks.
Low autonomy: Straightforward (but non-trivial) tasks are done entirely autonomously (no human poised to take over operation).
Assistance systems: The operator is assisted by automated systems, but either remains in control to some extent or must be ready to take back control at any time.
Partial autonomy: The automated system takes full control of the system, but the operator must remain engaged, monitor the operation and be prepared to intervene immediately.
Conditional autonomy: The automated system has full control of the operation during specified tasks; the operator can safely turn their attention away but must still be prepared to intervene upon request.
High autonomy: The automated system is capable of performing all planned functions under certain circumstances (e.g., within a certain area); the operator may safely leave the system alone.
Full autonomy: The system can perform all its intended tasks on its own; no human intervention is required at any time.
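The levels listed above can be captured as an ordered enumeration, which is convenient when classifying systems programmatically. The following is only a minimal sketch: the names `AutonomyLevel`, `OPERATOR_ROLE` and `human_is_fallback` are illustrative assumptions, the numeric ordering simply follows the order of the list, and the one-line role summaries are paraphrases rather than normative definitions.

```python
from enum import IntEnum


class AutonomyLevel(IntEnum):
    """Levels in the order they are listed above (ordering is an assumption)."""
    NO_AUTONOMY = 0
    LOW = 1
    ASSISTANCE = 2
    PARTIAL = 3
    CONDITIONAL = 4
    HIGH = 5
    FULL = 6


# Paraphrased summary of what the human operator does at each level.
OPERATOR_ROLE = {
    AutonomyLevel.NO_AUTONOMY: "performs all tasks",
    AutonomyLevel.LOW: "is not poised to take over the delegated tasks",
    AutonomyLevel.ASSISTANCE: "stays in control or ready to take back control",
    AutonomyLevel.PARTIAL: "monitors continuously and intervenes immediately",
    AutonomyLevel.CONDITIONAL: "may look away but intervenes upon request",
    AutonomyLevel.HIGH: "may leave the system alone under certain circumstances",
    AutonomyLevel.FULL: "never needs to intervene",
}


def human_is_fallback(level: AutonomyLevel) -> bool:
    """True for the levels at which the operator must be ready to take over."""
    return level in {
        AutonomyLevel.ASSISTANCE,
        AutonomyLevel.PARTIAL,
        AutonomyLevel.CONDITIONAL,
    }


if __name__ == "__main__":
    for level in AutonomyLevel:
        print(f"{level.name:12} operator {OPERATOR_ROLE[level]}")
```

Treating the levels as a single ordered scale is a simplification, since the ‘low’ layer describes fully delegated simple tasks rather than a point on the road-vehicle scale.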
In addition to defining the level of autonomy, we also consider the scope of autonomy. This is the level of functionality of the system’s autonomous capabilities. For example, one vacuum cleaner might have autonomous capabilities that only encompass traversing a space and avoiding obstacles, while another, more sophisticated model, may also be able to schedule its cleaning to avoid disruption to the human’s schedule. We would say that the second model has greater scope of autonomy. The scope and level of autonomy can sometimes be a tradeoff: increasing the scope may involve the system doing things that it cannot do fully autonomously, whereas a system with more limited scope may be able to have higher autonomy.
We are particularly interested in fully autonomous systems that can also make their own decisions on safety-critical actions, namely actions whose failure could result in loss of life, significant property damage or damage to the environment. [Footnote 1: Adapted from the definition of ‘Safety Critical System’ provided by J. C. Knight.] Additionally, autonomous systems are often characterised by the need to balance pursuing objectives over a long time period (being proactive), with responding to environmental and system changes (being reactive).
In the sequel, we will also make strong use of the notion of an ‘agent’. An autonomous software agent (‘agent’ for short) is an autonomous software system that captures the ability to decide or act independently, while also balancing between being proactive and reactive. We follow standard usage in the field in defining a multiagent system as a system that includes multiple such agents, which may interact in various ways (e.g., communicating using messages or via the environment): see the seminal works by Jennings N. R., Sycara K. P., and Wooldridge M. [148, 232, 233]. Finally, we consider rational agents as those that are structured in terms of intentional concepts, such as goals, beliefs, and intentions (synonymously, the terms ‘cognitive agent’ or ‘intelligent agent’ are sometimes used in the literature).
Figure 5 compares different domains of autonomous systems in terms of the expected autonomy and available regulation. Although the scope of (expected) autonomy and the level of regulation cannot be measured precisely, the figure highlights that there are systems (top left quadrant) with considerable scope for autonomy, but limited available regulation. [Footnote 2: We use the scope of autonomy in this figure, rather than the level of autonomy, because for the systems considered there is a tradeoff: the systems vary in the scope of autonomy, but for many of these systems the scope is set (by designers) in order to allow the system to display high levels of autonomy, making scope of autonomy a more useful differentiator than the level of autonomy.] These are the sorts of systems that particularly require work to be able to shift them across to the right by increasing the available regulation. We discuss each of these domains in Section 4, with the exception of remote surgical robots, since there is not enough autonomy permitted to such systems.
It is worth noting that although many systems can be viewed as being rational agents, we only do so when there is benefit in adopting an intentional stance and viewing the system in these terms. For example, a thermostat makes decisions, and we could ascribe it a goal to keep the room at a fixed temperature. However, the behaviour of a thermostat is simple enough that there is no benefit to viewing it in terms of goals and beliefs.
It is important to highlight that, for purposes of certification, or other regulatory procedures, we sometimes need to consider not just what a software system did, but also why it did it. For instance, there is a difference between a car breaking the speed limit because it has an incorrect belief about the speed limit, and a car going too fast because it believes that temporarily speeding is the best, or even the only, way to avoid an accident. [Footnote 3: Here we see the interplay between norms of different types: the legal position in Germany at the time of writing is that one is not allowed to exceed speed limits even in life-threatening situations. The argument is that even in order to save a human life one is not supposed to endanger another one.]
1.2 Audience, Contributions and Structure
This article assesses what is needed in order to provide verified reliable behaviour of an autonomous system, analyses what can be done as the state of the art in automated verification, and proposes a roadmap towards developing certification and broader regulation guidelines.
This article thus has three audiences. Firstly, we address regulators, who might find the proposed roadmap useful as a path towards being able to meaningfully regulate these sorts of systems. Secondly, engineers and developers who develop such systems might find it useful in seeing how/where these systems need greater analysis. Thirdly, academic researchers can advance the state of the art by finding better ways of dealing with the challenges that we articulate.
We advance the literature by
proposing a framework for viewing (and indeed building) autonomous systems in terms of three layers;
showing that this framework is general, by illustrating its application to a range of systems, in a range of domains;
discussing how certification/regulation might be achieved, breaking it down by the three layers; and
articulating a range of challenges and future work, including challenges to regulators, to researchers, and to developers.
The remainder of the article is structured as follows. Section 2 reviews the current situation in terms of regulation and certification of (semi-)autonomous systems, and of the issues still open. Section 3 assesses what could be done in the near future; it develops our three-layer reference framework, discusses what we need from regulators, proposes a process for deriving verification properties, and reviews verification techniques in more detail. Section 4 discusses a set of case studies in different application domains. Section 5 looks at challenges in research, engineering and regulation. Section 6 summarises and indicates future directions.
2 Looking Back
All systems, be they autonomous or not, that operate in a human society need to conform to some legal requirements. These legal requirements may be generic and apply to all products, or specific. Often these requirements are based on regulations, which we define as ‘rules, policies and laws set out by some acknowledged authority to ensure the safe design and operation of systems’. [Footnote 4: The definition provided by the Cambridge Dictionary is ‘the rules or systems that are used by a person or organisation to control an activity or process’. We customise this definition for systems that may perform safety-critical actions.]
Relating to the concept of regulation, in the context of this paper certification can be specified as ‘the determination by an independent body that checks whether the systems are in conformity or compliant with the above regulations’. Certification involves a legal, rather than scientific, assessment and usually appeals to external review, typically by some regulator.
The certification processes, and hence regulators, in turn appeal to standards, namely documents (usually produced by a panel of experts) providing guidance on the proving of compliance.
There is a plethora of different standards, issued by a wide range of different standardisation organisations. Amongst the best known are CENELEC, IEC, IEEE and ISO, to name just a few. Many of these organisations provide generic standards relevant across many (autonomous) system domains. For particular sectors, the regulatory bodies – and there may be several for each sector – have a range of specific standards. In Section 2.1 we present some of the most relevant existing standards, and in Section 2.2 we overview some methods and tools suitable for certification of software systems. It is important to note, however, that it is not currently possible to move from written standards to formal specifications that can be fed to tools able to check, verify, and certify the system’s behaviour. Also, most existing standards say little, if anything, about ‘autonomy’ and ‘uncertainty’, uncertainty being the situation in which autonomy is most needed, but also most dangerous. Nevertheless, they prescribe important properties that systems should aim to comply with. Section 2.3 addresses some issues raised by autonomous systems that are not (yet) satisfactorily addressed by current standards and regulations, including how we might link together the achievements described in the first two sections, and how we might deal with autonomy and uncertainty.
Tables 1 to 5 present some standards grouped by domains where autonomy potentially plays a crucial role. The most sophisticated illustrative examples in Section 4 are taken from these domains. We do not claim this to be either exhaustive or systematic: this section is only meant to give the reader an idea of the complexity and wide variety of existing standards by providing examples issued by different organisations. It is important to note that there is a vast array of standards, many of which are never used by any regulator.
Table 1 illustrates some standards in the robotics domain. Most of them come from ISO. A Technical Committee of the ISO created in 2015  is in charge of the standardisation of different robotics fields, excluding toys and military applications. In 2015, IEEE developed an ontology for agreeing on a shared terminology in robotics, and delivered it as a standard.
|ISO 13482 ||Robots and robotic devices – Safety requirements for personal care robots||2014||Requirements and guidelines for the inherently safe design, protective measures, and information for use of personal care robots, in particular mobile servant robot; physical assistant robot; person carrier robot.|
|IEEE 1872 ||IEEE Standard Ontologies for Robotics and Automation||2015||A core ontology that specifies the main, most general concepts, relations, and axioms of robotics and automation, intended as a reference for knowledge representation and reasoning in robots.|
|ISO/TS 15066 ||Robots and robotic devices – Collaborative robots||2016||Safety requirements for collaborative industrial robot systems and the work environment, supplementing the requirements and guidance on collaborative industrial robot operation given in ISO 10218-1 and ISO 10218-2.|
|ISO/TR 20218-1, ISO/TR 20218-2 [145, 143]||Robotics – Safety design for industrial robot systems – Part 1 (End-effectors) & Part 2 (Manual load/unload stations)||2017, 2018||Applicable to robot systems for manual load/unload applications in which a hazard zone is safeguarded by preventing access to it, and both access restrictions to hazard zones and ergonomically suitable work places must be considered. Guidance on safety measures for the design and integration of end-effectors used for robot systems.|
|ISO/TR 23482-2 ||Robotics – Application of ISO 13482 – Part 2: Application guidelines||2019||Guidance on the use of ISO 13482 to facilitate the design of personal care robots in conformity with ISO 13482, including new terms and safety requirements introduced to allow close human-robot interaction and human-robot contact in personal care robot applications.|
Table 2 summarises one IEC standard dealing with medical equipment. Many other standards exist in this domain, including more than 400 health-related standards issued by ISO through three Technical Committees dealing with medical equipment [129, 130, 131] and one dealing with health informatics. We selected IEC/TR 60601-4-1 as an example from the medical technologies domain because it focuses on equipment with ‘a degree of autonomy’.
|IEC/TR 60601-4-1 ||Medical electrical equipment – Part 4-1: Guidance and interpretation||2017||Guidance to a detailed risk management and usability engineering processes for medical electrical equipment (MEE) or a medical electrical system (MES), employing a degree of autonomy (DOA) & guidance on considerations of basic safety and essential performance for an MEE and MES with a DOA.|
Nearly 900 ISO standards have been developed for the automotive sector. One of the most influential is ISO 26262, born as an adaptation of the Functional Safety standard IEC 61508 for Automotive Electric/Electronic Systems. Published in 12 individual parts, ISO 26262 was updated in 2018 to keep abreast of today’s new and rapidly evolving technologies, and to be relevant to even more applications. IEEE is also developing standards in the automotive sector, ranging from public safety in transportation-related events to system image quality. More than three dozen IEC technical committees and subcommittees cover the standardisation of equipment used in and related to road vehicles as well as of other associated issues. As an example, the IEC TC 69 is preparing international standards for road vehicles, totally or partly electrically propelled from self-contained power sources, and for electric industrial trucks. Table 3 presents one standard for each of the three organisations above, ISO, IEEE, and IEC.
|IEEE-P2020 ||Standard for Automotive System Image Quality||2016||This standard addresses the fundamental attributes that contribute to image quality for automotive Advanced Driver Assistance Systems applications, as well as identifying existing metrics and other useful information relating to these attributes.|
|ISO 26262 ||Road vehicles – Functional safety||2018||Safety is one of the key issues in the development of road vehicles. With the trend of increasing technological complexity, software content and mechatronic implementation, there are increasing risks from systematic failures and random hardware failures, these being considered within the scope of functional safety. The ISO 26262 series of standards includes guidance to mitigate these risks by providing appropriate requirements and processes.|
|IEC 63243 ED1 ||Interoperability and safety of dynamic wireless power transfer (WPT) for electric vehicles||2019||The draft of this standard, developed by the IEC TC 69, will be circulated at the end of 2019. It will specify definition and conditions of interoperability and safety for magnetic-field dynamic WPT for electric vehicles and the associated safety requirements.|
Compared to other domains, railway homologation and operation is strictly regulated. The IEC Technical Committee 9 is responsible for the international standardisation of the electrical equipment and systems used in railways. The ISO Technical Committee 269 complements IEC TC 9 by addressing the standardisation of all systems, products and services specifically related to the railway sector, not already covered by IEC TC 9. Both work in close relationship with the International Union of Railways (UIC) and the International Association of Public Transport (UITP). Through the CENELEC 50128 standard, CENELEC assesses the conformity of software for use in railway control which may have an impact on safety, i.e., software whose failures can affect safety functions. Table 4 exemplifies standards in the railway sector by presenting one standard from ISO dealing with project management, one series from IEC dealing with reliability, availability, maintainability and safety, and the CENELEC 50128 standard.
|IEC 62278 series [119, 121, 122]||Railway applications – Specification and demonstration of reliability, availability, maintainability and safety (RAMS)||2002, 2010, 2016||The documents under the IEC 62278 umbrella provide Railway Authorities and railway support industry with a process which will enable the implementation of a consistent approach to the management of reliability, availability, maintainability and safety (RAMS). The process can be applied systematically by a Railway Authority and railway support industry, throughout all phases of the life cycle of a railway application, to develop railway specific RAMS requirements and to achieve compliance with these requirements.|
|CENELEC 50128 ||Railway applications – Communication, signalling and processing systems – Software for railway control and protection systems||2011||Specification of the process and technical requirements for the development of software for programmable electronic systems for use in railway control and protection applications, aimed at use in any area where there are safety implications.|
|ISO/TR 21245 ||Railway applications – Railway project planning process – Guidance on railway project planning||2018||Guidance on railway project planning for decision making, based upon the principles of ISO 21500 , by incorporating characteristics specific to railway projects. The document is meant to be used by any type of organisation and be applied to any type of railway project, irrespective of its complexity, size, duration. It provides neither detailed requirements nor specific processes for certification.|
The quantity of existing standards in the aerospace domain is huge. Established in 1947, ISO/TC 20 is one of the oldest and most prolific ISO technical committees. IEEE has published nearly 60 standards dealing with aerospace electronics, and IEC has two Technical Committees dealing with avionics-related issues [117, 116]: these committees developed about 30 standards. Other relevant standards bodies must be mentioned as well. The mission of the European Union Aviation Safety Agency (EASA) is to ensure the highest common level of safety protection for EU citizens and of environmental protection; to provide a single regulatory and certification process among Member States; to facilitate the internal aviation single market and create a level playing field; and to work with other international aviation organisations and regulators. The US Federal Aviation Administration (FAA) summarises its mission as ‘to provide the safest, most efficient aerospace system in the world’. Finally, the US Radio Technical Commission for Aeronautics (RTCA) aims at being ‘the premier Public-Private Partnership venue for developing consensus among diverse and competing interests on resolutions critical to aviation modernisation issues in an increasingly global enterprise’. In Table 5 we present standards from EASA, FAA, and RTCA, including two standards dealing with Unmanned Aircraft Systems and drones.
|RTCA DO-254 ||Design Assurance Guidance for Airborne Electronic Hardware||2000||This document is intended to help aircraft manufacturers and the suppliers of aircraft electronic systems assure that electronic airborne equipment safely performs its intended function. The document also characterises the objective of the design life cycle processes and offers a means of complying with certification requirements.|
|RTCA DO-333 ||Formal Methods Supplement to DO-178C and DO-278A||2011||Additions, modifications and substitutions to DO-178C (see below) and DO-278A  objectives when formal methods are used as part of a software life cycle, and the additional guidance required. It discusses those aspects of airworthiness certification that pertain to the production of software, using formal methods for systems approved using DO-178C.|
|RTCA DO-178B, DO-178C/ED-12C [190, 194]||Software Considerations in Airborne Systems and Equipment Certification||2012||Recommendations for the production of software for airborne systems and equipment that performs its intended function with a level of confidence in safety that complies with airworthiness requirements. Compliance with the objectives of DO-178C is the primary means of obtaining approval of software used in civil aviation products.|
|FAA Part 107 ||Operation and Certification of Small Unmanned Aircraft Systems||2016||Addition of a new part 107 to Title 14 Code of Federal Regulations  to allow for routine civil operation of small Unmanned Aircraft Systems (UAS) in the National Airspace System and to provide safety rules for those operations. The rule limits small UAS to daylight and civil twilight operations with appropriate collision lighting, confined areas of operation, and visual-line-of-sight operations.|
|Regulation (EU) 2018/1139 ||Regulation (EU) 2018/1139 of the European Parliament and of the Council of 4 July 2018||2018||First EU-wide regulations for civil drones with a strong focus on the particular risk of the operations. The regulations take into account the expertise of many international players in the drone domain; they will allow remotely piloted aircraft to fly safely in European airspace and bring legal certainty for this rapidly expanding industry.|
Having reviewed relevant standards in various domains, we next turn to briefly reviewing techniques for certification of software systems.
2.2 Certification of Traditional Software Systems
In the late 1980s, with software applications becoming more and more pervasive and safety-critical, many scientists began to address the problem of certifying them. One of the first papers in this research strand was ‘Certifying the Reliability of Software’. It proposed a certification procedure consisting of executable product increments, representative statistical testing, and a standard estimate of the mean time to failure of the system product at the time it was released. Subsequently, Wohlin C. and Runeson P. presented a more mature method of certification, consisting of five steps, and suitable for certification of both components and full systems: 1) modelling of software usage, 2) derivation of usage profile, 3) generation of test cases, 4) execution of test cases and collection of failure data, and 5) certification of reliability and prediction of future reliability. Further, Poore J. H., Mills H. D., and Mutchler D. also pointed out that certification should be based on first generating inputs according to the system’s intended use, and then conducting statistical experiments to analyse them. The idea that ‘if a component has been verified by a mathematical proof of correctness, you may be able to attribute a high degree of reliability to it’ was explicitly stated there. This paved the way to works where the software certification of safety-critical systems was based on formal methods. It is worth noting that the already mentioned IEC 61508 standard recommends that formal methods be used in software design and development in all but the lowest Safety Integrity Levels.
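The five-step, usage-based procedure described above can be illustrated with a toy sketch. It is not the method of the cited authors: the usage profile, the function names (`generate_test_cases`, `certify`, `flaky_system`) and the naive point estimates are all assumptions made purely for illustration, and a real certification would use a proper statistical reliability model with confidence bounds.

```python
import random


def generate_test_cases(usage_profile, n, rng):
    """Step 3: sample n operations from the usage profile (output of steps 1-2)."""
    operations = list(usage_profile)
    weights = [usage_profile[op] for op in operations]
    return rng.choices(operations, weights=weights, k=n)


def certify(system_under_test, usage_profile, n=1000, seed=0):
    """Steps 4-5: execute the cases, collect failures, give naive estimates."""
    rng = random.Random(seed)
    cases = generate_test_cases(usage_profile, n, rng)
    # The system under test returns True on success and False on failure.
    failures = sum(1 for op in cases if not system_under_test(op))
    return {
        "runs": n,
        "failures": failures,
        # Fraction of sampled runs that succeeded (point estimate only).
        "reliability": 1 - failures / n,
        # Mean number of runs per observed failure; infinite if none observed.
        "mean_runs_to_failure": n / failures if failures else float("inf"),
    }


# Hypothetical usage profile: relative frequency of each operation in the field.
PROFILE = {"read": 0.70, "write": 0.25, "delete": 0.05}


def flaky_system(operation):
    """Toy system under test that fails on every 'delete' operation."""
    return operation != "delete"


if __name__ == "__main__":
    print(certify(flaky_system, PROFILE))
```

Because test cases are drawn according to expected usage, a rarely used but always-failing operation lowers the estimated reliability only slightly, which is exactly the property (and the limitation) of usage-based statistical testing.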
Among this wide range of work, we mention those by Heitmeyer C. L. et al. [92, 93], where certification is achieved by annotating the code with preconditions and postconditions, exploiting a five-step process for establishing the property to be verified, and finally demonstrating that the code satisfies the property. In contrast to previous work by Benzel T. and by Whitehurst R. A. and Lunt T. F. [27, 226] in the operating systems and database domains, Heitmeyer C. L. et al. addressed the problem of making the verification of security-critical code affordable.
Many mature formal and semi-formal techniques are widely used to certify software: model checking, theorem proving, static analysis, runtime verification, and software testing. Although these techniques are well established and would therefore belong in a look back, we discuss them in Section 3.4, where we present our vision of the future, because their adoption is crucial for certifying autonomous systems. Besides introducing them, in Section 3.4 we compare them along the five dimensions of inputs, outputs, strengths, weaknesses, and applicability with respect to our reference three-layer framework presented in Section 3.1.
As observed in surveys (e.g., [5, 53]), other approaches to software certification have been exploited, besides (semi-)formal methods. Among them, ‘assurance cases’ were proposed as a viable way to certify safety-critical applications.
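To make the idea of annotating code with preconditions and postconditions concrete, here is a minimal sketch. It only checks the annotations at run time, whereas the work cited above demonstrates that the code satisfies its properties; the decorator `contract`, the exception `ContractViolation` and the function `clamp_speed` are hypothetical names introduced for illustration.

```python
from functools import wraps


class ContractViolation(Exception):
    """Raised when a precondition or postcondition does not hold."""


def contract(pre, post):
    """Attach a precondition and a postcondition to a function."""
    def decorate(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if not pre(*args, **kwargs):
                raise ContractViolation(f"precondition of {func.__name__} violated")
            result = func(*args, **kwargs)
            if not post(result, *args, **kwargs):
                raise ContractViolation(f"postcondition of {func.__name__} violated")
            return result
        return wrapper
    return decorate


@contract(
    pre=lambda requested, limit: limit >= 0,
    post=lambda result, requested, limit: 0 <= result <= limit,
)
def clamp_speed(requested, limit):
    """Return the requested speed, clamped to the interval [0, limit]."""
    return max(0, min(requested, limit))


if __name__ == "__main__":
    print(clamp_speed(120, 50))  # within contract: clamped to the limit
    try:
        clamp_speed(10, -1)      # negative limit breaks the precondition
    except ContractViolation as err:
        print(err)
```

A static verifier would aim to show that the postcondition follows from the precondition for all inputs; the run-time check above can only detect violations on the inputs that are actually exercised.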
An assurance case is an organised argument that a system is acceptable for its intended use with respect to specified concerns (such as safety, security, correctness).
Rinehart et al. analysed 82 works published between 1994 and 2016, concluding that transportation, energy, medicine, and military applications are the areas where assurance cases are most widely applied.
The adoption of assurance cases is rapidly spreading both in academic works and in industry: in their recent work, Denney E. and Pai G. present the AdvoCATE toolset for assurance case automation developed by NASA, and overview more than twenty research and commercial tools suitable for creating structured safety arguments using Claims-Argument-Evidence (CAE) notation and/or Goal Structuring Notation (GSN) diagrams.
As a final remark, we observe that software certification is so challenging that the adoption of the same method or tool across different domains is often impossible. Hence, many domain-dependent proposals exist, such as for robotics, medical systems, the automotive sector [15, 237], unmanned aircraft [219, 225], and railway systems.
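The structure of an assurance case, a top-level claim decomposed into sub-claims that are ultimately backed by evidence, can be sketched as a small tree. This is a deliberately simplified illustration and implements neither CAE nor GSN; the class `Claim`, its methods and the example claims are assumptions made for the sketch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Claim:
    """A claim supported either by direct evidence or by its sub-claims."""
    statement: str
    evidence: List[str] = field(default_factory=list)
    subclaims: List["Claim"] = field(default_factory=list)

    def is_supported(self) -> bool:
        # A decomposed claim holds only if every sub-claim holds;
        # a leaf claim holds only if some evidence is attached.
        if self.subclaims:
            return all(sub.is_supported() for sub in self.subclaims)
        return bool(self.evidence)

    def unsupported(self) -> List[str]:
        """Statements of all leaf claims that still lack evidence."""
        if self.subclaims:
            return [s for sub in self.subclaims for s in sub.unsupported()]
        return [] if self.evidence else [self.statement]


# Hypothetical example case with one gap in the argument.
case = Claim(
    "The controller is acceptably safe for its intended use",
    subclaims=[
        Claim("Identified hazards are mitigated", evidence=["hazard log"]),
        Claim(
            "The software meets its safety requirements",
            subclaims=[
                Claim("Requirements are verified", evidence=["verification report"]),
                Claim("Residual defects are acceptably rare"),
            ],
        ),
    ],
)

if __name__ == "__main__":
    print(case.is_supported())
    print(case.unsupported())
```

Tools of the kind surveyed above add much more (argument strategies, context, assumptions, traceability to artefacts), but the core idea of locating unsupported leaves in a structured argument is the same.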
2.3 Open Issues in the Certification of Autonomous Systems
Current standards and regulations are not ready for coping with autonomous systems that may raise safety issues, and hence need to undergo a formal process to be certified. One main issue in their adoption is the format in which standards are currently specified: textual descriptions in natural language. The second issue is the lack of consideration, and sometimes even of clear understanding, of the ‘autonomy’ and ‘uncertainty’ notions. Sections 2.3.1 and 2.3.2 discuss these two issues, respectively.
2.3.1 Certifying Systems against Textual Descriptions and System Runs
Let us suppose that the approach followed for the certification process is based on verification. Verifying – either statically or dynamically – scientific and technical requirements of complex and autonomous software applications is far from being an easy task but, at least, formalisms, methodologies and tools for representing and processing such requirements have been studied, designed and implemented for years, within the formal methods and software engineering communities.
When requirements have a legal or even ethical connotation, such as the standards discussed in Section 2.1, their verification may be harder, if not impossible. Such ‘legal requirements’ are written in natural language: in order to verify that a system complies with them, a step must be made to move from the natural language to a formal one.
The literature on this topic is vast, but running existing algorithms on existing standards, and expecting to get a clean, consistent, complete formal specification ready to be verified, is a hopeless task. For example, ARSENAL converts natural language requirements to formal models in SAL, a formal language for specifying transition systems in a compositional way, and in LTL. Although equipped with a powerful analysis framework based on formal methods, and despite its ability to generate a full formal model directly from text, ARSENAL has documented limitations when dealing, for example, with different ways to express negation and with collocations like ‘write access’. Also, looking at the paper, the rules used to feed ARSENAL seem to follow a simple and regular pattern, with ‘if’ and ‘when’ conditions clearly defined.
Other works address similar problems in the software engineering research area [57, 188, 238], in the agricultural regulation domain, and – to some extent – in the digital forensics field, but the results are far from being applicable to complex, unstructured, heterogeneous standard specifications.
Process mining is an emerging discipline aimed at discovering precise and formal specifications of processes, based on data generated by instances of those processes. It builds on process model-driven approaches and data mining. There are many ways business processes can be represented using formal languages. Most of them are inspired by Petri Nets, but there are also proposals for formalisms based on LTL that could be directly used to feed a model checker or a runtime monitor. However, in order to certify the system, the scientists in charge of the certification process would need:
logs of real executions of the system, to mine a precise representation of its functioning (namely, a model of the system’s behaviour),
properties that the process must verify, either represented in a logical form or translated into a logical form from a natural language description, using the techniques presented above, and
a model checker, for checking that the properties are verified by the system’s model.
Even if all three items above were available, the certification would just state that the observed real executions from which the model was extracted met the properties. Nothing can be stated about all the other possible runs of the system that have not yet been observed. The main challenge raised by this scenario is in fact that, being data-driven, the mined model only covers the already observed situations. It is an approximate specification of how a system behaves in some normal operational scenarios meeting the rules, and in some scenarios where rules are broken.
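The limitation described above can be made concrete with a minimal sketch. It assumes a deliberately naive mining step (a directly-follows relation rather than a Petri Net or LTL-based formalism), and the log data, event names, and the functions `mine_model`, `model_satisfies`, and `covered_by_model` are all hypothetical illustrations, not part of any process mining tool.

```python
from collections import defaultdict

def mine_model(logs):
    """Mine a directly-follows model: each event maps to the events seen right after it."""
    model = defaultdict(set)
    for trace in logs:
        for current, following in zip(trace, trace[1:]):
            model[current].add(following)
    return dict(model)

def model_satisfies(model, forbidden):
    """Check the property 'event `bad` never directly follows event `trigger`' on the model."""
    trigger, bad = forbidden
    return bad not in model.get(trigger, set())

def covered_by_model(model, trace):
    """True only if every step of the trace was already observed in the logs."""
    return all(b in model.get(a, set()) for a, b in zip(trace, trace[1:]))

# Hypothetical logs of a braking controller (illustrative data only).
logs = [
    ["cruise", "obstacle", "brake", "stop"],
    ["cruise", "obstacle", "brake", "cruise"],
]
forbidden = ("obstacle", "accelerate")

model = mine_model(logs)
property_holds = model_satisfies(model, forbidden)   # holds for the mined model
unseen_run = ["cruise", "obstacle", "accelerate"]
run_covered = covered_by_model(model, unseen_run)    # this run was never observed

print(property_holds, run_covered)  # True False
```

The property holds on the mined model only because no logged execution contained the forbidden step; the check says nothing about whether the real system could still produce `unseen_run`.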
At the current state of the art, certifying large and complex (autonomous) systems against standards based on textual descriptions and system runs is out of reach, and not only because of scientific and technical obstacles: current regulations are indeed not appropriate for autonomous systems. We note the breadth of (mainly academic) work tackling formal methods for (autonomous) robotic systems and would expect this to impact upon regulation and certification in the future, bringing them into line with developments in the autonomous systems area.
2.3.2 Dealing with Autonomy and Uncertainty
The standards and regulatory frameworks described in Section 2.1 essentially apply to existing systems, but lack some aspects we would expect of future, more complex and autonomous systems. The first issue is uncertainty, the second is autonomy. Let us deal with each in turn.
Current approaches to certification and regulation often assume that
there is a finite set of potential hazards/failures,
that these can all be identified beforehand, and
that this finite set will not change over the lifetime of the system.
If all the above are true then a risk/mitigation based approach can be used since we know what problems can occur.
However, as we move to much more complex environments where we cannot predict every (safety) issue then the above assumptions become problematic. And then, as we provide more AI components, such as online learning modules, we are not only unsure of what the environment will look like but also unsure of what behaviours our system will have (since it might have learnt new ones). All these issues pose severe problems for the current techniques for identifying hazards/faults, assessing risk/mitigation, and building safety-cases.
In more sophisticated systems, such as a domestic robotic assistant with healthcare and social responsibilities, improved ways of regulating such systems will likely have to be constructed. Without such improvements, the existing approaches will impose the above assumptions, stifling application in all but the most static environments.
A second issue is that the concept of ‘autonomy’ is not well understood in existing standards/regulations. The standards mentioned so far regulate the requirements, behaviour, and development process of complex and sophisticated systems. These systems may show some degree of autonomy, but that is not their most characterising feature. The standards are neither driven nor strongly influenced by it. Indeed, the issue of ‘autonomy’ has been conspicuously absent from most existing standards, as well as the ethical issues that it raises. There are only a few, very recent exceptions.
In 2016, the British Standards Institution (BSI) developed standards on ethical aspects of robotics. The BS 8611 standard provides a guide to the Ethical Design and Application of Robots and Robotic Systems. As stated in its overview:
BS 8611 gives guidelines for the identification of potential ethical harm arising from the growing number of robots and autonomous systems being used in everyday life.
The standard also provides additional guidelines to eliminate or reduce the risks associated with these ethical hazards to an acceptable level. The standard covers safe design, protective measures and information for the design and application of robots.
The new standard builds on existing safety requirements for different types of robots, covering industrial, personal care and medical.
While the BSI feeds into ISO standards, the above ethical standard has not yet been adopted by ISO.
In a large, international initiative, the IEEE, through its Global Initiative on Ethics of Autonomous and Intelligent Systems, has begun to develop a range of standards tackling autonomy, ethical issues, transparency, data privacy, trustworthiness, etc. These standards are still in their early stages of development; Table 6 provides references to those which are more closely related to autonomous systems. The year reported in the table is the Project Authorisation Request (PAR) approval date.
| Standard | Title | PAR approval year | Scope |
|---|---|---|---|
| IEEE P7000 | Model Process for Addressing Ethical Concerns During System Design | 2016 | Process model by which engineers and technologists can address ethical consideration throughout the various stages of system initiation, analysis and design. |
| IEEE P7001 | Transparency of Autonomous Systems | 2016 | This standard describes measurable, testable levels of transparency, so that autonomous systems can be objectively assessed and levels of compliance determined. |
| IEEE P7002 | Data Privacy Process | 2016 | Requirements for a systems/software engineering process for privacy oriented considerations regarding products, services, and systems utilising employee, customer or other external user’s personal data. |
| IEEE P7003 | Algorithmic Bias Considerations | 2017 | Specific methodologies to help users certify how they worked to address and eliminate issues of negative bias in the creation of their algorithms, where ‘negative bias’ infers the usage of overly subjective or uninformed data sets or information known to be inconsistent with legislation or with instances of bias against groups not necessarily protected explicitly by legislation. |
| IEEE P7006 | Standard for Personal Data Artificial Intelligence (AI) Agent | 2017 | Technical elements required to create and grant access to a personalised Artificial Intelligence (AI) that will comprise inputs, learning, ethics, rules and values controlled by individuals. |
| IEEE P7007 | Ontological Standard for Ethically Driven Robotics and Automation Systems | 2017 | The standard establishes a set of ontologies with different abstraction levels that contain concepts, definitions and axioms which are necessary to establish ethically driven methodologies for the design of Robots and Automation Systems. |
| IEEE P7008 | Standard for Ethically Driven Nudging for Robotic, Intelligent and Autonomous Systems | 2017 | ‘Nudges’ as exhibited by robotic, intelligent or autonomous systems are defined as overt or hidden suggestions or manipulations designed to influence the behaviour or emotions of a user. This standard establishes a delineation of typical nudges (currently in use or that could be created). |
| IEEE P7009 | Standard for Fail-Safe Design of Autonomous and Semi-Autonomous Systems | 2017 | Practical, technical baseline of specific methodologies and tools for the development, implementation, and use of effective fail-safe mechanisms in autonomous and semi-autonomous systems. |
Many efforts in the “ethics of autonomous systems” research field converged in the Ethically Aligned Design document released in 2019: the document is the result of an open, collaborative, and consensus building approach led by the IEEE Global Initiative. While not proposing any rigorous standard, it makes recommendations on how to design ‘ethics aware’ so-called ‘autonomous and intelligent systems’ (A/IS), and provides reasoned references to the IEEE P70** standards and to the literature.
To give an example, one of the eight general principles leading the A/IS design is transparency – the basis of a particular A/IS decision should always be discoverable. The associated recommendation is as follows.
“A/IS, and especially those with embedded norms, must have a high level of transparency, from traceability in the implementation process, mathematical verifiability of its reasoning, to honesty in appearance-based signals, and intelligibility of the system’s operation and decisions.” [216, Page 46]
While this represents a very good starting point towards agreeing on which behaviour an A/IS should exhibit, certifying that an A/IS has a high level of transparency, based on the recommendation above, is not possible. Moving from well known and clear rules written in natural language to their formal counterpart is hard, and formalising recommendations is currently out of reach, as discussed in Section 2.3.1.
3 Ways Forward
What is the way forward? There are a number of elements that we can bring together to address and support regulatory development. These span across:
architectural/engineering issues — constructing an autonomous system in such a way that it is amenable to inspection, analysis, and regulatory approval,
requirements/specification issues — capturing exactly how we want our system to behave, and what we expect it to achieve, overcoming the difficulties arising when human-level rules do not already exist, and
verification and validation issues — providing a wide range of techniques, across different levels of formality, that can be used either broadly across the system, or for specific aspects.
This second item is particularly important: if we do not know what is expected of the system, then how can we verify it? In traditional systems, the expected behaviour of the human component in the overall system, be they a pilot, driver, or operator, is often under-specified. There is an assumption that any trained driver/pilot/operator will behave professionally, yet this is never spelled out in any system requirement. Then, when we move to autonomous systems, where software takes over some or all of the human’s responsibilities, the exact behaviour expected of the software is also under-specified. Consequently, we need greater precision and a greater level of detail from regulatory authorities and standards.
This section presents an outline for a way forward, covering the three elements. Firstly, a key (novel) feature of the approach proposed is a three-layer framework (Section 3.1) that separates dealing with rule-compliant behaviour in ‘normal’ situations from dealing with abnormal situations where it may be appropriate to violate rules. For example, a system might consider driving on the wrong side of the road if there is an obstacle in its way and it is safe to use the other lane. Secondly, we consider what we need from regulators (Section 3.2) and define a process for identifying properties to be verified by considering how humans are licensed and assessed (Section 3.3). Thirdly, we review existing verification techniques (Section 3.4), including their strengths, weaknesses, and applicability.
3.1 A Reference Three-Layer Autonomy Framework
In order to distinguish the types of decisions made by autonomous systems, we present a reference three-layer framework for autonomy in Figure 6. We use ‘framework’ rather than ‘architecture’ for two reasons. Firstly, to avoid confusion with an existing (but different) three-layer architecture for robots. Secondly, because this framework may not be realised in terms of a software architecture that follows the same three layers. The framework brings together previous work on:
The separation of high-level control from low-level control in systems architectures.
This is a common trend amongst hybrid systems, especially hybrid control systems, whereby discrete decision/control is used to make large (and discrete) step changes in the low-level (continuous) control schemes.
The identification and separation of different forms of high-level control/reasoning.
Separate high-level control or decision making can capture a wide range of different reasoning aspects, most commonly ethics [16, 41] or safety. Many of these high-level components give rise to governors/arbiters for assessing options or runtime verification schemes for dynamically monitoring whether the expectations are violated.
The verification and validation of such architectures as the basis for autonomous systems analysis.
Fisher, Dennis, and Webster use the above structuring as the basis for the verification of autonomous systems. By separating out low-level control and high-level decision making, diverse verification techniques can be used and integrated. In particular, by capturing the high-level reasoning component as a rational agent, stronger formal verification can be carried out, covering not just ‘what’ the system will do and ‘when’, but also ‘why’ it chooses to do it, hence addressing the core issue with autonomy.
Our reference three-layer autonomy framework consists of:
Reactions Layer — involving adaptive/reactive control/response aspects essentially comprised of low-level (feedback) interactions — behaviour is driven by the system interacting with its environment [e.g., ‘autopilot’],
Rules Layer — involving specific, symbolically-represented descriptions of required behaviours — these behaviours are tightly constrained by rules [e.g., ‘rules of the air’],
Principles Layer — involving high-level, abstract, sometimes philosophical, principles, often with priorities between them — here specific behaviour is not prescribed but principles that can be applied to new/unexpected situations are provided [e.g., ‘airmanship’].
We here split the high-level reasoning component further, into rule-following decisions and decisions based on principles (such as ethics). We distinguish these in that the former matches the required rules/regulations that the system should (normally) abide by while the latter is a layer comprising reasoning processes that are invoked when exceptional/unanticipated situations arise (and for which there are no prescribed regulations).
The key novelty here is the distinction between the normal operation where rules are followed (Rules Layer), and (unusual) situations where the autonomous agent needs to reason about whether to violate rules, using, e.g., ethical reasoning (Principles Layer).
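As an illustration only, the separation between the three layers can be sketched in a few lines of Python. The sketch uses the wrong-side-of-the-road example from the start of this section; the `Situation` fields, the action names, and the three functions are hypothetical, and, as noted above, a real system need not be realised as a software architecture with these same three layers.

```python
from dataclasses import dataclass

@dataclass
class Situation:
    obstacle_in_lane: bool
    other_lane_clear: bool

def rules_layer(s):
    """Rules Layer: returns a rule-compliant action, or None if the rules cannot resolve the situation."""
    if not s.obstacle_in_lane:
        return "keep_lane"
    return None

def principles_layer(s):
    """Principles Layer: invoked only in exceptional situations; may justify violating a rule."""
    if s.other_lane_clear:
        return "use_other_lane"  # rule violation judged safe
    return "stop"                # otherwise safety takes priority over progress

def reactions_layer(action):
    """Reactions Layer: stand-in for the low-level feedback control that executes a decision."""
    return f"executing:{action}"

def decide(s):
    """Try the Rules Layer first and fall back to the Principles Layer."""
    action, layer = rules_layer(s), "rules"
    if action is None:
        action, layer = principles_layer(s), "principles"
    return layer, reactions_layer(action)

print(decide(Situation(obstacle_in_lane=False, other_lane_clear=True)))
print(decide(Situation(obstacle_in_lane=True, other_lane_clear=True)))
```

Recording which layer produced each decision is a design choice made here so that rule-following decisions and justified rule violations can be inspected separately.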
3.2 What is Needed from Regulators
Currently, in most countries regulation responsibilities are distributed between legislation, providing the general framework within which an autonomous system is allowed to operate, and public authorities, which are responsible for providing detailed rules and for supervising the conformance of systems to these rules. In this paper, we focus on rule-making by regulatory agencies. That is, we do not discuss legal responsibilities of the designers, producers, owners, and operators of an autonomous system. We are concerned with behavioural aspects of such systems, and questions arising for regulatory bodies from the increasing autonomy.
In Section 2.3 we discussed the use of standards for verification, concluding that current approaches to certification and regulation are not adequate for verification of autonomous systems. In this section we briefly consider what would be needed from regulators in order to allow the standards to be used to verify autonomous systems.
A key issue is that current standards are not in a form that is amenable to formalisation and assessment of software, since they are oriented solely towards use by humans. One way in which regulations are oriented towards humans, and do not readily support regulation of software, is that regulations are framed declaratively: a collection of statements that require substantial (human) interpretation. Another is that the regulations implicitly assume, and take for granted, human capabilities and attitudes. In order to certify autonomous software we need the scope of regulation to include not just low-level physical operation and sensors, but also higher-level decision-making. Finally, it would also be desirable for the plethora of relevant standards to be rationalised and consolidated. Consequently, it may be desirable to develop separate (new) standards for the assessment of software systems (e.g., software autopilots). At a high level, regulations should answer the following questions.
What does it mean for the system to be reliable/safe? The answer to this question is a set of specifications, or the union of the following:
What are the regulations the system must obey? For example, the automated air traffic control system must always send a resolution to avoid two planes getting too close to each other whenever this is a possibility.
What emergent behaviours are expected? For example, the automated air traffic control system should keep the airspace free of conflicts.
What would be bad? For example: the assistive robot should never cause harm to a human; the Therac-25 should never deliver radiation to a patient when it was not activated by hospital staff; and the automated air traffic control system should never instruct two planes to collide with each other. (The Therac-25 was a computer-controlled radiation therapy machine, involved in at least six accidents between 1985 and 1987, where patients were given radiation doses hundreds of times greater than normal, resulting in death or serious injury.) These are often assumptions that can be hard to list. They are also negative regulations, i.e., complying with these is implicit in the regulations. We need to explicitly state them to enable robust verification efforts. Certification of autonomous systems goes in two directions: we need to know both that the system does what we want and that the system does not do what we do not want. This is particularly important for autonomous systems since the ‘obvious’ things that a human operator would know to never do tend to not be explicitly captured, but can be behaviours that a system should (or must) avoid.
How ‘busy’ will the system be? The answer to this question can be in the form of minimum/maximum throughputs, real-time bounds, or other measures. Essentially, the specifications need some environmental context. For example, the automated air traffic control system may vacuously assert that it can always keep the airspace free of conflict by grounding all aircraft, or limiting the number of flights. However, if the specification includes an indication of the minimum level of traffic that is expected (e.g., all flight take-off requests must be granted within a reasonable time bound modulo specific exceptions), then this can prevent the autonomous system from setting such inappropriate limits or learning undesirable behaviours. Such information, provided by regulators, might include bounds on how many aircraft need to be able to fly in the airspace, the maximum allowable wait time to be cleared to fly given safe environmental conditions, etc.
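The vacuous-satisfaction problem above can be shown with a small sketch. It assumes a toy departure schedule in which `None` means a flight is never cleared; the flight names, times, separation value, and wait bound are hypothetical, as are the functions `conflict_free` and `meets_throughput`.

```python
def conflict_free(schedule, min_separation):
    """Safety: any two granted departures are at least min_separation minutes apart."""
    times = sorted(t for t in schedule.values() if t is not None)
    return all(b - a >= min_separation for a, b in zip(times, times[1:]))

def meets_throughput(requests, schedule, max_wait):
    """Environmental context: every take-off request is granted within max_wait minutes."""
    return all(
        schedule.get(flight) is not None and schedule[flight] - requested <= max_wait
        for flight, requested in requests.items()
    )

# Hypothetical take-off requests: flight -> request time in minutes.
requests = {"A1": 0, "B2": 0, "C3": 5}
ground_all = {flight: None for flight in requests}   # never clears anyone
staggered = {"A1": 0, "B2": 3, "C3": 6}

results = {
    name: (conflict_free(s, min_separation=3), meets_throughput(requests, s, max_wait=10))
    for name, s in [("ground_all", ground_all), ("staggered", staggered)]
}
print(results)
# {'ground_all': (True, False), 'staggered': (True, True)}
```

Grounding every aircraft passes the safety check trivially, and only the throughput bound exposes it as an unacceptable strategy.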
Finally, specifications need to be compositional: they can be low-level and apply to one particular software routine or high-level and apply to the high-level architecture of the system. Because verification efforts are organised compositionally, as is the coordination of safety cases, there is a need to organise and divide the above list of types of specifications for each level/system component.
3.3 A Process for Identifying Requirements for Certification
We now present a simple process that can be used to provide guidance in identifying properties that need to be specified as verification properties for certification. The key idea is that, if the autonomous system is performing tasks that are currently done by humans, then knowledge about how these humans are currently licensed can be used to help identify requirements. So, if the humans currently performing the task require some form of licensing (e.g., driver’s license, pilot’s license, medical license, engineering certification, therapy certificate, etc.), then carefully considering what the licensing process assesses and, then, how this might be assessed for an autonomous software system, would move a step towards their certification.
A motivating insight is that domains most likely to require (or benefit from) regulation and certification of autonomous agents are those domains where humans are very likely to have to be appropriately certified.
One challenge is that software and humans are very different in their abilities. Certain assumed characteristics of humans, such as common sense, or self-preservation, will need to be explicitly considered and assessed for software, even though they may not be assessed at all for humans. But even when a characteristic of the humans is assessed as part of a licensing regime, it may well need to be assessed in a different way for an autonomous software system. For example, a written exam to assess domain knowledge may work for humans, since limited memory requires the human to be able to reason about the domain to answer questions, but would not work for a software system that could merely memorise knowledge without being able to apply it.
We consider four key areas:
the licensing that is used for humans;
the assumed human capabilities (often unassessed) that are relevant;
the relevant laws and regulations, and what justifiable deviations might exist; and
the interface that artefacts (e.g., a cockpit) used by humans (and hence to be used by autonomous software systems replacing humans) presents.
We now discuss these aspects in turn, beginning with licensing.
We now consider some of the qualities that licensing might assess, and for each indicate how we might assess this quality for an autonomous software system that is replacing a human:
Physical capabilities (e.g., can execute the sequence of fine-tuned adjustments required to land a plane) – this can be assessed for autonomous software by simulation, and assessing specific component sub-skills.
Domain knowledge (e.g., does the human know all of the protocols for safe operation, how to read and interpret domain-specific updates like Notices to Airmen (NOTAMs)) – for an autonomous software system, this would need to assess the ability to apply (operationalise) the domain knowledge in a range of scenarios designed to require this domain knowledge in order to behave appropriately. Note that this assessment could be in the form of a test (running some scenarios and observing the resulting behaviour), or in the form of formal verification (showing that certain properties always hold). (A NOTAM is “a notice distributed by means of telecommunication containing information concerning the establishment, condition or change in any aeronautical facility, service, procedure or hazard, the timely knowledge of which is essential to personnel concerned with flight operations.”)
Regulatory knowledge (e.g., does the human know all of the rules, such as restrictions on flying in different classes of airspace) – this can be tested similarly to domain knowledge.
Ethical normalisation (e.g., does the human understand the importance assigned to the hierarchy of regulations from the regulatory body such as the FAA). An example would be that if an Unmanned Aerial System (UAS) is going to crash, the remote human operator needs to understand that it is better to clearly communicate to the appropriate air traffic control authority that the UAS will violate a geofence surrounding an unpopulated area and proceed to do that. The alternative, obeying the geofence but crashing the UAS into a populated area, is not acceptable – for autonomous software, one could verify certain priorities, if the reasoning is explicitly represented in a decision-making component, or assess performance in scenarios designed to present these sorts of ethical challenges.
Assumed human capabilities:
Various characteristics of humans may not be assessed at all, but simply assumed, since they are sufficiently universal, and, for some characteristics, it is very clear if they are absent (e.g., a human lacking physical mobility (lacking requirements for physical capabilities), or being a child (lacking requirements for advanced ethical normalisation) would be clear without requiring explicit assessment). Specifically, in considering assessment of autonomous software systems, we would want to carefully consider what human characteristics are required and assumed to hold, without any assessment. Typical questions might include:
For example, we assume a pilot knows that, when flying a passenger plane, the passengers require a certain amount of time to fasten seat belts.
Does the human need assumed (but untested) physical capabilities? For example, a pilot can sense when something is wrong from certain sounds or vibrations in a plane that may not be present in simulations or ground tests for certification.
Does the human need a certain level of maturity or life experience? For example, a sound may be readily identifiable as ‘bad’ even if the pilot has never heard it previously.
Does it assume basic properties of human values/ethics, such as that the pilot would never crash the plane on purpose because the pilot has a strong inclination toward self-preservation? Does it assume an operator would never choose to hurt others?
On the other hand, certain human characteristics that need to be assessed when certifying humans may not need to be considered when certifying software. For instance, psychological state, or personality aspects (such as being aggressive, having a big ego and therefore being over-sensitive to criticism of flying ability, or being impulsive) should not need to be considered for autonomous agents.
Often licensing includes testing for knowledge of relevant laws and regulations. We consider legal factors separately because this is a key source of specifications, and because the licensing assessment may not cover all relevant regulations and laws.
As per the three-layer framework, we need to identify not just the ‘by the book’ rules (e.g., regulations and laws). Rather, we also need to consider situations where these rules may need to be (justifiably) overridden, and the principles that would need to be used to make such decisions. The key questions are: in what situations should the rules be overridden? How can the system identify these situations? And how can the system decide what to actually do in these situations? More specific questions to be considered include:
What situations could lead to outcomes considered to be ‘bad’, ‘unsafe’, ‘insecure’, or otherwise universally to-be-avoided if at all possible? And how bad is each of these? Are some worse than others? If there is a choice, is there any ranking or cost (possibly probabilistic) that can be associated with each? For example, if an autonomously-operating UAS is going to crash and must choose between crashing into a pile of trash on a curb or the car parked next to the pile of trash, the cost function would be expected to steer the crash landing toward the pile of trash. This could be defined in terms of minimising the repair cost of the affected property. One might furthermore define the cost of harming a human in a crash as higher than any situation where a human is not harmed.
Are some rules more important to obey than others? Are there divisions of the rules into hard constraints versus soft constraints? Is there some partial ordering on the importance of the rules?
Are there any acceptable reasons to break the rules?
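The questions above about rankings, costs, and soft constraints can be sketched as a small cost function. This is an assumption-laden illustration: the repair costs, the geofence penalty, the option names, and the functions `cost` and `choose` are hypothetical, and a real system would need regulator-supplied values and a far richer model.

```python
import math

# Hypothetical penalties for breaking soft rules (illustrative values only).
SOFT_RULE_PENALTY = {"geofence": 1000}

def cost(option):
    """Cost of an outcome: harming a human outranks every outcome in which no human is harmed."""
    if option["harms_human"]:
        return math.inf
    penalties = sum(SOFT_RULE_PENALTY[rule] for rule in option["breaks"])
    return option["repair_cost"] + penalties

def choose(options):
    """Pick the crash-landing option with the lowest cost."""
    return min(options, key=cost)

# Scenario 1: pile of trash versus the car parked next to it.
kerbside = [
    {"name": "trash_pile", "repair_cost": 50, "harms_human": False, "breaks": []},
    {"name": "parked_car", "repair_cost": 5000, "harms_human": False, "breaks": []},
]
# Scenario 2: obey the geofence and hit a populated area, or violate it.
geofenced = [
    {"name": "populated_area", "repair_cost": 0, "harms_human": True, "breaks": []},
    {"name": "unpopulated_area", "repair_cost": 200, "harms_human": False, "breaks": ["geofence"]},
]

print(choose(kerbside)["name"])   # trash_pile
print(choose(geofenced)["name"])  # unpopulated_area
```

Treating the geofence as a soft rule with a finite penalty, and human harm as an infinite cost, is one way to encode that some rules may justifiably be broken while some outcomes never become preferable.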
In order to develop a system that can meet the requirements, we need to also consider what are the computational requirements of the system. What does it need to be able to measure, deduce, or decide?
Note that context is often left unspecified but it importantly restricts the applicability of licensing. To take context into account, we identify the licensing requirements, and then, for each requirement, consider whether it is relevant for the system. For example, for an auto-pilot that is only used while the plane is at cruising altitude, we would not need to consider requirements related to landing, or interaction with ground obstacles, even though these are required for a human to earn a pilot’s license.
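The context-filtering step described above can be sketched as follows. The requirement names, the flight phases, and the `relevant` function are hypothetical and chosen only to mirror the cruising-altitude auto-pilot example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Requirement:
    name: str
    phases: frozenset  # flight phases in which the requirement applies

# Hypothetical licensing requirements for a human pilot (illustrative only).
LICENSING = [
    Requirement("maintain assigned altitude", frozenset({"cruise"})),
    Requirement("execute landing sequence", frozenset({"landing"})),
    Requirement("avoid ground obstacles", frozenset({"taxi", "takeoff", "landing"})),
    Requirement("respond to air traffic control",
                frozenset({"taxi", "takeoff", "cruise", "landing"})),
]

def relevant(requirements, operating_phases):
    """Keep only the requirements that apply in at least one phase the system operates in."""
    return [r.name for r in requirements if r.phases & operating_phases]

cruise_only = relevant(LICENSING, frozenset({"cruise"}))
print(cruise_only)
# ['maintain assigned altitude', 'respond to air traffic control']
```

Making the operating context an explicit input keeps the scoping decision visible, so a certifier can see which human licensing requirements were excluded and why.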
There is also a collection of similar factors that relate to the interface that a human currently uses to interact with artefacts. Such artefacts are designed for human use, and if a human performing the task is going to be replaced by an autonomous software system, then whether the existing interface presented by the artefact embodies assumptions about humans should be taken into account. Specifically:
Does the interface assume physical shapes of humans, such as being operated by a human hand?
Does it assume physical limitations of humans, such as having a minimum reaction time or a maximum speed for selecting multiple controls in sequence?
Does it assume mental limitations of humans, such as taking into account that a human cannot instantly take control of a system but requires orientation to the operational context to make reasonable decisions?
Does it assume that human-type faults will occur, such as being designed to avoid human confusion modes?
Does it assume that common sense deductions on the part of the operator are automatic? For example, a human pilot will automatically notice if the wing detaches from the aircraft, and there is no explicit requirement that the aircraft operator must continuously ensure the aircraft has both wings. An autonomous system, however, would need to be explicitly designed to consider this a fault (not just the instability it causes) and, e.g., avoid future control decisions that would only work if the aircraft had two wings. There is not a sensor for every ‘obvious’ fault, yet an autonomous system needs to account for all faults that are obvious to humans, even when they trigger fault warnings for unrelated errors. (This is not an artificial example: there was a case of an engine exploding, resulting in a large number of seemingly unrelated alerts. Thankfully, in that instance, the human pilot realised what had happened, and was then able to safely land the plane.)
There are three options to deal with a situation where a human is being replaced (partially or completely) with software, and the human interacts with an existing artefact using an interface. These options are: to retain the interface, and have the software interact with it, to extend the artefact with additional interfaces for the software, or to replace the artefact’s interface completely. The process for identifying requirements is summarised in Figure 7.
3.4 Verification of Autonomous Software Systems
In Section 2.2 we observed that the most solid and reliable way to certify autonomous software systems would be the adoption of formal methods.
Formal methods are mathematically rigorous techniques for design, specification, validation, and verification of a wide variety of systems. They contribute to certification in a number of ways, from requirements design, to checking that all requirements can be true of the same system at the same time (before the system is built) , to verifying that early (or even partial) designs always satisfy requirements, to generating test cases, to checking during system runtime that the system is still meeting requirements.
In order to verify that a system operates correctly, we need to know how the system works. If we do not have sufficient information on both how the system operates and what it means for the system to operate safely, then it is impossible to verify that these two behaviour spaces match up with each other. Knowing how the system works includes knowing sufficient implementation details to be able to certify correctness.
Though each method works differently, intuitively all formal methods verify that the system – or a model of the system – does what we expect it to do and nothing else. This capability is required for certification, e.g., via safety cases per the standards of the aerospace industry, where certification requires making a case, backed by evidence, that the system meets standards for safety and reliability.
In some formal methods, such as model checking and theorem proving, both the system under certification and the properties to be verified are modelled using a rigorous, mathematical or logical language. We indicate these methods as formal at the property & system level. The system model itself, however, is necessarily an abstraction of the real system and hence it is incomplete: any results from methods that operate, albeit in a rigorous way, on abstractions of real systems, should be validated against the actual environment. (Note that many modern model checkers and theorem provers are now capable of generating software from their proven models that can be used as a basis for, or as the entire, real system implementation thus enabling straightforward validation.) Other methods model the properties to check using rigorous formalisms and languages, and check the property against the real system under certification. We define these methods, which include some static analysis approaches and runtime verification, as formal at the property level. Semi-formal methods specify either the system or the properties that they should meet using languages with informal or semi-formal semantics. UML  and Agent-UML  are examples of semi-formal specification languages, as well as test cases. Informal methods are based on specifications provided in natural language and are out of the scope of our investigation.
In this section we review five verification methods: model checking, theorem proving, static analysis, runtime verification, and systematic testing. The first four of these are usually categorised as formal or semi-formal methods. Software testing is not a formal method, but it is one of the most widely adopted verification approaches in industry, often associated with quality assurance. The ‘Software Testing Services Market by Product, End-users, and Geography – Global Forecast and Analysis 2019-2023’  foresees that the software testing services market will grow at a compound annual growth rate of over 12% during the period 2019-2023. Software testing – once automated and seamlessly integrated in a development method, as in the ‘continuous integration model’ – may indeed represent a useful complement to formal methods. Testing can be measurably improved by using artefacts from formal verification for test-case generation. However, as is well known, testing is not exhaustive, which limits its utility.
A method is exhaustive when it verifies the entire behaviour space, over all possible inputs, including for an infinite input space. Exhaustive verification includes proving both the existence of ‘good’ or required behaviours and the absence of ‘bad’ or requirements-violating behaviours, something that can never be demonstrated through testing or simulation. Being exhaustive refers to the capability of exhaustively exploring the state space generated by the system model, as it is not currently possible, except for a small class of systems with specific characteristics, to be exhaustive with respect to a system of realistic industrial size.
A method is static when it is applied on the system’s code, or on its abstract/simplified model, without needing the real system to be executed. A method is dynamic when it operates while the system runs.
Model checking performs an exhaustive check that a system, given in a logical language, always satisfies its requirements. Intuitively, model checking explores all possible executions of the system, checking that the system model cannot reach states that violate the provided verification properties.
Outputs: (1) an automated (push-button) proof (usually by assertion, without the full proof argument) that the system always satisfies the requirement; or (2) a counterexample, which is an execution trace of the system from system initialisation that shows a step-by-step path to a system state where the system has operated faithfully to its definition yet still violated the requirement.
Automated. Counterexamples are excellent maps for debugging and are generated without guidance from the user.
Exhaustive. Verification reasons over the entire state space.
Verification of Absence of Bad Behaviours. Model checking provides both a proof that the system always does what we expect given the inputs and a proof that the system never does something we don’t expect for any input. In other words, given an appropriate specification, it verifies the absence of behaviours that we don’t want, whether they are triggered by an input or no input at all.
Incremental. Model checking can efficiently analyse partial models/partial systems; can be used from the earliest stages of the design process to save time pursuing designs that cannot satisfy their requirements based on the partial design-so-far.
Weaknesses: Model checking is garbage-in, garbage-out; the results are only as accurate as the input system/model and the input requirement. One approach for validating the results of model checking against the actual environment by recognising assumption violations at runtime has been proposed in [80, 81], but the example addressed there is simpler than any real autonomous system. Modelling a system in a logical language and specifying a requirement in a temporal logic are the biggest bottlenecks to the use of model checking ; both tasks are difficult to complete and to validate. It is easy to create models that are too large to analyse with current technology, such as large pieces of software or models over large numbers of variables. Not all logical systems can be specified in languages that can be model-checked.
Applicability: model checking is widely used for both hardware and software, such as that seen in the Rules Layer [18, 51, 52, 98]. System protocols are particularly amenable to model checking; it has successfully verified real-life, full-scale communication protocols [175, 67], air traffic control systems [239, 168, 87], wheel braking systems, and more. Model checking has been used for the Reactions Layer, but quickly reduces to either hybrid model checking or the verification of coarse abstractions [3, 6, 95, 150]. Hybrid model checking can involve a wide array of numerical techniques aimed at solving, for example, differential equations concerning control or environmental models. For the Principles Layer, model checking has been used for verifying BDI agents [35, 36, 37], epistemic and temporal properties of multiagent systems [156, 157, 180], agents’ knowledge, strategies and games, and general, high-level decision making that has been extended to ethical principles [41, 63].
We note one important variation, program model checking. Per its name, program model checking replaces the abstract model with the actual code/software system. It uses the same model-checking algorithms, but on symbolic executions of the system. The advantage of such an approach is that it simplifies model validation (e.g., proving the link between the model and the real program/system) since we directly check the real program. However, a disadvantage is that the use of techniques such as symbolic execution can make the checking process slow and complex. Finally, note that the Principles Layer model checking of high-level, and ethical, decision-making described above [62, 63, 41] actually utilises program model checking over rational agents [64, 61].
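To illustrate the two possible outputs of model checking described above (a verdict that the requirement always holds, or a counterexample trace from the initial state), the following is a minimal explicit-state sketch in Python. It is not how industrial model checkers are implemented; the function `check_invariant` and the toy crossing-light models `faulty_successors` and `safe_successors` are hypothetical names introduced only for this illustration, and the requirement is restricted to a simple invariant rather than a full temporal logic formula.

```python
from collections import deque


def check_invariant(initial, successors, invariant):
    """Breadth-first search over every reachable state of the model.

    Returns (True, None) if all reachable states satisfy the invariant,
    or (False, trace), where trace is a shortest path from the initial
    state to a violating state.
    """
    parent = {initial: None}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            trace = []
            while state is not None:
                trace.append(state)
                state = parent[state]
            return False, list(reversed(trace))
        for nxt in successors(state):
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return True, None


FLIP = {"red": "green", "green": "red"}


def faulty_successors(state):
    """Toy model: either light may change colour at any time."""
    ns, ew = state
    return [(FLIP[ns], ew), (ns, FLIP[ew])]


def safe_successors(state):
    """Toy model: a light may only turn green while the other is red."""
    ns, ew = state
    nxt = []
    if ns == "green":
        nxt.append(("red", ew))
    elif ew == "red":
        nxt.append(("green", ew))
    if ew == "green":
        nxt.append((ns, "red"))
    elif ns == "red":
        nxt.append((ns, "green"))
    return nxt


def never_both_green(state):
    return state != ("green", "green")


start = ("red", "red")
print(check_invariant(start, faulty_successors, never_both_green))
print(check_invariant(start, safe_successors, never_both_green))
```

For the faulty model the search returns a three-state counterexample ending in both lights green; for the corrected model it reports that the invariant holds over all reachable states. The sketch also hints at the scalability weakness noted above: the `parent` dictionary grows with the number of reachable states.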
In theorem proving, a user creates a proof that a system satisfies its requirements, which is then checked by a known-to-be-correct proof assistant. Some automation of standard logical deductions is included in the proof assistant.
Inputs: A theorem asserting a requirement holds over the system; the same large proof includes relevant definitions of the system and (sub-) requirements and any necessary axioms.
Outputs: A computer-checked proof that the theorem holds. If no such proof can exist (e.g., because the theorem does not hold), the user might gain the necessary insight into why this is the case from the way in which the proof fails.
Strengths: Theorem proving can provide strong type-checking and generate proof obligations that provide unique insight into system operation. The proofs generated are re-playable: they can be re-run by any future user at any time. If the input theorem changes (e.g., to perform a proof over the next version of the system) the old proof can be replayed (in some contexts) and only prompt the user for input where small changes are needed to update the proof to the new theorem. Proofs can be built up in extensible libraries and hierarchically inherit from/build upon each other. Large swaths of mathematics important to assertions over safety-critical systems have been proved in free libraries associated with the most popular theorem proving tools. Theorem proving can check, and has checked, very large systems and proofs, even beyond what humans can inspect, such as the four-colour theorem.
Weaknesses: This formal method requires the most experience from users and has a comparatively high learning curve relative to other forms of verification. Setting up a proof correctly is a difficult activity even for trained experts. The user is the sole source of insight into why proofs fail; there is no automated counterexample like with model checking.
Applicability: Theorem-proving has been widely used for software seen in the Rules Layer, and techniques such as Hoare Logic  and tools such as Isabelle  are often used there. Theorem-proving has been used for the Reactions Layer, most notably verification of NASA’s aerospace systems via PVS [86, 183, 173, 174], verification of, e.g., automotive protocols through the KeYmaera system [186, 185], and the use of theorem-proving tools to prove control system stability. See  for an introduction to PVS specification for a Boeing 737 autopilot. Theorem-proving for the Principles Layer is less common, though it has been tackled.
During system design time, an automated tool examines a program or code fragment without executing it, to check for common bugs. Static analysis can be combined with techniques for expanding the types of bugs found, including symbolic execution (analysing a program’s control flow using symbolic values for variables to create a map of program execution with provable constraints), abstract interpretation (creating an abstraction of the program’s semantics that enables proofs about its execution), and shape analysis (mapping the program’s memory signature through its data structures).
Inputs: Program or code fragment to analyse
Outputs: List or visualisation of possible code defects
Strengths: Automated tools analyse the structure of the code rather than just paths through the code like testing; they can therefore provide a deeper analysis with better code coverage. Static analysis is faster and more reliable than the alternative of manual code review. Though performance depends on the programming language and static analysis tool, performing static analysis is usually totally automated with easy-to-understand output suitable for a non-expert in verification.
Weaknesses: Static analysers have no understanding of developer intent (e.g., they can detect that a function may access unallocated memory but will not find that the function does a different task than intended/labeled in the comments). They cannot enforce full coding standards, e.g., because some coding rules are open to interpretation, like those concerning comments, good documentation, and understandable variable names. The above ambiguities (in programmer intent, coding standards, use cases, etc.) can lead to false positive and false negative defect reports. A false negative means that the analyser has failed to identify an error, which can lead to misplaced trust in the software. A false positive means that the analyser reports an error where there is none, which can cause a lot of wasted work on the developer’s side.
Applicability: Static analysis covers a range of different methods. Basic checks, e.g., whether each used variable was previously initialised, are included in most modern compilers. More sophisticated checks, e.g., whether pointer access may lead to an exception, are realised by widely used tools such as Lint. High-end static analysers such as Polyspace  and Astrée  are able to check for general assertions. A review of 40 static code analysis tools updated to November 2019 is available via the Software Testing Help blog. According to CENELEC EN50128, static analysis is highly recommended for all safety-relevant software. Consequently, static analysis has been applied for most safety-critical software in avionics, railway and automotive; e.g., NASA regularly releases its own home-grown static analysis tools including (Java) Symbolic PathFinder, IKOS (Inference Kernel for Open Static Analyzers), and C Global Surveyor (CGS). Static analysis techniques generally analyze software in the Rules Layer. The Reactions Layer requires analysis techniques able to deal with continuous physical models, and the Principles Layer requires the ability to deal with potentially complex reasoning. Although static analysis can analyze autonomous systems’ software, its key weakness, being non-exhaustive, limits the utility of applying it.
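As an illustration of the most basic kind of check mentioned above (whether each used variable was previously initialised), the following sketch uses Python’s standard `ast` module to inspect a program without executing it. It is a deliberately limited toy, not a description of any of the tools cited: the function name `find_uninitialised_uses` is hypothetical, and the check only handles straight-line, top-level code, ignoring branches, loops, functions, and imports.

```python
import ast
import builtins


def find_uninitialised_uses(source: str) -> list[tuple[int, str]]:
    """Report (line, name) for names that are read before being assigned.

    Assumption: the program is straight-line, top-level code. Control
    flow, function definitions, and imports are out of scope here.
    """
    defined = set(dir(builtins))
    findings = []
    for stmt in ast.parse(source).body:
        names = [n for n in ast.walk(stmt) if isinstance(n, ast.Name)]
        # Reads in a statement are evaluated before its targets are bound,
        # so check all loads first and record the stores afterwards.
        for node in names:
            if isinstance(node.ctx, ast.Load) and node.id not in defined:
                findings.append((node.lineno, node.id))
        for node in names:
            if isinstance(node.ctx, ast.Store):
                defined.add(node.id)
    return findings


program = "speed = 3\ndistance = speed * elapsed\nelapsed = 2\n"
print(find_uninitialised_uses(program))  # [(2, 'elapsed')]
```

The example program is never run; the defect on line 2 is found purely from the structure of the code. The sketch also shows the limits discussed under Weaknesses: it cannot tell whether `distance` is computed as the developer intended, and on code with branches it would produce false positives and false negatives.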
Runtime verification (RV) is a semi-formal method aimed at checking the current run of the system. RV is most often run online (e.g., embedded on-board the system during operation), and stream-based (e.g., evaluating requirements continuously throughout system operation), though it can also be run offline (e.g., with recorded data for post-mission analysis).
Inputs: (1) Input data stream or streams containing time-stamped sensor or software values, e.g., sent over a system bus to the RV engine; (2) A requirement to verify, expressed in the form of a temporal logic formula, most commonly Mission-time Linear Temporal Logic (MLTL) [196, 160, 161] or First-Order Linear Temporal Logic (FOLTL) [68, 21] whose variables are set by the input data or operationally via finite state automata, trace expressions , etc. (see Sect 2.1 of ).
Outputs: For online, stream-based RV, the output is a stream of tuples containing a time stamp and the verdict from the valuation of the temporal logic requirement (true or false), evaluated over the data stream starting from that time step. Variations on RV include extensions to non-Boolean (e.g., Bayesian or otherwise probabilistic) results, and evaluations of only part of the data stream of interest.
Strengths: RV is the only formal verification technique that can run during system deployment. Sanity checks can provide near-instant confirmation that a system is obeying its requirements or that a failure has occurred, providing an efficient mitigation trigger. Like simulation, RV can also provide useful characterisations of system runs, though RV analyzes runs in terms of violations of formal specifications; see  for a detailed comparison.
Weaknesses: Specification of the requirement to monitor at runtime is the biggest bottleneck to successful deployment of RV. Specifications are difficult to write for the same reasons as for model checking; further complications include noisiness of sensor data and challenges of real-world environments. Limitations and constraints of embedded systems and certification processes challenge RV implementations, though to date two tools have risen to these real-world-deployment challenges and more are sure to follow [203, 4]. Of course, violation of certain critical safety properties is not acceptable, even if this is detected. Since RV operates at (or after) runtime, it is not suitable for such properties.
Applicability: RV is widely used for software in the Rules Layer, as discussed for example in [Rosu12] and [HavelundR17]; see [TRV12] for another example. RV of multiagent systems using trace expressions [10, 79] is also a viable approach to cope with the Rules Layer. RV has been used for the Reactions Layer, for example in [BartocciBBMS17]. The R2U2 framework [196, 88, 208, 203, 172] is a real-time, Realisable, Responsive, Unobtrusive Unit for runtime system analysis, including security threat detection, implemented in either FPGA hardware or software. R2U2 analyzes rules and reactions spanning the Rules Layer and (with the addition of a parallel system model) the Reactions Layer of our reference framework. For the Principles Layer, [DastaniTY18] surveys the use of norms for monitoring the behaviour of autonomous agents: monitoring of norms is indeed foundational for processes of accountability, enforcement, regulation and sanctioning. The idea of a runtime ethical monitor, or governor, is also commonly used [16, 231].
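The input and output streams of online RV described above can be sketched as follows. This is a simplified illustration in Python and not the algorithm of R2U2 or of any other cited tool: the function `monitor_always` is a hypothetical name, time is assumed to be discrete with one sample per step, and the only requirement supported is a bounded ‘always’ of the form "the predicate holds at the current step and at each of the next `horizon` steps".

```python
from collections import deque
from typing import Callable, Iterable, Iterator


def monitor_always(
    stream: Iterable[tuple[int, float]],
    predicate: Callable[[float], bool],
    horizon: int,
) -> Iterator[tuple[int, bool]]:
    """Yield (timestamp, verdict) tuples for a bounded 'always' property.

    The verdict for timestamp t states whether the predicate holds on
    every sample from t to t + horizon. A verdict is emitted as soon as
    it is known; timestamps whose window is still open when the stream
    ends receive no verdict.
    """
    pending = deque()  # timestamps whose windows are still open
    for timestamp, value in stream:
        pending.append(timestamp)
        if not predicate(value):
            # The failing sample lies inside every open window.
            while pending:
                yield pending.popleft(), False
        elif len(pending) == horizon + 1:
            # The oldest window is complete and saw no violation.
            yield pending.popleft(), True


# Illustrative altitude readings as (time step, metres).
altitude = [(0, 120.0), (1, 115.0), (2, 110.0), (3, 95.0),
            (4, 130.0), (5, 125.0), (6, 140.0)]
verdicts = list(monitor_always(altitude, lambda a: a > 100.0, horizon=2))
print(verdicts)
```

Here the single low reading at step 3 falsifies the requirement for steps 1, 2 and 3, while steps 0 and 4 are confirmed once their windows close. A violation is reported immediately, which is what makes such a monitor usable as a mitigation trigger, but only after the violating sample has already occurred.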
Software Testing (ST) is not a formal method, though formal methods are often used for test-case generation with provable qualities, such as coverage metrics. ST amounts to observing the execution of a software system to validate whether it behaves as intended and identify potential malfunctions. Testing can take place at different levels including Unit Testing, which tests individual units of source code; Integration Testing, aimed at testing a collection of software modules seen as a group; System Testing, carried out on a complete integrated system to evaluate its compliance with respect to its functional and/or system requirements; and User Acceptance Testing, meant to determine if the system meets its requirements in terms of a specification or contract, and allow the user to accept the system, or not.
Inputs: Depending on the test level and on the testing technique adopted for the level, the inputs that feed the ST application can change dramatically. For example, Unit Testing tests source code, while System Testing may follow a black-box approach, with the structure of the system unknown to the tester.
Outputs: As for RV, ST always consists of observing a sample of executions, and giving a verdict over them. The output may be an association between a test case, an execution run, and the respective verdict.
Strengths: ST can be performed in an agile way  and can be directly integrated in the software development method , allowing the software to be developed and tested in the same development cycle. ST techniques may be very efficient; coverage analysis and reliability measures can help in performing an ‘as complete as possible’ testing.
Weaknesses: Software Testing is not exhaustive. Identifying the most relevant units, functions, or functionalities to test is not easy, and finding the right balance between testing enough, but not too much, requires a deep understanding of the system under test. In particular, testing autonomous systems is hard since the range of possible behaviours that can be exhibited by an autonomous agent can be vast [229, 228]. The design and development of test cases is expensive; if the system under test changes, test cases developed for a previous version might not be reusable for the new one. Importantly, ST can never verify the absence of bad or undesirable behaviours, only the (partial) existence of desirable ones. Also unlike formal methods, ST requires a complete, executable system to test, pushing ST verification later in the design lifecycle than formal verification, which easily analyzes partial, early system designs.
Applicability: While software testing is commonly used in industry for both the Reactions Layer and Rules Layer, and is applied for testing features at both levels in autonomous [94, 209] and multiagent systems , we are not aware of proposals for testing the Principles Layer of an autonomous system.
The analysis that we carried out shows that no silver bullet exists for addressing the certification, via formal and semi-formal verification, of autonomous systems. All the methods considered in this section show strengths and weaknesses. Also, the applicability to the Principles Layer is not widespread yet. While a few proposals exist for some formal methods, the challenges to face are many and the initial ideas sprouting from the research community must still consolidate and turn into practical and usable applications.
4 Illustrative Examples
In this section we discuss some examples demonstrating the generality of the framework. The authors are aware that currently these systems are not likely to be certified; however, this might change in the future.
We exemplify systems that can operate independently, and argue that we can extend the framework for the case of operation in a group. For example, we can consider a single autonomous aeroplane, or a swarm of UAS; drive a car, or do car platooning; interact with a stand-alone personal assistant agent (PAA), or use this as an interface to a complete smart-home environment.
Table 7 contrasts over several dimensions some examples of autonomous systems in various domains: the same systems/domains were illustrated in Fig. 5. Note that the second column of Table 7 indicates the (current or near future) scope of autonomy (based on Fig. 5), whereas the next column indicates the (somewhat longer-term future) level of autonomy.
We remind the reader of the levels of autonomy introduced in Section 1.1, Definition 1: no autonomy; low autonomy; assistance systems; partial autonomy; conditional autonomy; high autonomy; full autonomy.
Our examples are often quite distinct and, in particular, differ with respect to the complexity of their decision-making. By this, we mean the amount of information used to determine the system’s behaviour in any given situation. Usually, this complexity is known only to the designer of the autonomous system; it cannot be observed from the outside. In the most basic form, decisions are based directly on inputs of certain sensors such as ultrasonic distance sensors, lidars, cameras, etc. A typical decision would be “if there is an obstacle to the left, then move to the right”. We consider such low-level decisions, where the action is derived directly from the sensory input, to be of low complexity. Additionally, however, the system can have some built-in knowledge concerning its environment that influences its behaviour; this could be map data, time tables, situation databases, rule books, etc. The internal state of an autonomous agent has been modelled in terms of its beliefs, intentions and desires. We consider algorithms which rely on extensive models of the environment to be of medium or high complexity. More sophisticated algorithms take the history of decisions into account, thus “learning from experience”. Such algorithms can adapt to changing environments and tasks, evolving constantly. Thus, even the designer of a system cannot predict its behaviour with certainty if the system has been operating for some time. We consider such decision making to be of very high complexity.
The examples differ also with respect to their potential harm, here defined generally as “negative outcomes” to individuals or society. One aspect of this is safety — the absence of physical harm. In the international standard IEC 61508 on functional safety, five safety integrity levels (SILs) have been defined. SIL0 means there are no safety related requirements, SIL1–4 refer to low, medium, high and very high safety related requirements, respectively. Other domains such as aerospace and automotive have come up with similar safety classifications. The categorization of a (sub-)system into a certain SIL class is done by considering the risk which the system imposes on its environment. This is the likelihood of occurrence of a malfunction multiplied by its possible consequences.
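The notion of risk used above, likelihood of a malfunction multiplied by its consequences, can be written down directly. The short Python sketch below is only an illustration: the function name `risk`, the units, and all numeric values are hypothetical, and no mapping from a risk value to a particular SIL is implied, since the thresholds for that belong to the standard and are not reproduced here.

```python
def risk(likelihood: float, consequence: float) -> float:
    """Risk as likelihood of a malfunction times its possible consequence."""
    if not 0.0 <= likelihood <= 1.0:
        raise ValueError("likelihood must be a probability in [0, 1]")
    if consequence < 0.0:
        raise ValueError("consequence must be non-negative")
    return likelihood * consequence


# Hypothetical figures: (probability of malfunction, consequence severity).
frequent_but_minor = risk(likelihood=0.01, consequence=1.0)
rare_but_severe = risk(likelihood=0.0001, consequence=100_000.0)
print(frequent_but_minor < rare_but_severe)
```

With these assumed figures, the rare but severe malfunction carries the higher risk even though it is a hundred times less likely, which is why systems with very high potential harm attract the strictest safety requirements.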
It is clear that systems/scenarios with higher levels of potential harm will require strong regulation. So, both in Table 7 and later subsections, we have highlighted the level of regulation available. In most cases, this regulation does not mention ‘autonomy’ nor consider that the human might not be “in control”. We would expect that systems with high potential risk will have stronger forms of regulation. The current amount of available regulation for each of the sample systems is indicated in Table 7, ranging from ‘low’ (scattered or unsystematic regulations; points of inconsistency or obsolescence in regulations) to ‘high’ (comprehensive regulatory framework; consistent and current regulations). In addition, once autonomy is introduced, and especially where this can impact upon safety or other potential harms, then enhanced regulations for this aspect will be essential. As yet, specific attention to aspects and implications of autonomy is rarely made.
|System||Scope of Autonomy||Targeted Future Autonomy||Complexity of Decision Making||Potential Harm||Amount of Existing Regulation|
|Robot vacuum cleaner||low||high||low||none||low|
|Autonomous trading system||low-medium||high||low||high||low-medium|
|Driverless train||low-medium||full||medium||very high||high|
|Unmanned aerial system||medium||full||high||very high||medium|
|Personal assistant agent||high||full||very high||low-medium||low|
|Home assistant robot||high||high||very high||medium||low|
In the following, for each example we first describe the functionalities, and then position the functionalities of the example w.r.t. the three layers in our reference autonomy framework. We then describe the trends in the domain, in particular around the future level of autonomy per Definition 1. Subsequently, we comment on the safety-criticality of the example, and hence the amount of likely needed and available regulation. Finally, we comment on suitable verification, validation, and analysis techniques for the example.
4.1 Simple case: Robot Vacuum Cleaner
Functionalities: We begin with a simple example: a domestic robot tasked with simple objectives, such as vacuuming or sweeping the floor, or cutting grass in the lawn.
Positioning w.r.t. the layers in our reference autonomy framework: In terms of the three layer model, this sort of domestic robot illustrates that in some situations, not all of the layers are needed. There is, of course, a need for a Reactions Layer, which deals with physical aspects of the environment (sensing, positioning, locomotion, and physical actions). There is also a need for a Rules Layer that follows well-defined rules for ‘normal’ situations. Such rules might specify a regular schedule for cleaning, as well as indicating when this schedule might be abandoned or changed (e.g., due to human request, or the system detecting a situation in which cleaning ought to be postponed, such as a dinner party). The vacuum cleaner must limit its maximum speed to avoid harm, change direction if it encounters an obstacle, stop if it detects a fault of any kind, and obey the boundaries of its electric fence. The Principles Layer of a vacuum cleaner is trivial, since the system’s functionality is relatively simple. Specifically, we do not expect the system to deal with any unusual situations that would require it to override the ‘normal situation’ rules. The cases when the robot is stuck, or jammed, or too hot due to unexpected warming, can be considered as ‘normal situations’ and dealt with by the simple ‘switch off’ rule. The only way the application of this rule could cause harm would be for a human to stumble into the robot after it stopped in an unexpected position, but in this scenario we are assuming that humans can cope with the presence of the robot in their home, which means that they have the ability to perceive and avoid it.
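The Rules Layer behaviour described above (limit speed, change direction at an obstacle, stop on any fault, respect the electric fence) is simple enough to be sketched as a prioritised rule table. The Python fragment below is an illustrative assumption, not the control program of any real product: the types `Percepts` and `Command`, the function `rules_layer`, and the speed limit `MAX_SPEED` are all hypothetical.

```python
from dataclasses import dataclass

MAX_SPEED = 0.3  # hypothetical speed limit in metres per second


@dataclass(frozen=True)
class Percepts:
    fault_detected: bool    # stuck, jammed, overheated, ...
    obstacle_ahead: bool
    at_boundary: bool       # reached the electric fence
    requested_speed: float  # speed proposed by the cleaning schedule


@dataclass(frozen=True)
class Command:
    action: str   # "switch_off", "turn", or "forward"
    speed: float


def rules_layer(p: Percepts) -> Command:
    """Apply the rules in priority order; the first matching rule wins."""
    # Rule 1: any fault is handled by the simple 'switch off' rule.
    if p.fault_detected:
        return Command("switch_off", 0.0)
    # Rule 2: change direction at an obstacle or at the boundary.
    if p.obstacle_ahead or p.at_boundary:
        return Command("turn", 0.0)
    # Rule 3: otherwise move forward, never exceeding the speed limit.
    return Command("forward", min(max(p.requested_speed, 0.0), MAX_SPEED))


print(rules_layer(Percepts(False, False, False, requested_speed=0.8)))
print(rules_layer(Percepts(True, True, False, requested_speed=0.8)))
```

Because no rule ever needs to be overridden, there is nothing here for a Principles Layer to do, which matches the observation that this layer is trivial for such a device. With four Boolean-like inputs, a table of this size is also small enough to be checked exhaustively.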
Level of (future) autonomy: According to our categorization, such robots are designed to operate highly autonomously in their respective environment. Human assistance is necessary only for the initial set up, for emptying the dustbin, refilling cleansing fluids, maintaining brushes or blades, and repositioning the robot in case it is stuck or out of battery power. However, the complexity of decision making in these devices is low: often, the control programs are based on a few sensory inputs only. Many of these robots navigate the area in a random fashion, turning upon contact with another object, and moving towards the docking station guided by an infrared beam if the power is low. More sophisticated models can learn a map of their operating environment and use advanced path planning methods to traverse the whole area. There are no complex interactions with other devices or actors in the environment. The scope of autonomy is also low, given the limited autonomous functionality of these systems.
Safety criticality (of autonomous aspects): For this example we exclude the possibility of the tasks having any direct safety consequence (such as caring for ill humans, having responsibility for monitoring for injury, or providing medication), and we assume that the humans interacting with the robot are aware of its existence and can cope with it in a ‘normal’ way. Therefore, the safety criticality of the devices is low.
Amount of available regulation (for autonomous aspects): With respect to available regulation, there are just the general laws governing the use of electric equipment, batteries, etc.
Suitable verification, validation, and analysis techniques: Cleaning robots are often adopted as simple and affordable examples to show the features and potential of different programming languages, tools, and even verification and validation techniques. Recently, four of the latest premium robot vacuum cleaners were analyzed and compared by IoT experts according to security and privacy issues, but no formal techniques were used there. Concerns have arisen since some companies wanted to use data collected by such home devices (layout of the house, operating hours, etc.) for marketing purposes. While we agree that all the techniques reviewed in Section 3.4 are suitable for checking that the robot Reactions Layer and Rules Layer behave as expected, their application on real systems is often considered not to be worth the effort, given the absence of safety issues. However, these devices make good exercises for academic education in these techniques.
4.2 Autonomous Trading System
Functionalities: Autonomous trading systems (ATSs) operate in financial markets, submitting buy and sell orders according to some trading strategy. ATSs date back to 1949, but came to widespread attention after the ‘flash crash’ on the US stock exchange in 2010. By 2014 it was reported that in excess of 75% of trades in public US security markets were by ATSs. An ATS submits market instructions, primarily in the form of a buy or sell order. It may decide when to do so, and the commodity, quantity, and bid or ask price. ATSs can operate in any market where they are permitted, such as a stock market or an energy market. In notable contrast to unaided human traders, ATSs can perform high-frequency trading. The price prediction techniques and trading strategies vary. For example, the stock price of a single company might be predicted using a non-linear function approximator, and the trading strategy might be based on a set of rules.
Positioning w.r.t. the layers in our reference autonomy framework: With respect to the layers in our reference autonomy framework, we can locate the communications protocols and simple but fast responses to market stimuli (“if the price is below threshold, then issue a sales request”) at the Reactions Layer. More sophisticated trading rules and strategies, which take general market evolution and developments into account (“if there is a shortage of oil, then coal prices are likely to rise”), are located at the Rules Layer. Although we might anticipate a possible Principles Layer that reasoned about unusual situations, such as suspending the usual trading strategy in the face of a stock market crash, the speed of trading makes this unlikely, and in practice, we would not expect a Principles Layer to be used.
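The difference between the two layers can be made concrete with a minimal sketch that encodes the two example rules quoted above. The function names, the threshold value, and the representation of market facts as a set of strings are all hypothetical choices made for illustration; a real ATS would be far more elaborate.

```python
from typing import Dict, Optional, Set

PRICE_THRESHOLD = 100.0  # hypothetical trigger price


def reactions_layer(price: float) -> Optional[str]:
    """Fast stimulus-response rule: react to a single price tick."""
    # "if the price is below threshold, then issue a sales request"
    if price < PRICE_THRESHOLD:
        return "issue_sales_request"
    return None


def rules_layer(market_facts: Set[str]) -> Dict[str, str]:
    """Slower rule that reasons about general market developments."""
    expectations: Dict[str, str] = {}
    # "if there is a shortage of oil, then coal prices are likely to rise"
    if "oil_shortage" in market_facts:
        expectations["coal"] = "likely_to_rise"
    return expectations
```

The Reactions Layer rule depends only on the current stimulus, whereas the Rules Layer rule draws a conclusion about one commodity from a fact about another, which is the kind of market knowledge the text places one level higher.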
Level of (future) autonomy: Current ATSs are already capable of operating autonomously, without moment-by-moment human intervention. Nonetheless, an ATS can also be used as an assistant by a human trader, for example to automatically build or maintain a position specified by the trader. It is likely that the degree of autonomous operation and the time-scale of operation will increase, while humans become less able to have oversight of the trading behaviour except at a high level. The scope of autonomy depends on the sophistication of the ATS, and varies from low (for relatively simple ATSs that follow simple trading rules) to medium (for more sophisticated ATSs).
Safety criticality (of autonomous aspects): Given the disruption to financial markets from ATSs, they have a high level of potential harm, and hence are considered safety-critical. Issues of concern with respect to safety include software errors, herd trading, market fragility, manipulative market strategies, lack of scrutability, the need for an algorithmic ‘kill switch’, and fair market access.
Amount of available regulation (for autonomous aspects): Given the safety criticality of ATSs, the needed regulation is high. Regulation has increased since ATSs became a standard tool in the financial sector in the 21st century. In the US, the financial authorities introduced stricter rules following the 2010 crash. These pertain to algorithmic trading behaviour, but also to the use of ATSs more generally, and the responsibility of the companies using ATSs to have adequate supervision of ATSs.
Suitable verification, validation, and analysis techniques: These ATSs are essentially feedback control systems. As such, standard analysis techniques from control systems such as analytic stability proofs or algorithmic hybrid systems verification [95, 186] could be deployed. However, such approaches require a very good model of the environment, and its reactions, in which the trading system resides. In particular, the environment includes other ATSs, and due to commercial sensitivity, little is likely to be known about them. Without a precise environmental formalism, verification and validation techniques are of little value. Testing can be applied in this area but, again, strong verification cannot be achieved. Consequently, there is a danger of ‘run-away’ feedback loops that are impossible to predict beforehand.
4.3 Driverless Trains
Functionalities: Automatic railway systems for the transportation of people have been in use since the beginning of the 21st century. Generally, these systems operate on dedicated routes, which are protected from human access by special barriers. Moreover, the systems are often designed in such a way that a collision of trains is physically impossible. Therefore, reliability of these systems is usually very high, with failures being caused by mechanical fatigue rather than by software errors.
Positioning w.r.t. the layers in our reference autonomy framework: All modern rail vehicles have a Train Control and Management System (TCMS), which in conjunction with the on-board signalling computer (OBC) is responsible for the basic operations of the train. With respect to our three-level framework, TCMS and OBC realize the Reactions Layer. In the autonomous subway example described above, the Rules Layer is responsible for planning the journey, stops in the station, etc. A Principles Layer would be needed at least for unattended train operation, and even more for trains which calculate their own route on the tracks. There are many ‘unusual’ situations that can arise in such a scenario, which have to be dealt with by a Principles Layer.
Level of (future) autonomy: Most present-day driverless trains, e.g., people movers in airports, have almost no autonomy. They operate according to fixed time schedules, with fixed distances to other trains, and under tight supervision of a human operator. The situation becomes more interesting when looking at self-driving underground vehicles. Some of these trains can also drive on tracks used by conventional subway trains. Thus, the autonomous vehicle has to decide when to advance, and at which speed. In Nürnberg, such a system has been operating for over a decade, with more than 99% punctuality and without any accidents. A main benefit of the autonomous operation is that it allows for a higher train frequency than with human-driven trains. In underground train systems, the probability of an obstacle on the tracks is rather low. For trams, regional trains, or high-speed trains, this is different. People or animals may cross the tracks, and thunderstorms may blow trees onto the tracks. Therefore, one of the main tasks of a train driver is to supervise the journey and to brake if necessary. Other tasks include monitoring for unusual events, checking the equipment before the trip, and passenger handling. There is an increasing number of electronic driver advisory systems such as speed recommendation systems, driver information systems, and others. Currently, all these systems leave responsibility for the trip to the driver. However, there is no compelling argument that an automated system could not do this task, increasing the autonomy of the system even further. Even so, the scope of autonomy is not high, since the functionality of the system is constrained by its environment (running on tracks).
The standard IEC 62290-1:2014 (Railway applications – Urban guided transport management and command/control systems – Part 1: System principles and fundamental concepts) defines five grades of automation (GoA). These can be compared with the levels of autonomy given in Definition 1. The fourth grade is driverless train operation (DTO), where the only tasks of the (human) train attendant are to handle passengers, control the doors and to take over control in emergency situations. The fifth grade is unattended train operation (UTO), where there is no personnel from the train line operator on board. This means that the electronic systems must also be able to deal with emergency situations.
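For readers who prefer a compact view, the five grades can be written as an enumeration. This is a sketch only: the text above names just the fourth and fifth grades, so the names of the three lower grades and the numbering from 0 to 4 are stated here as they are commonly cited for the standard and should be checked against it. The helper `emergency_handler` is a hypothetical function that encodes the single distinction drawn in the text.

```python
from enum import IntEnum


class GoA(IntEnum):
    """Grades of automation; lower three names are assumptions (see lead-in)."""
    ON_SIGHT = 0        # assumed name: on-sight train operation
    NON_AUTOMATED = 1   # assumed name: non-automated train operation
    SEMI_AUTOMATED = 2  # assumed name: semi-automated train operation
    DRIVERLESS = 3      # fourth grade: driverless train operation (DTO)
    UNATTENDED = 4      # fifth grade: unattended train operation (UTO)


def emergency_handler(grade: GoA) -> str:
    """Who deals with emergencies at a given grade (hypothetical helper)."""
    # Only in UTO is there no operator personnel on board, so the
    # electronic systems must handle emergency situations themselves.
    return "system" if grade is GoA.UNATTENDED else "human"
```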
Safety criticality (of autonomous aspects): Compared to other domains, railway homologation and operation is strictly regulated. For the approval of a particular system, the authorities rely on notified bodies, designated bodies and associated bodies. These are institutions that assess the conformity of rail equipment to certain standards such as Cenelec 50128. The assessment processes are strongly schematic. For safety-critical software, known best practices, including formal methods, have to be applied.
Amount of available regulation (for autonomous aspects): However, the available regulation in the railway sector is not prepared to deal with higher levels of autonomy. For example, the movement authority, i.e., the permission to enter a certain block, is currently given by the roadside equipment or the railway control centre. There are research projects investigating whether it is possible to transfer this responsibility to the train; this would require safe localization of a train and reliable communication between trains. In such a truly autonomous setting, the ‘electronic driver’ not only has to ensure that the track is free from obstacles, but also that there are no other trains in its way. Currently, it is unclear how such a system could be officially approved.
Suitable verification, validation, and analysis techniques: Railway software must be extensively tested and verified before it can be put into use. Due to the cost and complexity, formal verification is done for functions of the highest criticality only. These are mainly the signalling and braking functions. Testing is done on unit, integration, and system level: in contrast to the automotive domain, road tests are extremely expensive; thus, developers strive to limit the necessary road tests to a minimum. For autonomous driving, the problem becomes dramatically more severe, since it is not possible to test millions of different situations on the track.
4.4 Unmanned Aerial System
Functionalities: The term ‘Unmanned Aerial Systems’ (UAS) covers a wide range of airborne vehicles, from toy quadrocopters to military drones and autonomous missiles. Whereas for transportation of people an airplane without a pilot is (as yet) out of the question, the transport of goods and material by UASs is an exciting prospect. However, the main public concern is that a falling UAS might harm people on the ground.
Positioning w.r.t. the layers in our reference autonomy framework: The mapping to our three layer framework is fairly clear, as is the need for all three layers. Rapid responses to the physical environment are provided by the Reactions Layer, and there is a need to be able to apply rules to usual situations (Rules Layer) as well as deal with unanticipated situations by applying principles (Principles Layer).
Level of (future) autonomy: Already now, human pilots have a wealth of support systems at hand: navigation systems, autopilots, traffic collision avoidance systems, etc. Vital flight information is recorded and transmitted to ground control. Cameras are available providing a visual image from the cockpit. Therefore, moving from a human pilot within an air vehicle, to a human pilot remotely controlling the vehicle, and then towards the vehicle controlling itself autonomously, is a very popular approach. Yet it is fraught with difficulties (as discussed in earlier sections, in particular Section 2). In the near future we expect the scope of autonomy to be somewhat limited (medium), whereas it is possible that longer-term we will see UASs with higher levels of autonomous functionality.
Safety criticality (of autonomous aspects): Even though we are not (yet) considering UAS that carry human passengers, the safety criticality of UAS is very high. This is for the simple reason that a significant malfunction might lead to a collision, either between the UAS and a ground object (e.g. building, people), or between a UAS and another aircraft (potentially carrying passengers).
Amount of available regulation (for autonomous aspects): To provide a practical route to certification, including for non-autonomous systems, a range of documents/standards have been provided, perhaps the most relevant being the FAA documents Software Considerations in Airborne Systems and Equipment Certification [190, 194], Formal Methods Supplement to DO-178C and DO-278A, and Design Assurance Guidance for Airborne Electronic Hardware. These standards provide directions toward the use of formal methods in the certification of (traditional) air systems, but say relatively little about autonomous (or even semi-autonomous) operation. The FAA has recently published official rules for non-hobbyist small UAS operations, Part 107 of the Federal Aviation Regulations, that certifies remote human operators for a broad spectrum of commercial uses of UAS weighing less than 25 kg.
Suitable verification, validation, and analysis techniques: One approach, discussed in Section 3.3, to the analysis and potential certification of unmanned, particularly autonomous, air systems is to begin to show that the software replacing the pilot’s capabilities effectively achieves the levels of competence required of a human pilot. For example, if we replace the ‘flying’ capabilities of a pilot by an autopilot software component, then we must prove the abilities of the autopilot in a wide range of realistic flight scenarios. While this corresponds to the operational/control Reactions Layer within our approach, we can also assess the decision-making against the required rules that UAS should follow. Pilots are required to pass many tests before being declared competent, with one such being knowledge of the ‘Rules of the Air’. In prior work, the agent making the high-level decisions in a UAS has been formally verified against (a subset of) the ‘Rules of the Air’. This type of analysis of the system’s Rules Layer is necessary, though not sufficient, to provide broader confidence in autonomous air systems. Human pilots possess strong elements of ‘airmanship’ and the ability and experience to cope with unexpected and anomalous situations. It is here that the Principles Layer must capture this more complex decision-making. An initial attempt at the formalisation and verification of ethical decisions in unexpected situations has also been undertaken, wherein a simple ethical ordering of outcomes was used to cope with situations for which the UAS had neither expectations nor rules.
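To give a feel for what ‘a simple ethical ordering of outcomes’ could look like in code, the following sketch ranks outcome categories by severity and picks the action whose worst predicted outcome is least severe. It is an illustrative assumption, not the formalisation used in the work referred to above: the outcome categories, their ordering, and the function `choose_action` are all hypothetical.

```python
from typing import Dict, List

# Hypothetical ethical ordering: a lower rank is a less bad outcome.
SEVERITY: Dict[str, int] = {
    "no_damage": 0,
    "damage_to_uas": 1,
    "damage_to_property": 2,
    "harm_to_people": 3,
}


def choose_action(options: Dict[str, List[str]]) -> str:
    """Pick the action whose worst predicted outcome is least severe.

    `options` maps each available action to a non-empty list of the
    outcome categories it might lead to.
    """
    return min(options, key=lambda a: max(SEVERITY[o] for o in options[a]))


# Usage: an emergency landing with three hypothetical options.
options = {
    "land_on_road": ["damage_to_uas", "harm_to_people"],
    "land_in_field": ["damage_to_uas"],
    "land_on_roof": ["damage_to_uas", "damage_to_property"],
}
best = choose_action(options)  # "land_in_field"
```

A worst-case comparison is only one possible design choice; the point of the sketch is that the ordering itself, rather than any situation-specific rule, drives the decision.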
4.5 Self-Driving Car
Functionalities: Currently, the development of self-driving cars is one of the main innovation factors (besides e-mobility) in the automotive industry. Advanced driver assistance functions such as adaptive cruise control, lane keeping, or parking, are available in many cars. By combining such driver assistance systems, partially autonomous driving (e.g., driving on the highway for a limited time) can be realized. Although not yet available to end customers, cars with conditional autonomy (e.g., driving on the highway for an extended distance or autonomous parking in a parking deck) are already state of the art. Every major manufacturer has concrete plans to bring highly automated cars to the market in the next few years, although there is some debate as to whether these plans are feasible. Intermediate steps towards increasing autonomy include vehicle platooning/convoying whereby vehicles commit to being part of a (linear) formation and cede control of speed/direction to the platoon/convoy leader [213, 28, 207].
Positioning w.r.t. the layers in our reference autonomy framework: In terms of our three-level categorisation, current ‘driverless’ cars essentially only have the lowest operational/control Reactions Layer to handle lane following, speed control, etc, while the human driver is responsible for following the rules and regulations concerning road use and for coping with unexpected situations or failures.
Once we move to the platoon/convoy (or ‘road train’) concept, the aim is that more of the responsibility for following the (developing) regulations is taken on by the autonomous system. Again, the human driver will be responsible for unexpected situations and failures, and will also take over responsibility for road use once the vehicle leaves the platoon/convoy. Thus, in this case, a Rules Layer is required.
However, the fully autonomous vehicles that we expect in the future will require software to be able to handle not just operational aspects such as steering, braking, parking, etc, but also comprehensive regulations such as the “rules of the road”. Furthermore, the Principles Layer of our autonomous system must conform to some high-level requirements concerning unexpected situations. Of course these do not exist yet, but we might expect them to at least have some quite straightforward basic principles such as
avoid harm to humans where possible
if a problem occurs, safely move to an approved parking area
Of course, many more principles/properties will surely be needed.
Level of (future) autonomy: A long-term vision is that self-driving cars are available on demand. If some people need a transport service, they can call an autonomous taxi via a smartphone app. Thus, there is no necessity to buy and maintain one’s own car; resources are shared between many users. For this vision to become a reality, cars must be able to drive fully autonomously to any desired destination in all traffic situations, and must moreover be able to collaborate with the rest of the fleet to deliver an optimal service. As this vision is gradually realised, self-driving cars will have increasing scope of autonomy (eventually high), but in the nearer term we expect more limited scope of autonomy.
Safety criticality (of autonomous aspects): Clearly, cars have a high level of safety criticality. The development of automotive software must follow functional safety standards, such as ISO 26262. This standard also prescribes the safety requirements to be fulfilled in the design, development and validation. For the verification and validation of software functionalities on the Reactions Layer and Rules Layer, we regard the existing techniques and regulations as being sufficient.
Amount of available regulation (for autonomous aspects): While there is considerable hype suggesting that such vehicles are imminent, neither the automotive industry nor the regulators are at the point of being able to certify their reliability or safety. Global legal constraints, such as the Vienna Convention on Road Traffic, require that there must always be a person who is able to control the safe operation of the vehicle. How to amend these regulations is currently under political discussion. The US Department of Transportation has recently issued a “Federal Automated Vehicles Policy”. It includes a 15-Point Safety Assessment, including “Testing, validation, and verification of an HAV system”. For SAE levels 4 and 5, if the driver has handed the control to the vehicle, liability rests with the manufacturer. Thus, most car manufacturers follow a “trip recorder” approach: a black box specially protected against tampering is used to collect all relevant data, and can be used to assign liability in case of an accident. However, with current legislation liability rests on factors such as human reaction times, weather conditions, etc. It is an open question whether these factors can be transferred to the case that an autonomous agent is the driver instead of a human.
Suitable verification, validation, and analysis techniques: Concerning the extraction of requirements and the verification of these requirements on a specific vehicle, we can see how the operational/practical constraints will be derived from many of the existing requirements on cruise control, lane following, etc, while we will see refined requirements for the sensing/monitoring capabilities within the vehicle. Verifying all these aspects often involves a suite of (acceptable) testing schemes. In addition, it is clear that, just as the agent controlling an autonomous air vehicle can be verified with respect to the “rules of the air” (see above), the agent controlling an autonomous road vehicle can be verified against the “rules of the road”.
However, when it comes to the functionalities of the Principles Layer, the rules that a human driver is expected to follow need to be transformed into requirements suitable for an autonomous agent. This will also include sensor/camera reliability and confidence in obstacle/sign/pedestrian recognition, and will also feed into the risk analysis component within the agent. Examples of such ‘Principles Layer rules’ are “if there is low perceived risk of collision, continue at the highest admissible speed”, and “if there is high perceived risk of collision, or low confidence in the perceptions, then slow down”. In fully autonomous vehicles, the reliability of the Principles Layer is a crucial aspect. It must be able to cope with unexpected events, failures, aggressive road users, etc. Thus, the extraction of suitable requirements and their validation is highly important.
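The two example ‘Principles Layer rules’ quoted above can be turned into a small executable requirement, which also shows where the perception confidence enters the decision. This is a sketch under stated assumptions: the numeric thresholds, the slow-down factor, and the function name `target_speed` are hypothetical, and turning ‘low’ and ‘high’ into numbers is precisely the requirements-extraction problem discussed in the text.

```python
RISK_THRESHOLD = 0.2        # hypothetical: at or above this, risk is 'high'
CONFIDENCE_THRESHOLD = 0.8  # hypothetical: below this, confidence is 'low'
SLOWDOWN_FACTOR = 0.5       # hypothetical: how strongly to slow down


def target_speed(collision_risk: float,
                 perception_confidence: float,
                 current_speed: float,
                 max_admissible_speed: float) -> float:
    """Return the speed to aim for in the next control step."""
    # "if there is high perceived risk of collision, or low confidence
    #  in the perceptions, then slow down"
    if (collision_risk >= RISK_THRESHOLD
            or perception_confidence < CONFIDENCE_THRESHOLD):
        return current_speed * SLOWDOWN_FACTOR
    # "if there is low perceived risk of collision, continue at the
    #  highest admissible speed"
    return max_admissible_speed
```

For instance, `target_speed(0.05, 0.95, 20.0, 30.0)` yields the admissible maximum, whereas a confident perception of high risk, or a doubtful perception of low risk, both lead to slowing down.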
4.6 Personal Assistant Agent
Functionalities: The now-ubiquitous personal assistant agent has origins earlier than the web [97, 99] and became mainstream with Apple’s Siri in 2010. The vision for such a PAA agent is proactive, personalised, contextual assistance for a human user. A PAA agent is a software entity, although the services it connects to and can command may include physical services such as smart home devices. So far, PAAs are mainly voice-controlled assistants that react to user requests such as looking for information on the web, checking the weather, setting alarms, and managing lights and thermostats in a smart home. Few apps classified as smart personal assistants connect to service providers, banks, or social networks. 24me is one of them: according to its description on the web, it “automatically tells you what you need to do and also helps you take care of your things from within the app”. This technology moves on rapidly, however, and Gatebox’s “Azuma Hikari” system (https://www.gatebox.ai/en/hikari) promises a startling development:
“The Comforting Bride: Azuma Hikari is a bride character who develops over time, and helps you relax after a hard day. Through day-to-day conversations with Hikari, and her day-to-day behavior, you will be able to enjoy a lifestyle that is more relaxed”. [https://www.gatebox.ai/en/hikari]
Positioning w.r.t. the layers in our reference autonomy framework: The Reactions Layer for a PAA consists of the PAA connecting to local and web services. The services can affect the world and the user’s day-to-day life in both smaller and larger ways: from opening an app on a mobile device, to reserving a table at a restaurant; from changing the temperature in the user’s house to making a bank payment from her account. The Rules Layer, currently only realized in sophisticated systems, is responsible for the combination of such services, e.g., according to user preferences (“one hour after the scheduled dinner reservation, begin heating the house”). The Principles Layer would be responsible not only for exceptional and emergency circumstances (e.g., calling for help when the human needs it), but also for giving advice in personal matters (e.g., whether to sign an insurance contract or not). The “Azuma Hikari” character has the potential for quite an array of autonomous behaviour, not just setting out required behaviours in the Rules Layer, but even providing quite sophisticated Principles Layer behaviours.
Level of (future) autonomy: The level and scope of autonomy are both limited so far, but we see a huge potential for PAAs to become more and more autonomous, implementing Negroponte’s vision of the agent which “answers the phone, recognizes the callers, disturbs you when appropriate, and may even tell a white lie on your behalf. […] If you have somebody who knows you well and shares much of your information, that person can act on your behalf very effectively. If your secretary falls ill, it would make no difference if the temping agency could send you Albert Einstein. This issue is not about IQ. It is shared knowledge and the practice of using it in your best interests”. Again, the “Azuma Hikari” approach provides the possibility for quite a lot more autonomous behaviour.
Safety criticality (of autonomous aspects): Depending on what the PAA can control, it may actually be a safety-critical system: if it can make financial transactions, it might unintentionally cause economic loss to its user; if it can send messages on behalf of, or even pretend to be, its user, it might cause severe damage to professional and personal relations. Indeed, if the PAA is given control over aspects of the user’s physical environment (such as locks on their home’s doors), the possibilities for harm become more significant. While not safety-critical, the “Azuma Hikari” assistant clearly raises some potential ethical issues as captured in BS8611, not least with the anthropomorphism. If we extend our view of safety beyond physical harms to more general psychological/ethical harms then there might well be issues here.
Amount of available regulation (for autonomous aspects): To our knowledge, there is no regulation of PAAs, other than general consumer electronics regulation and regulations in specific sectors which coincidentally apply to the PAA’s services such as regulations concerning protocols for accessing banking institution services. While there is no single data protection legislation in the US, in Europe the General Data Protection Regulation (GDPR) became operative in May 2018. GDPR regulates the processing by an individual, a company or an organisation of personal data relating to individuals in the EU. A PAA based in Europe should comply with the GDPR. As an example, it should not share information by making unauthorized calls on behalf of its user and should not share sensitive data (pictures, logs of conversations, amount of money in the bank account, health records) of its user and his or her relatives. It should also take care of where these personal data are stored to ensure an acceptable level of protection. On the other hand, a PAA should be allowed to override privacy laws and call emergency services if it suspects its user is unconscious from an accident, suicidal, or otherwise unable to ask for help, and it should share health information with the first aid personnel. Relating to the ethical/psychological issues above, standards such as BS8611 and the IEEE’s P7000 series are of relevance.
Suitable verification, validation, and analysis techniques: Like any complex piece of software, the PAA should undergo a careful testing stage before release. However, we believe that the most suitable technique to prevent unsafe behaviour is runtime verification. Financial transactions might be monitored at runtime, and blocked as soon as their amount, or the number of exchanged messages, exceeds a given threshold. Calls to unknown numbers should not necessarily be blocked, as they might be needed in some emergency cases, but they should be intercepted and immediately reported to the user. Commands to smart devices should be prevented if they belong to a sequence of commands matching some dangerous pattern.
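A runtime monitor of this kind can be sketched in a few lines. The sketch below is illustrative only and is not tied to any real PAA or runtime verification tool: the class `Monitor`, its verdict strings, the payment limit, and the example ‘dangerous pattern’ (unlocking the door and then disabling the alarm) are all hypothetical assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class Monitor:
    """Hypothetical runtime monitor sitting between a PAA and its services."""
    payment_limit: float = 500.0
    known_numbers: Set[str] = field(default_factory=set)
    dangerous_pattern: Tuple[str, ...] = ("unlock_door", "disable_alarm")
    history: List[str] = field(default_factory=list)

    def check_payment(self, amount: float) -> str:
        # Block a transaction as soon as its amount exceeds the threshold.
        return "block" if amount > self.payment_limit else "allow"

    def check_call(self, number: str) -> str:
        # Calls to unknown numbers are not blocked (they may be needed in
        # an emergency) but are intercepted and reported to the user.
        return "allow" if number in self.known_numbers else "allow_and_report"

    def check_command(self, command: str) -> str:
        # Prevent a command if it would complete the dangerous sequence.
        candidate = self.history + [command]
        n = len(self.dangerous_pattern)
        if tuple(candidate[-n:]) == self.dangerous_pattern:
            return "block"
        self.history.append(command)  # only executed commands are recorded
        return "allow"
```

For example, a monitor created with `Monitor(known_numbers={"112"})` allows `"unlock_door"` on its own but blocks a directly following `"disable_alarm"`, since the two together match the pattern. A blocked command is not added to the history, so later harmless commands are unaffected.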
4.7 Home Assistant Robot
The situation changes if we consider home assistant robots that are not only performing simple tasks, but also issuing medical reminders, providing simple physiotherapeutic treatment, and even offering ‘companionship’. Consider a robot such as the ‘Care-o-Bot’ (http://www.care-o-bot.de). A number of domestic assistants are already available. These span a range, from the above Robot Vacuum Cleaner right up to Care-o-Bot systems, and similar systems, such as Toyota’s Human-Support Robot Family (www.toyota-global.com/innovation/partner_robot). We will primarily consider the more sophisticated end of this spectrum.
Functionalities: Robots such as Care-o-Bot are intended to be general domestic assistants. They can carry out menial cleaning/tidying tasks, but can also (potentially, given the regulatory approval) fetch deliveries from outside the home and provide food and drink for the occupant(s) of the house. Each such robot is typically mobile (e.g., wheels) and has at least one manipulator, usually an ‘arm’. Yet this type of robot can potentially go further, becoming a true social robot, interacting with the occupant(s) (e.g., what to get for lunch?) and engaging in social dialogue (e.g., favourite TV show?). These robots (again, with appropriate approval) also have the potential to measure and assess physical and psychological health. Clearly, with this wide range of functionalities and increasing autonomy, robots of this form should be highly regulated and will have many non-trivial decisions to make.
Positioning w.r.t. the layers in our reference autonomy framework: The need for all of our layers is clear. Any physical interaction with the environment, or movement within a house, will require a complex Reactions Layer. Similarly, one would expect the regulatory framework for such robotic systems to be quite complex and so, necessarily, the Rules Layer will also likely be comprehensive. In the future the combination of broad functionality, increasing autonomy, and responsibility for health and well-being of human occupants will surely lead to quite complex ethical issues [11, 169, 205, 50] and so to a truly non-trivial Principles Layer.
Level of (future) autonomy: In principle, the scope and level of autonomy can both be high but, as noted below, the regulatory frameworks limit this at present. In the future, home care robots might become much more than service providers. Especially with people who are isolated and have little human contact, these robots might begin to guide and accompany their owners, remind them to take their medicine or drink water, and even provide simple social companionship (for example, asking about relatives, about the news, or about TV programmes seen recently). Especially in these latter cases, the issue of trust is crucial [8, 206, 26], since no one will use a robot in this way if they do not trust it. It becomes a central issue for the Principles Layer to assess how much to build trust and how much to stay within legal/regulatory bounds. Doing what the owner wants it to do can build trust, whereas refusing to do something because of minor illegality erodes trust very quickly.
Safety criticality (of autonomous aspects): In addition to the various issues that apply to Personal Assistant Agents, the broader role of Home Assistant Robots makes them more safety critical. For instance, if medicine is administered incorrectly, or if a Home Assistant Robot fails to raise the alarm when a human has an adverse health event, then lives may be put at risk. Therefore, it appears that clear regulations specifically developed for such robots will be needed, alongside standards targeting exactly the issues concerning these installations. With the complex practical, social, and health-care potential, the range of standards needs to go beyond physical safety and on to issues such as “ethical risk”.
Amount of available regulation (for autonomous aspects): There appears to be very little current regulation specifically designed for autonomous robotics, let alone for autonomous domestic assistants. While there are some developing standards that have relevance, such as ISO 13482, the regulatory oversight for domestic robotic assistants is unclear. In some cases they are treated as industrial robots, in some cases as medical devices (especially if health monitoring is involved), and in others they are straightforwardly treated as electro-mechanical devices. In some jurisdictions (e.g., the UK) responsibility for regulation falls to the local regulatory authority, whereas in others regulatory oversight is a national responsibility. Furthermore, the standards that are appealed to are not specific standards for autonomous domestic robots, but rather standards for software, robots, or medical appliances.
One particular challenge concerns learning new behaviours. These might be specified by the user or might result from analysis by the robot. These learned behaviours might well be ‘new’, i.e., not part of the robot’s behaviour when initially deployed. In this case, there should be some mechanism to assess whether the new, learned behaviour is consistent with the prescribed regulations. However, this assessment must either take place offline, in which case there will be some ‘down time’ for the robot, or online, in which case we will need effective and efficient techniques for carrying out this assessment. In the case where the human has specified/requested some new behaviour that is ‘in conflict’ with the regulations, the Principles Layer becomes essential as it must make decisions about whether to employ, or ignore, this new behaviour.
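One way to picture the online assessment described above is as a conformance check that runs before a learned behaviour is adopted. The following is a minimal sketch only, assuming that the relevant regulations can be encoded as simple predicates over individual actions. The `Action` fields, the rule names, and the 500 mg threshold are all hypothetical and are not taken from any standard.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Action:
    """One step of a learned behaviour (fields are hypothetical)."""
    name: str
    dose_mg: float = 0.0         # medicine dose handed to the user, if any
    door_unlocked: bool = False  # whether the step leaves a door unlocked


# A regulation is a named predicate that returns True when an action complies.
Regulation = Tuple[str, Callable[[Action], bool]]

REGULATIONS: List[Regulation] = [
    ("max_dose", lambda a: a.dose_mg <= 500.0),            # assumed limit
    ("keep_doors_locked", lambda a: not a.door_unlocked),  # assumed rule
]


def assess_behaviour(actions: List[Action]) -> List[Tuple[str, str]]:
    """Return (action name, violated rule) pairs; an empty list means consistent."""
    violations = []
    for action in actions:
        for rule_name, complies in REGULATIONS:
            if not complies(action):
                violations.append((action.name, rule_name))
    return violations


def adopt_or_escalate(actions: List[Action]) -> str:
    """Adopt a consistent behaviour; otherwise defer to the Principles Layer."""
    if not assess_behaviour(actions):
        return "adopt"
    return "escalate_to_principles_layer"


learned = [Action("give_medicine", dose_mg=750.0), Action("fetch_water")]
print(assess_behaviour(learned))
print(adopt_or_escalate(learned))
```

In this sketch a behaviour that conflicts with a rule is not rejected outright; it is handed to the Principles Layer, which mirrors the role the text assigns to that layer. A realistic check would need to reason about sequences of actions and their context rather than about isolated steps.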
Suitable verification, validation, and analysis techniques: Ignoring the analysis of standard robot manipulation and movement aspects, there has been specific work on the analysis of Reactions Layer [231, 151] and Rules Layer [65, 224] issues. As we would expect, the ethical issues concerning such (potential) robots are of great interest to both philosophers and formal methods researchers [12, 26]. Interestingly, issues such as privacy and security are only gradually being considered here.
5 Future Challenges
This section looks ahead. Based on the discussion in previous sections, it defines a number of challenges. These are broken down into three areas, depending on the intended audience, i.e., who are the people who could make progress on the challenge? Specifically, we consider challenges to researchers, to engineers, and to regulators.
5.1 Research challenges
There are a number of challenges that we identify for the research community.
One group of challenges relates to the overarching question of how to extend current verification settings and assumptions to the settings and assumptions that are actually required. Specific questions include:
In domains that involve interaction with humans, especially close interaction, such as human-agent teams, there is a need for ways to elicit and model the capabilities and attitudes (e.g., goals, beliefs) of the humans.
Since autonomous systems, even in safety-critical situations, are increasingly likely to make some use of Machine Learning, how can we verify such systems? If a system is able to learn at run-time, then some sort of run-time verification will be needed, although there may also be a role for mechanisms that limit the scope of the learning in order to ensure that important desired properties cannot be violated.
How can we handle larger and more complex systems (scalability of verification methods)? This includes systems with potentially not just multiple but many agents.
Finally, how can we improve the way that we design and build systems to make them more amenable to verification? One possible approach is to use synthesis to build systems that are provably correct from verification artefacts.
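The combination of run-time verification with a mechanism that limits the scope of learning, raised in the list above, can be sketched as a monitor placed between a learned component and the actuators. This is an illustrative sketch under stated assumptions, not an established tool: the `Shield` class, the `safe_distance` property, the 10 m threshold, and the fallback action are all hypothetical, and the sketch assumes that the fallback action itself satisfies every monitored property.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, float]
Property = Callable[[State, str], bool]


def safe_distance(state: State, action: str) -> bool:
    """Hypothetical safety property: never accelerate when the gap is small."""
    return not (action == "accelerate" and state["gap_m"] < 10.0)


class Shield:
    """Run-time monitor that overrides proposed actions violating a property."""

    def __init__(self, properties: List[Property], fallback: str) -> None:
        self.properties = properties
        self.fallback = fallback  # assumed to satisfy all properties
        self.log: List[Tuple[str, str]] = []

    def filter(self, state: State, proposed: str) -> str:
        for prop in self.properties:
            if not prop(state, proposed):
                # Record the violation so that it can be audited later.
                self.log.append((prop.__name__, proposed))
                return self.fallback
        return proposed


def learned_policy(state: State) -> str:
    """Stand-in for a machine-learned controller (always wants to speed up)."""
    return "accelerate"


shield = Shield([safe_distance], fallback="brake")
for gap in (50.0, 5.0):
    state = {"gap_m": gap}
    print(gap, shield.filter(state, learned_policy(state)))
```

The learned component remains free to change at run-time, while the monitored properties stay fixed and can be verified separately. The log of overridden actions is one possible source of evidence for later analysis.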
In addition to challenges relating to extending verification to richer, and more complex, settings, there are also a number of research challenges that relate to the broader social context of verification. Perhaps most fundamentally, we need better methods for systematically identifying (i.e., deriving) properties to be verified. We have sketched (in Section 3.3) an outline of such a process, but more work is required to flesh this out, and refine it through iterated use and evaluation. One particular challenge is dealing with unusual situations that are particularly hard to anticipate. Another area that poses research challenges is the possibility of using natural language processing to create structured formal (or semi-formal) requirements from textual artefacts (which we discussed earlier, in Section 2.3.1).
There is also a big research challenge to develop ways to engineer systems that are transparent, and, in particular, systems that can explain themselves to users, regulators etc. While there is the whole subfield of explainable AI (XAI), here we highlight the particular needs of safety-critical autonomous systems – where explainability might need to focus less on explaining specific individual behaviours to users, and more on explaining how a system operates to regulators.
5.2 Engineering challenges
We now turn to engineering challenges. These are challenges that need to be addressed to practically engineer reliable autonomous systems, once the underlying research challenges have been adequately addressed.
An existing challenge that is exacerbated is the need to track assumptions and their provenance, and to manage the system as it evolves (maintenance). If requirements change, and the system is modified, how can we update the verification (as well as manuals and other documentation)?
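To make the bookkeeping behind this challenge concrete, the sketch below records which version of each assumption a verification result relied on, so that revising an assumption flags the results that need to be redone. It is a minimal illustration only: the class names, the versioning scheme, and the example assumptions are hypothetical and do not describe any existing tool.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Assumption:
    ident: str
    statement: str
    source: str       # provenance: where the assumption came from
    version: int = 1


@dataclass
class VerificationResult:
    property_name: str
    relied_on: Dict[str, int]  # assumption id -> version used in the proof


class AssumptionRegistry:
    """Tracks assumptions and the verification results that depend on them."""

    def __init__(self) -> None:
        self.assumptions: Dict[str, Assumption] = {}
        self.results: List[VerificationResult] = []

    def add(self, assumption: Assumption) -> None:
        self.assumptions[assumption.ident] = assumption

    def record(self, result: VerificationResult) -> None:
        self.results.append(result)

    def revise(self, ident: str, statement: str, source: str) -> None:
        """Change an assumption and bump its version."""
        assumption = self.assumptions[ident]
        assumption.statement = statement
        assumption.source = source
        assumption.version += 1

    def stale_results(self) -> List[str]:
        """Names of properties whose proofs used an outdated assumption."""
        return [
            r.property_name
            for r in self.results
            if any(self.assumptions[i].version != v
                   for i, v in r.relied_on.items())
        ]


registry = AssumptionRegistry()
registry.add(Assumption("A1", "Sensor range is at least 30 m", "sensor datasheet"))
registry.add(Assumption("A2", "At most 4 agents in a team", "requirements document"))
registry.record(VerificationResult("no_collision", {"A1": 1, "A2": 1}))
registry.record(VerificationResult("fair_scheduling", {"A2": 1}))
registry.revise("A1", "Sensor range is at least 20 m", "field test report")
print(registry.stale_results())
```

The same dependency information could, in principle, be used to flag manuals and other documentation that refer to a revised assumption.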
We identified research challenges relating to extending the scope and setting to deal with ethical reasoning, and to deal with machine learning. Systems that perform ethical reasoning must be built; building systems that involve learning also has engineering challenges.
A research challenge that we identified was how to model humans, for example in the context of human-agent teamwork. There is a related engineering challenge, which is how to design systems that ‘work well’ with humans. Note that this is a design challenge, not a research challenge (there is already a body of research on this, e.g., in HCI).
Another research challenge was using synthesis to create provably correct systems. A related engineering challenge is managing this process. In practical terms, even if we have technology that can generate a system (or parts of a system) from verification artefacts, there are engineering challenges in managing this process, and having a meta-system where the user clicks ‘go’ and the system is built from V&V artefacts.
We noted above a number of research challenges that relate to the broader social context. There are also engineering challenges that relate to this. The big one is linking verification of design artefacts to the broader needs of stakeholders, in particular regulators. Suppose that a software system for controlling a train is verified. How can this be used to provide regulators with a convincing safety argument? One particular aspect is trust in the process and tools: how do we know that the verification tools are themselves correct? (see the discussion of regulator challenges below)
Finally, just as there are research challenges in scaling to handle larger and more complex systems, so too there are engineering challenges that relate to verifying large and complex systems. These concern techniques for making effective use of computing resources (be they traditional high-performance computing, or cloud computing) in order to effectively verify systems, and manage this process.
Along with these challenges, we mention the difficulty that companies can have in finding qualified engineering staff.
5.3 Regulatory challenges
Finally, there are challenges for regulators. A number of challenges relate to the existence of multiple stakeholders, who may be in a complex state that combines cooperation (e.g., to create standards) and competition. Related challenges include how to manage disclosure of information that is sensitive (e.g., industrial ‘secrets’), and how to reach consensus among the stakeholders involved, e.g., car OEMs. This is particularly important if autonomous systems from different manufacturers have to collaborate, e.g., in car platooning.
More broadly, the regulatory landscape is complex, with multiple actors (governments, courts, companies, etc.), which poses challenges around how to manage differences between jurisdictions (or between regulators with overlapping domains of interest). If autonomous systems operate in different countries, they should comply with the national regulations of all the countries involved. However, the legal requirements may differ, and even contradict each other. Here, multi-national agreements and treaties are necessary.
When we consider any form of ethical reasoning done by autonomous systems, then there is a challenge in how to obtain sufficient agreement amongst various stakeholders (e.g., government, civil society, manufacturers) about the specification of what is appropriate ethical behaviour, or even a clear delineation distinguishing between behaviour that is clearly considered ethical, behaviour that is clearly considered unethical, and behaviour where there is not a clear consensus, or where it depends on the underlying ethical principles and framework.
A related point is the legal notion of responsibility, which is a question for regulators, and for society more broadly in the form of governments and legal systems; see for instance  for an introduction.
Autonomous systems have great potential to transform our world. The widespread adoption of even just one type of autonomous system, self-driving cars, will substantially alter passenger transport and the geography of cities (e.g., enhancing mobility for those who cannot drive, and reducing the need for parking spaces). More fundamentally, autonomous systems change our relationship with technology, as technology and human society reach an “intimate entanglement”. However, ensuring that autonomous systems are fit-for-purpose, especially when their function is in any way safety-critical, is crucial for their adoption. This article therefore addressed the fundamental question: “How can the reliability of an autonomous software system be ensured?”
After reviewing the state-of-the-art, including standards, existing approaches for certification, and open issues, this article proposed a way forward towards a framework for certification of safety-critical autonomous systems. We presented a three-layer framework for structuring such systems, gave an indication of what is needed from regulators, and outlined a process for identifying requirements. We reviewed a range of verification techniques, considering their applicability, strengths and weaknesses with respect to autonomous systems, and illustrated the application of our framework in seven diverse application scenarios. Finally, in order to help move towards a detailed and usable framework for certification, we articulated a range of challenges for researchers, for engineers, and for regulators.
In addition to the specific challenges discussed in the previous section, there are also a number of other questions arising from the emergence of autonomous systems. These are cross-cutting, in that they pose challenges to all of the key stakeholders (researchers, engineers, and regulators).
How to deal with tacit knowledge, e.g., where humans learn by “feel” (e.g. pilots)? Should each individual autonomous system learn, or should the knowledge evolve in a population?
How to deal with security challenges, in particular, if the autonomous system cannot be continually supervised? Should an autonomous system have “self-defence” capabilities against humans?
How to deal with Quality of Service (QoS) requirements, in particular, for a large group of autonomous systems and many users?
How to deal with varying interpretations, inconsistent requirements, and missing context?
How to deal with contradicting requirements from different stakeholders? Is there a notion of “loyalty” for autonomous systems?
How to deal with changing standards, attitudes, morals?
We hope that the initial framework, and the specific challenges, can form a map in guiding the various communities along the path towards a framework for certification of reliable autonomous systems.
We thank the organisers and participants of the Dagstuhl 19112 workshop. Thanks to Simone Ancona for the drawings in Section 1. Work of Fisher supported by the Royal Academy of Engineering and, in part, through UKRI “Robots for a Safer World” Hubs EP/R026092, EP/R026084, and EP/R026173. Work of Rozier supported in part by NSF CAREER Award CNS-1552934, NASA ECF grant NNX16AR57G, and NSF PFI:BIC grant CNS-1257011.
-  24me Company. 24me Smart Personal Assistant.
-  A. Abate, J.-P. Katoen, and A. Mereacre. Quantitative Automata Model Checking of Autonomous Stochastic Hybrid Systems. In Proceedings of the 14th ACM International Conference on Hybrid Systems: Computation and Control (HSCC), pages 83–92. ACM, 2011.
-  F.-M. Adolf, P. Faymonville, B. Finkbeiner, S. Schirmer, and C. Torens. Stream runtime monitoring on UAS. In Proceedings of International Conference on Runtime Verification, pages 33–49, 2017.
-  R. Alexander, M. Hall-May, and T. Kelly. Certification of autonomous systems. In Proceedings of 2nd Systems Engineering for Autonomous Systems (SEAS) Defence Technology Centre (DTC) Annual Technical Conference, 2007.
-  R. Alur, T. A. Henzinger, G. Lafferriere, and G. J. Pappas. Discrete abstractions of hybrid systems. Proceedings of the IEEE, 88(7):971–984, 2000.
-  G. V. Alves, L. Dennis, L. Fernandes, and M. Fisher. Reliable Decision-Making in Autonomous Vehicles. In A. Leitner, D. Watzenig, and J. Ibanez-Guzman, editors, Validation and Verification of Automated Systems: Results of the ENABLE-S3 Project, pages 105–117. Springer International Publishing, Cham, 2020.
-  F. Amirabdollahian, K. Dautenhahn, C. Dixon, K. Eder, M. Fisher, K. L. Koay, E. Magid, A. Pipe, M. Salem, J. Saunders, and M. Webster. Can You Trust Your Robotic Assistant? In International Conference on Social Robotics, volume 8239 of LNCS, pages 571–573. Springer, 2013.
-  D. Ancona, A. Ferrando, and V. Mascardi. Comparing trace expressions and Linear Temporal Logic for runtime verification. In Theory and Practice of Formal Methods, volume 9660 of Lecture Notes in Computer Science, pages 47–64. Springer, 2016.
-  D. Ancona, A. Ferrando, and V. Mascardi. Parametric runtime verification of multiagent systems. In AAMAS, pages 1457–1459. ACM, 2017.
-  M. Anderson and S. L. Anderson. EthEl: Toward a principled ethical eldercare robot. In Proc. AAAI Fall Symposium on AI in Eldercare: New Solutions to Old Problems, 2008.
-  M. Anderson and S. L. Anderson. Machine Ethics. Cambridge University Press, 2011.
-  K. Appel and W. Haken. Every Planar Map is Four-Colorable, volume 98 of Contemporary Mathematics. American Mathematical Society, Providence, RI, 1989.
-  P. Arcaini, S. Bonfanti, A. Gargantini, A. Mashkoor, and E. Riccobene. Formal validation and verification of a medical software critical component. In 13th ACM/IEEE International Conference on Formal Methods and Models for Codesign, MEMOCODE 2015, Austin, TX, USA, September 21-23, 2015, pages 80–89. IEEE, 2015.
-  C. Areias, J. C. Cunha, D. Iacono, and F. Rossi. Towards certification of automotive software. In Proceedings of 25th IEEE International Symposium on Software Reliability Engineering Workshops ISSRE, pages 491–496, 2014.
-  R. C. Arkin. Governing lethal behavior: embedding ethics in a hybrid deliberative/reactive robot architecture. In Proceedings of 3rd ACM/IEEE international conference on Human Robot Interaction (HRI’08), pages 121–128, 2008.
-  AV-TEST Institute. Robot vacuums undergo a security check: trustworthy helpers around the house or chatty cleaning appliances?, 2019.
-  C. Baier and J.-P. Katoen. Principles of Model Checking. MIT Press, 2008.
-  W. Bao, J. Yue, and Y. Rao. PLOS One, 12(7):e0180944, 2017.
-  E. Bartocci, L. Bortolussi, T. Brázdil, D. Milios, and G. Sanguinetti. Policy learning in continuous-time Markov decision processes using Gaussian processes. Perform. Eval., 116:84–100, 2017.
-  D. A. Basin, F. Klaedtke, S. Müller, and B. Pfitzmann. Runtime monitoring of metric first-order temporal properties. In Proceedings of 28th IARCS Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS’08), pages 49–60, 2008.
-  B. Bauer, J. P. Müller, and J. Odell. Agent UML: A formalism for specifying multiagent software systems. In P. Ciancarini and M. J. Wooldridge, editors, Agent-Oriented Software Engineering, First International Workshop, AOSE 2000, Limerick, Ireland, June 10, 2000, Revised Papers, volume 1957 of Lecture Notes in Computer Science, pages 91–104. Springer, 2000.
-  K. Beck. Test-driven development: by example. Addison-Wesley Professional, 2003.
-  M. Beedle, A. van Bennekum, A. Cockburn, W. Cunningham, M. Fowler, J. Highsmith, A. Hunt, R. Jeffries, J. Kern, B. Marick, R. C. Martin, K. Schwaber, J. Sutherland, and D. Thomas. Manifesto for agile software development, 2001.
-  S. Bensalem, V. Ganesh, Y. Lakhnech, C. Munoz, S. Owre, H. Rueß, J. Rushby, V. Rusu, H. Saıdi, N. Shankar, et al. An overview of SAL. In Proceedings of 5th NASA Langley Formal Methods Workshop. Williamsburg, VA, 2000.
-  M. Bentzen, F. Lindner, L. Dennis, and M. Fisher. Moral Permissibility of Actions in Smart Home Systems. In Proceedings of FLoC 2018 Workshop on Robots, Morality, and Trust through the Verification Lens, 2018.
-  T. Benzel. Analysis of a kernel verification. In Proceedings of 1984 IEEE Symposium on Security and Privacy, pages 125–133, 1984.
-  C. Bergenhem, Q. Huang, A. Benmimoun, and T. Robinson. Challenges of platooning on public motorways. In Proceedings of 17th World Congress on Intelligent Transport Systems, pages 1–12, 2010.
-  P. M. Berry, M. T. Gervasio, B. Peintner, and N. Yorke-Smith. PTIME: personalized assistance for calendaring. ACM Trans. Intelligent Systems and Technology, 2(4):40:1–40:22, 2011.
-  A. Bertolino. Software testing research: Achievements, challenges, dreams. In L. C. Briand and A. L. Wolf, editors, International Conference on Software Engineering, ISCE 2007, Workshop on the Future of Software Engineering, FOSE 2007, May 23-25, 2007, Minneapolis, MN, USA, pages 85–103. IEEE Computer Society, 2007.
-  A. Biere, K. Heljanko, and S. Wieringa. AIGER 1.9 and beyond. Available at fmv.jku.at/hwmcc11/beyond1.pdf, 2011.
-  D. Birnbacher and W. Birnbacher. Fully autonomous driving: Where technology and ethics meet. IEEE Intelligent Systems, 32(5):3–4, 2017.
-  R. Bloomfield and P. Bishop. Safety and assurance cases: Past, present and possible future – an adelard perspective. In C. Dale and T. Anderson, editors, Making Systems Safer, pages 51–67, London, UK, 2010. Springer.
-  G. Booch, J. Rumbaugh, and I. Jacobson. The Unified Modeling Language User Guide. Addison Wesley Longman Publishing Co., Inc., Redwood City, CA, USA, 1999.
-  R. H. Bordini, M. Fisher, C. Pardavila, and M. J. Wooldridge. Model checking AgentSpeak. In The Second International Joint Conference on Autonomous Agents & Multiagent Systems, AAMAS 2003, July 14-18, 2003, Melbourne, Victoria, Australia, Proceedings, pages 409–416. ACM, 2003.
-  R. H. Bordini, M. Fisher, W. Visser, and M. J. Wooldridge. Model checking rational agents. IEEE Intelligent Systems, 19(5):46–52, 2004.
-  R. H. Bordini, M. Fisher, W. Visser, and M. J. Wooldridge. Verifying multi-agent programs by model checking. Autonomous Agents and Multi-Agent Systems, 12(2):239–256, 2006.
-  M. Bozzano, A. Cimatti, A. F. Pires, D. Jones, G. Kimberly, T. Petri, R. Robinson, and S. Tonetta. Formal design and safety analysis of AIR6110 wheel brake system. In International Conference on Computer Aided Verification, pages 518–535. Springer, 2015.
-  G. Brat, J. A. Navas, N. Shi, and A. Venet. IKOS: A framework for static analysis based on abstract interpretation. In International Conference on Software Engineering and Formal Methods, pages 271–277. Springer, 2014.
-  G. Brat and A. Venet. Precise and scalable static program analysis of NASA flight software. In 2005 IEEE Aerospace Conference, pages 1–10. IEEE, 2005.
-  P. Bremner, L. A. Dennis, M. Fisher, and A. F. T. Winfield. On Proactive, Transparent, and Verifiable Ethical Reasoning for Robots. Proceedings of the IEEE, 107(3):541–561, 2019.
-  S. Bringsjord, K. Arkoudas, and P. Bello. Toward a General Logicist Methodology for Engineering Ethically Correct Robots. IEEE Intelligent Systems, 21(4):38–44, 2006.
-  British Standards Institution. BSI web site.
-  British Standards Institution (BSI). BS 8611 – robots and robotic devices — guide to the ethical design and application, 2016.
-  R. Butler. An introduction to requirements capture using PVS: specification of a simple autopilot. Technical report, NASA Langley Technical Report Server, 1996.
-  Cambridge Academic Content Dictionary. Definition of ‘certification’.
-  Cambridge Business English Dictionary. Definition of ‘certification’.
-  Cambridge English Dictionary. Definition of ‘regulation’.
-  CENELEC. CENELEC - EN 50128 – railway applications - communication, signalling and processing systems - software for railway control and protection systems, 2011.
-  V. Charisi, L. Dennis, M. Fisher, R. Lieck, A. Matthias, M. Slavkovik, J. Sombetzki, A. F. T. Winfield, and R. Yampolskiy. Towards moral autonomous systems. ArXiv e-prints, Mar. 2017.
-  E. M. Clarke, O. Grumberg, and D. A. Peled. Model Checking. The MIT Press, 2000.
-  E. M. Clarke and B.-H. Schlingloff. In A. Robinson and A. Voronkov, editors, Handbook of Automated Reasoning, pages 1635–1790. Elsevier and MIT Press, 2001.
-  D. D. Cofer, J. Hatcliff, M. Huhn, and M. Lawford. Software certification: Methods and tools (Dagstuhl seminar 13051). Dagstuhl Reports, 3(1):111–148, 2013.
-  P. Cousot, R. Cousot, J. Feret, A. Miné, X. Rival, B. Blanchet, D. Monniaux, and L. Mauborgne. Astrée.
-  P. A. Currit, M. G. Dyer, and H. D. Mills. Certifying the reliability of software. IEEE Trans. Software Eng., 12(1):3–11, 1986.
-  R. P. d. Araújo, A. C. Mota, and S. d. C. Nogueira. Probabilistic analysis applied to cleaning robots. In 2017 IEEE International Conference on Information Reuse and Integration (IRI), pages 275–282, Aug 2017.
-  F. Dalpiaz, A. Ferrari, X. Franch, and C. Palomares. Natural language processing for requirements engineering: The best is yet to come. IEEE Software, 35(5):115–119, 2018.
-  M. Dastani, P. Torroni, and N. Yorke-Smith. Monitoring norms: A multi-disciplinary perspective. Knowledge Eng. Review, 33:e25, 2018.
-  D. C. Dennett. The Intentional Stance. MIT Press, Cambridge, MA, USA, 1989.
-  E. Denney and G. Pai. Tool support for assurance case development. Autom. Softw. Eng., 25(3):435–499, 2018.
-  L. A. Dennis. The MCAPL Framework including the Agent Infrastructure Layer and Agent Java Pathfinder. The Journal of Open Source Software, 3(24), 2018.
-  L. A. Dennis, M. Fisher, N. K. Lincoln, A. Lisitsa, and S. M. Veres. Practical Verification of Decision-Making in Agent-Based Autonomous Systems. Automated Software Engineering, 23(3):305–359, 2016.
-  L. A. Dennis, M. Fisher, M. Slavkovik, and M. Webster. Formal Verification of Ethical Choices in Autonomous Systems. Robotics and Autonomous Systems, 77:1–14, 2016.
-  L. A. Dennis, M. Fisher, M. Webster, and R. H. Bordini. Model Checking Agent Programming Languages. Automated Software Engineering, 19(1):5–63, 2012.
-  C. Dixon, M. Webster, J. Saunders, M. Fisher, and K. Dautenhahn. ‘The Fridge Door is Open’ — Temporal Verification of a Robotic Assistant’s Behaviours. In Advances in Autonomous Robotics Systems (TAROS), volume 8717 of Lecture Notes in Computer Science, pages 97–108. Springer, 2014.
-  S. C. Dutilleul, T. Lecomte, and A. B. Romanovsky, editors. Proceedings of 3rd International Conference on Reliability, Safety, and Security of Railway Systems (RSSRail’19), volume 11495 of Lecture Notes in Computer Science. Springer, 2019.
-  S. Edelkamp, S. Leue, and A. Lluch-Lafuente. Directed explicit-state model checking in the validation of communication protocols. International journal on software tools for technology transfer, 5(2-3):247–267, 2004.
-  E. A. Emerson. Temporal and modal logic. In Formal Models and Semantics, pages 995–1072. Elsevier, 1990.
-  B. Espejo-García, J. Martínez-Guanter, M. Pérez-Ruiz, F. J. López-Pellicer, and F. J. Zarazaga-Soria. Machine learning for automatic rule classification of agricultural regulations: A case study in Spain. Computers and Electronics in Agriculture, 150:343–352, 2018.
-  European Committee for Electrotechnical Standardisation. CENELEC web site.
-  European Parliament. Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation), 2016.
-  European Union Aviation Safety Agency. EASA web site.
-  FAA. Qantas flight 32, Airbus A380-842, VH-OQA. Online: https://lessonslearned.faa.gov/ll_main.cfm?TabID=1&LLID=83, November 2010.
-  Y. Falcone, S. Krstic, G. Reger, and D. Traytel. A taxonomy for classifying runtime verification tools. In C. Colombo and M. Leucker, editors, Runtime Verification - 18th International Conference, RV 2018, Limassol, Cyprus, November 10-13, 2018, Proceedings, volume 11237 of Lecture Notes in Computer Science, pages 241–262. Springer, 2018.
-  M. Farrell, M. Luckcuck, and M. Fisher. Robotics and Integrated Formal Methods: Necessity Meets Opportunity. In Proceedings of 14th International Conference on Integrated Formal Methods (IFM’18), volume LNCS 11023, pages 161–171. Springer, 2018.
-  Federal Aviation Administration. FAA web site.
-  Federal Aviation Administration. Title 14 code of Federal Regulations Part 145 approved training program – research and recommendations, 2004.
-  Federal Aviation Administration. Part 107: Operation and certification of small unmanned aircraft systems, 2016.
-  A. Ferrando, D. Ancona, and V. Mascardi. Decentralizing MAS monitoring with DecAMon. In AAMAS, pages 239–248. ACM, 2017.
-  A. Ferrando, L. A. Dennis, D. Ancona, M. Fisher, and V. Mascardi. Recognising Assumption Violations in Autonomous Systems Verification. In AAMAS, pages 1933–1935. International Foundation for Autonomous Agents and Multiagent Systems Richland, SC, USA / ACM, 2018.
-  A. Ferrando, L. A. Dennis, D. Ancona, M. Fisher, and V. Mascardi. Verifying and Validating Autonomous Systems: Towards an Integrated Approach. In RV, volume 11237 of Lecture Notes in Computer Science, pages 263–281. Springer, 2018.
-  FINRA. Algorithmic trading: Rules. https://www.finra.org/rules-guidance/key-topics/algorithmic-trading#rules. Accessed 2019-10-15.
-  M. Fisher, L. A. Dennis, and M. P. Webster. Verifying Autonomous Systems. Communications of the ACM, 56(9):84–93, 2013.
-  C. Frauenberger and P. Purgathofer. Ways of thinking in informatics. Communications of the ACM, 62(7):58–64, 2019.
-  FreeBSD. lint – a c program verifier.
-  A. L. Galdino, C. Munoz, and M. Ayala-Rincón. Formal verification of an optimal air traffic conflict resolution and recovery algorithm. In International Workshop on Logic, Language, Information, and Computation, pages 177–188. Springer, 2007.
-  M. Gario, A. Cimatti, C. Mattarei, S. Tonetta, and K. Y. Rozier. Model checking at scale: Automated air traffic control design space exploration. In Proceedings of 28th International Conference on Computer Aided Verification (CAV 2016), volume 9780 of LNCS, pages 3–22, Toronto, ON, Canada, July 2016. Springer.
-  J. Geist, K. Y. Rozier, and J. Schumann. Runtime Observer Pairs and Bayesian Network Reasoners On-board FPGAs: Flight-Certifiable System Health Management for Embedded Systems. In Proceedings of the 14th International Conference on Runtime Verification (RV14), volume 8734, pages 215–230. Springer-Verlag, September 2014.
-  S. Ghosh, D. Elenius, W. Li, P. Lincoln, N. Shankar, and W. Steiner. ARSENAL: automatic requirements specification extraction from natural language. In S. Rayadurgam and O. Tkachuk, editors, NASA Formal Methods - 8th International Symposium, NFM 2016, Minneapolis, MN, USA, June 7-9, 2016, Proceedings, volume 9690 of Lecture Notes in Computer Science, pages 41–46. Springer, 2016.
-  D. Gunkel and J. J. Bryson. Introduction to the special issue on machine morality: The machine as moral agent and patient. Philosophy & Technology, 27(1):5–8, Mar. 2014.
-  K. Havelund and G. Reger. Runtime verification logics: A language design perspective. In Models, Algorithms, Logics and Tools - Essays Dedicated to Kim Guldstrand Larsen on the Occasion of His 60th Birthday, volume 10460 of Lecture Notes in Computer Science, pages 310–338. Springer, 2017.
-  C. L. Heitmeyer. On the role of formal methods in software certification: An experience report. Electronic Notes on Theoretical Computer Science, 238(4):3–9, 2009.
-  C. L. Heitmeyer, M. Archer, E. I. Leonard, and J. McLean. Applying formal methods to a certifiably secure software system. IEEE Trans. Software Eng., 34(1):82–98, 2008.
-  P. Helle, W. Schamai, and C. Strobel. Testing of autonomous systems – challenges and current state-of-the-art. In INCOSE International Symposium, volume 26-1, pages 571–584. Wiley Online Library, 2016.
-  T. A. Henzinger, P.-H. Ho, and H. Wong-Toi. HYTECH: A Model Checker for Hybrid Systems. International Journal on Software Tools for Technology Transfer, 1(1-2):110–122, 1997.
-  C. A. R. Hoare. An Axiomatic Basis for Computer Programming. Commun. ACM, 12(10):576–580, Oct. 1969.
-  K. Hodgkins. Apple’s Knowledge Navigator, Siri and the iPhone 4S. Engadget, 5 Oct. 2011.
-  G. J. Holzmann. The Spin Model Checker: Primer and Reference Manual. Addison-Wesley, 2003.
-  M. N. Huhns and M. P. Singh. Agents on the web: Personal assistants. IEEE Internet Computing, 2(5):90–92, 1998.
-  Industry Research. Software testing services market by product, end-users, and geography – global forecast and analysis 2019-2023, 2019.
-  Institute of Electrical and Electronics Engineers. The IEEE global initiative on ethics of autonomous and intelligent systems.
-  Institute of Electrical and Electronics Engineers. IEEE web site.
-  Institute of Electrical and Electronics Engineers. IEEE 1512-2006 – IEEE standard for common incident management message sets for use by emergency management centers, 2006.
-  Institute of Electrical and Electronics Engineers. IEEE standard ontologies for robotics and automation, 2015.
-  Institute of Electrical and Electronics Engineers. P2020 – standard for automotive system image quality, 2016.
-  Institute of Electrical and Electronics Engineers. P7000 – model process for addressing ethical concerns during system design, 2016.
-  Institute of Electrical and Electronics Engineers. P7001 – transparency of autonomous systems, 2016.
-  Institute of Electrical and Electronics Engineers. P7002 – data privacy process, 2016.
-  Institute of Electrical and Electronics Engineers. P7003 – algorithmic bias considerations, 2017.
-  Institute of Electrical and Electronics Engineers. P7006 – standard for personal data artificial intelligence (AI) agent, 2017.
-  Institute of Electrical and Electronics Engineers. P7007 – ontological standard for ethically driven robotics and automation systems, 2017.
-  Institute of Electrical and Electronics Engineers. P7008 – standard for ethically driven nudging for robotic, intelligent and autonomous systems, 2017.
-  Institute of Electrical and Electronics Engineers. P7009 – standard for fail-safe design of autonomous and semi-autonomous systems, 2017.
-  International Association of Public Transport – L’Union internationale des transports publics. UITP web site.
-  International Civil Aviation Organization. Annex 11 to the convention on international civil aviation, thirteenth edition, 2001.
-  International Electrotechnical Commission. IEC TC 107 – process management for avionics.
-  International Electrotechnical Commission. IEC TC 97 – electrical installations for lighting and beaconing of aerodromes.
-  International Electrotechnical Commission. IEC web site.
-  International Electrotechnical Commission. IEC 62278 – railway applications - specification and demonstration of reliability, availability, maintainability and safety (RAMS), 2002.
-  International Electrotechnical Commission. Functional safety and IEC 61508, 2010.
-  International Electrotechnical Commission. IEC 62278-3 – railway applications - specification and demonstration of reliability, availability, maintainability and safety (RAMS) - Part 3: Guide to the application of IEC 62278 for rolling stock RAM, 2010.
-  International Electrotechnical Commission. IEC 62278-4 – railway applications - specification and demonstration of reliability, availability, maintainability and safety (RAMS) - Part 4: RAM risk and RAM life cycle aspects, 2016.
-  International Electrotechnical Commission. IEC TC 69 – electric road vehicles and electric industrial trucks, 2017.
-  International Electrotechnical Commission. IEC TC 9 – electrical equipment and systems for railways, 2017.
-  International Electrotechnical Commission. IEC TR 60601-4-1 – medical electrical equipment – part 4-1: Guidance and interpretation - medical electrical equipment and medical electrical systems employing a degree of autonomy, 2017.
-  International Electrotechnical Commission. IEC 63243 ED1 – interoperability and safety of dynamic wireless power transfer (WPT) for electric vehicles, 2019.
-  International Organization for Standardization. ISO web site.
-  International Organization for Standardization. ISO/TC 20 – Aircraft and space vehicles, 1947.
-  International Organization for Standardization. ISO/TC 76 – transfusion, infusion and injection, and blood processing equipment for medical and pharmaceutical use, 1951.
-  International Organization for Standardization. ISO/TC 194 – biological and clinical evaluation of medical devices, 1988.
-  International Organization for Standardization. ISO/TC 210 – quality management and corresponding general aspects for medical devices, 1994.
-  International Organization for Standardization. ISO/TC 215 – health informatics, 1998.
-  International Organization for Standardization. ISO 21500 – guidance on project management, 2012.
-  International Organization for Standardization. ISO/TC 269 – railway applications, 2012.
-  International Organization for Standardization. ISO 13482 – robots and robotic devices – safety requirements for personal care robots, 2014.
-  International Organization for Standardization. ISO/TC 299 – robotics, 2015.
-  International Organization for Standardization. ISO and road vehicles, 2016.
-  International Organization for Standardization. ISO 21245 – railway applications – railway project planning process – guidance on railway project planning, 2018.
-  International Organization for Standardization. ISO 26262-1 – road vehicles – functional safety, 2018.
-  International Organization for Standardization. ISO and health, 2019.
-  International Organization for Standardization (ISO). ISO 13482 – robots and robotic devices – safety requirements for personal care robots, 2014.
-  International Organization for Standardization (ISO). ISO/TS 15066 – robots and robotic devices – collaborative robots, 2016.
-  International Organization for Standardization (ISO). ISO/TR 20218-2 – robotics – safety design for industrial robot systems – part 2: Manual load/unload stations, 2017.
-  International Organization for Standardization (ISO). ISO/TR 23482-2 – robotics – application of ISO 13482 – part 2: Application guidelines, 2017.
-  International Organization for Standardization (ISO). ISO/TR 20218-1 – robotics – safety design for industrial robot systems – part 1: End-effectors, 2018.
-  International union of railways – Union Internationale des Chemins de fer. UIC web site.
-  O. A. Jasim and S. M. Veres. Towards formal proofs of feedback control theory. In Proc. 21st International Conference on System Theory, Control and Computing (ICSTCC), pages 43–48, Oct 2017.
-  N. R. Jennings, K. P. Sycara, and M. Wooldridge. A roadmap of agent research and development. Autonomous Agents and Multi-Agent Systems, 1(1):7–38, 1998.
-  J. Jovinelly and J. Netelkos. The Crafts And Culture of a Medieval Guild. Rosen Publishing, New York, NY, 2006.
-  A. Julius and G. Pappas. Approximations of stochastic hybrid systems. IEEE Transactions on Automatic Control, 54(6):1193–1203, 2009.
-  S. G. Khan, G. Herrmann, A. G. Pipe, C. Melhuish, and A. Spiers. Safe Adaptive Compliance Control of a Humanoid Robotic Arm with Anti-Windup Compensation and Posture Control. Int. J. Social Robotics, 2(3):305–319, 2010.
-  J. C. Knight. Safety critical systems: challenges and directions. In W. Tracz, M. Young, and J. Magee, editors, Proceedings of the 24th International Conference on Software Engineering, ICSE 2002, 19-25 May 2002, Orlando, Florida, USA, pages 547–550. ACM, 2002.
-  L. Kohlberg. Stage and sequence: The cognitive-developmental approach to socialization. In D. Goslin, editor, Handbook of Socialization Theory and Research, pages 347–480. Rand McNally, 1969.
-  L. Kohlberg. Essays on Moral Development. Volume I: The philosophy of moral development. Harper & Row, 1981.
-  L. Kohlberg. Essays on Moral Development. Volume II: The psychology of moral development: the nature and validity of moral stages. Harper & Row, 1984.
-  J. Kong and A. Lomuscio. Symbolic model checking multi-agent systems against CTL*K specifications. In K. Larson, M. Winikoff, S. Das, and E. H. Durfee, editors, Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, AAMAS 2017, São Paulo, Brazil, May 8-12, 2017, pages 114–122. ACM, 2017.
-  J. Kong and A. Lomuscio. Model checking multi-agent systems against LDLK specifications on finite traces. In E. André, S. Koenig, M. Dastani, and G. Sukthankar, editors, Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS 2018, Stockholm, Sweden, July 10-15, 2018, pages 166–174. International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, USA / ACM, 2018.
-  N. G. Leveson and C. S. Turner. An investigation of the Therac-25 accidents. Computer, 26(7):18–41, July 1993.
-  D. M. Levine. A day in the quiet life of a NYSE floor trader. Fortune, 23 May 2013.
-  J. Li and K. Y. Rozier. MLTL Benchmark Generation via Formula Progression. In Proceedings of the 18th International Conference on Runtime Verification (RV18), Limassol, Cyprus, November 2018. Springer-Verlag.
-  J. Li, M. Vardi, and K. Y. Rozier. Satisfiability checking for mission-time LTL. In Proceedings of 31st International Conference on Computer Aided Verification (CAV’19), LNCS. Springer, July 2019.
-  A. Lomuscio and F. Raimondi. Model checking knowledge, strategies, and games in multi-agent systems. In H. Nakashima, M. P. Wellman, G. Weiss, and P. Stone, editors, 5th International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS 2006), Hakodate, Japan, May 8-12, 2006, pages 161–168. ACM, 2006.
-  M. Luckcuck, M. Farrell, L. Dennis, C. Dixon, and M. Fisher. Formal Specification and Verification of Autonomous Robotic Systems: A Survey. ACM Computing Surveys, 52(5):100:1–100:41, 2019.
-  K. S. Luckow and C. S. Păsăreanu. Symbolic pathfinder v7. ACM SIGSOFT Software Engineering Notes, 39(1):1–5, 2014.
-  F. M. Maggi, M. Montali, M. Westergaard, and W. M. P. van der Aalst. Monitoring business constraints with linear temporal logic: An approach based on colored automata. In S. Rinderle-Ma, F. Toumani, and K. Wolf, editors, Proceedings of 9th International Conference on Business Process Management (BPM’11), volume 6896 of LNCS, pages 132–147. Springer, 2011.
-  B. Marr. The biggest challenges facing artificial intelligence (AI) in business and society. Forbes, 2017.
-  MathWorks. Polyspace bug finder.
-  C. Mattarei, A. Cimatti, M. Gario, S. Tonetta, and K. Y. Rozier. Comparing different functional allocations in automated air traffic control design. In Proceedings of Formal Methods in Computer-Aided Design (FMCAD 2015), Austin, Texas, U.S.A, September 2015. IEEE/ACM.
-  A. Matthias. Robot lies in health care: when is deception morally permissible? Kennedy Institute of Ethics Journal, 25(2):279–301, 2011.
-  K. L. McMillan. The SMV language. Cadence Berkeley Labs, pages 1–49, 1999.
-  Merriam-Webster Dictionary. Definition of ‘reliable’.
-  P. Moosbrugger, K. Y. Rozier, and J. Schumann. R2U2: Monitoring and Diagnosis of Security Threats for Unmanned Aerial Systems. In Formal Methods in System Design (FMSD), pages 1–31. Springer-Verlag, April 2017.
-  C. Muñoz, A. Narkawicz, and J. Chamberlain. A TCAS-II resolution advisory detection algorithm. In AIAA Guidance, Navigation, and Control (GNC) Conference, page 4622, 2013.
-  C. Muñoz, A. Narkawicz, G. Hagen, J. Upchurch, A. Dutle, M. Consiglio, and J. Chamberlain. Daidalus: detect and avoid alerting logic for unmanned systems. In 2015 IEEE/AIAA 34th Digital Avionics Systems Conference (DASC), pages 5A1–1. IEEE, 2015.
-  M. Musuvathi, D. R. Engler, et al. Model checking large network protocol implementations. In NSDI, volume 4, pages 12–12, 2004.
-  N. Negroponte. Being Digital. Random House, New York, NY, USA, 1996.
-  C. D. Nguyen, A. Perini, C. Bernon, J. Pavón, and J. Thangarajah. Testing in multi-agent systems. In M. P. Gleizes and J. J. Gómez-Sanz, editors, Agent-Oriented Software Engineering X - 10th International Workshop, AOSE 2009, Budapest, Hungary, May 11-12, 2009, Revised Selected Papers, volume 6038 of Lecture Notes in Computer Science, pages 180–190. Springer, 2009.
-  C. Patchett, M. Jump, and M. Fisher. Safety and Certification of Unmanned Air Systems. In Engineering and Technology Reference. Institution of Engineering and Technology, 2015.
-  L. C. Paulson. A Generic Theorem Prover, volume 828 of Lecture Notes in Computer Science. Springer, 1994.
-  W. Penczek and A. Lomuscio. Verifying epistemic properties of multi-agent systems via bounded model checking. In The Second International Joint Conference on Autonomous Agents & Multiagent Systems, AAMAS 2003, July 14-18, 2003, Melbourne, Victoria, Australia, Proceedings, pages 209–216. ACM, 2003.
-  R. Pietrantuono and S. Russo. Robotics software engineering and certification: Issues and challenges. In S. Ghosh, R. Natella, B. Cukic, R. Poston, and N. Laranjeiro, editors, 2018 IEEE International Symposium on Software Reliability Engineering Workshops, ISSRE Workshops, Memphis, TN, USA, October 15-18, 2018, pages 308–312. IEEE Computer Society, 2018.
-  E. Pietronudo. “Japanese women’s language” and artificial intelligence: Azuma Hikari, gender stereotypes and gender norms, 2018. http://hdl.handle.net/10579/12791.
-  L. Pike. Modeling time-triggered protocols and verifying their real-time schedules. In Formal Methods in Computer Aided Design (FMCAD’07), pages 231–238. IEEE, 2007.
-  Current State of EU Legislation – Cooperative Dynamic Formation of Platoons for Safe and Energy-optimized Goods Transportation.
-  A. Platzer. Logical Analysis of Hybrid Systems: Proving Theorems for Complex Dynamics. Springer, Heidelberg, 2010.
-  A. Platzer and J.-D. Quesel. KeyMaera: A Hybrid Theorem Prover for Hybrid Systems. In A. Armando, P. Baumgartner, and G. Dowek, editors, Proceedings of 4th International Joint Conference on Automated Reasoning (IJCAR), volume 5195 of LNCS, pages 171–178. Springer, 2008.
-  J. H. Poore, H. D. Mills, and D. Mutchler. Planning and certifying software system reliability. IEEE Software, 10(1):88–99, 1993.
-  Retrospective semi-automated software feature extraction from natural language user manuals. PhD thesis, University of Heidelberg, Germany, 2018.
-  Radio Technical Commission for Aeronautics. RTCA web site.
-  Radio Technical Commission for Aeronautics. DO-178B – software considerations in airborne systems and equipment certification, 1992.
-  Radio Technical Commission for Aeronautics. DO-278A – software integrity assurance considerations for communication, navigation, surveillance and air traffic management (CNS/ATM) systems, 1992.
-  Radio Technical Commission for Aeronautics. DO-254 – design assurance guidance for airborne electronic hardware, 2000.
-  Radio Technical Commission for Aeronautics. DO-333 – formal methods supplement to DO-178C and DO-278A, 2011.
-  Radio Technical Commission for Aeronautics. DO-178C/ED-12C – software considerations in airborne systems and equipment certification, 2012.
-  B. Ramesh and M. Jarke. Toward reference models for requirements traceability. IEEE Transactions on Software Engineering, 27(1):58–93, Jan. 2001.
-  T. Reinbacher, K. Y. Rozier, and J. Schumann. Temporal-logic based runtime observer pairs for system health management of real-time systems. In Proceedings of 20th International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS’14), volume 8413 of LNCS, pages 357–372. Springer, 2014.
-  D. J. Rinehart, J. C. Knight, and J. Rowanhill. Understanding what it means for assurance cases to “work”. Technical report, NASA, 2017. NASA/CR–2017-219582.
-  G. Rosu. On safety properties and their monitoring. Sci. Ann. Comp. Sci., 22(2):327–365, 2012.
-  K. Rozier and M. Vardi. LTL satisfiability checking. International Journal on Software Tools for Technology Transfer (STTT), 12(2):123–137, March 2010.
-  K. Y. Rozier. Linear Temporal Logic Symbolic Model Checking. Computer Science Review Journal, 5(2):163–203, 2011.
-  K. Y. Rozier. Specification: The biggest bottleneck in formal methods and autonomy. In Proceedings of 8th Working Conference on Verified Software: Theories, Tools, and Experiments (VSTTE’16), volume 9971 of LNCS, pages 1–19. Springer, 2016.
-  K. Y. Rozier. From simulation to runtime verification and back: Connecting single-run verification techniques. In Proceedings of the Spring Simulation Conference (SpringSim), pages 1–10, Tucson, AZ, USA, April 2019. Society for Modeling & Simulation International.
-  K. Y. Rozier and J. Schumann. R2U2: Tool overview. In Proceedings of International Workshop on Competitions, Usability, Benchmarks, Evaluation, and Standardisation for Runtime Verification Tools (RV-CUBES), pages 138–156, 2017.
-  SAE International. SAE J3016_201806 – taxonomy and definitions for terms related to driving automation systems for on-road motor vehicles, 2018.
-  M. Salem, G. Lakatos, F. Amirabdollahian, and K. Dautenhahn. Towards Safe and Trustworthy Social Robots: Ethical Challenges and Practical Issues. In Proc. 7th International Conference on Social Robotics (ICSR), volume 9388 of LNCS, pages 584–593. Springer, 2015.
-  M. Salem, G. Lakatos, F. Amirabdollahian, and K. Dautenhahn. Would you trust a (faulty) robot?: Effects of error, task type and personality on human-robot cooperation and trust. In Proceedings of 10th ACM/IEEE International Conference on Human-Robot Interaction, HRI 2015, Portland, OR, USA, March 2-5, 2015, pages 141–148. ACM, 2015.
-  SARTRE project.
-  J. Schumann, P. Moosbrugger, and K. Y. Rozier. Runtime Analysis with R2U2: A Tool Exhibition Report. In Proceedings of the 16th International Conference on Runtime Verification (RV16), Madrid, Spain, September 2016. Springer-Verlag.
-  C. Scrapper, S. Balakirsky, and E. Messina. MOAST and USARSim: a combined framework for the development and testing of autonomous systems. In Unmanned Systems Technology VIII, volume 6230, page 62301T. International Society for Optics and Photonics, 2006.
-  SCSC - The Safety-Critical Systems Club. SCSC – goal structuring notation community standard (version 2).
-  N. Shankar. Trust and automation in verification tools. In Automated Technology for Verification and Analysis, 6th International Symposium, ATVA 2008, Seoul, Korea, October 20-23, 2008. Proceedings, pages 4–17, 2008.
-  D. Stout. Stone toolmaking and the evolution of human culture and cognition. Philosophical Transactions of the Royal Society B: Biological Sciences, 366(1567):1050–1059, 2011.
-  D. Swaroop. String stability of interconnected systems: An application to platooning in automated highway systems. California Partners for Advanced Transit and Highways (PATH), 1997.
-  D. Tabakov, K. Y. Rozier, and M. Y. Vardi. Optimized temporal monitors for SystemC. Formal Methods in System Design, 41(3):236–268, January 2012.
-  The European Parliament. Regulation (EU) 2018/1139 of the European Parliament and of the Council of 4 July 2018 on common rules in the field of civil aviation and establishing a European Union Aviation Safety Agency, and amending Regulations (EC) No 2111/2005, (EC) No 1008/2008, (EU) No 996/2010, (EU) No 376/2014 and Directives 2014/30/EU and 2014/53/EU of the European Parliament and of the Council, and repealing Regulations (EC) No 552/2004 and (EC) No 216/2008 of the European Parliament and of the Council and Council Regulation (EEC) No 3922/91 (Text with EEA relevance), 2018.
-  The IEEE Global Initiative on Ethics of Autonomous and Intelligent Systems, editor. Ethically Aligned Design: A Vision for Prioritizing Human Well-being with Autonomous and Intelligent Systems. IEEE, 2019.
-  The Software Testing Help (STH) Blog. Top 40 static code analysis tools (best source code analysis tools), 2019.
-  J. E. Tomayko. The story of self-repairing flight control systems. Dryden Historical Study No. 1. NASA Dryden Flight Research Center, 2003.
-  C. Torens, F. Adolf, and L. Goormann. Certification and software verification considerations for autonomous unmanned aircraft. J. Aerospace Inf. Sys., 11(10):649–664, 2014.
-  W. M. P. van der Aalst. Making work flow: On the application of petri nets to business process management. In J. Esparza and C. Lakos, editors, Proceedings of 23rd International Conference on Applications and Theory of Petri Nets (ICATPN’02), volume 2360 of Lecture Notes in Computer Science, pages 1–22. Springer, 2002.
-  W. M. P. van der Aalst. Process Mining – Discovery, Conformance and Enhancement of Business Processes. Springer, 2011.
-  Vienna Convention on Road Traffic, 1968. http://www.unece.org/trans/conventn/crt1968e.pdf.
-  W. Visser, K. Havelund, G. P. Brat, S. Park, and F. Lerda. Model Checking Programs. Automated Software Engineering, 10(2):203–232, 2003.
-  M. Webster, C. Dixon, M. Fisher, M. Salem, J. Saunders, K. Koay, K. Dautenhahn, and J. Saez-Pons. Toward Reliable Autonomous Robotic Assistants Through Formal Verification: A Case Study. IEEE Transactions on Human-Machine Systems, 46(2):186–196, 2016.
-  M. P. Webster, N. Cameron, M. Fisher, and M. Jump. Generating certification evidence for autonomous unmanned aircraft using model checking and simulation. J. Aerospace Inf. Sys., 11(5):258–279, 2014.
-  R. A. Whitehurst and T. F. Lunt. The SeaView verification. In Proceedings of 2nd IEEE Computer Security Foundations Workshop (CSFW’89), pages 125–132. IEEE Computer Society, 1989.
-  A. F. T. Winfield, K. Michael, J. Pitt, and V. Evers. Machine ethics: The design and governance of ethical AI and autonomous systems. Proceedings of the IEEE, 107(3):509–517, 2019.
-  M. Winikoff. BDI Agent Testability Revisited. Journal of Autonomous Agents and Multi-Agent Systems (JAAMAS), 31(5):1094–1132, 2017.
-  M. Winikoff and S. Cranefield. On the testability of BDI agent systems. Journal of Artificial Intelligence Research, 51:71–131, 2014.
-  C. Wohlin and P. Runeson. Certification of software components. IEEE Trans. Software Eng., 20(6):494–499, 1994.
-  R. Woodman, A. F. T. Winfield, C. J. Harper, and M. Fraser. Building Safer Robots: Safety Driven Control. International Journal of Robotics Research, 31(13):1603–1626, 2012.
-  M. Wooldridge and N. R. Jennings, editors. Intelligent Agents, ECAI-94 Workshop on Agent Theories, Architectures, and Languages, Amsterdam, The Netherlands, August 8-9, 1994, Proceedings, volume 890 of LNCS. Springer, 1995.
-  M. Wooldridge and N. R. Jennings. Intelligent agents: theory and practice. Knowledge Eng. Review, 10(2):115–152, 1995.
-  L. Xiao, P. H. Lewis, and S. Dasmahapatra. Secure Interaction Models for the HealthAgents System. In Proc. 27th International Conference on Computer Safety, Reliability, and Security (SAFECOMP), volume 5219 of LNCS, pages 167–180. Springer, 2008.
-  M. Yang and K.-P. Chow. An information extraction framework for digital forensic investigations. In IFIP International Conference on Digital Forensics, pages 61–76. Springer, 2015.
-  N. Yorke-Smith, S. Saadati, K. L. Myers, and D. N. Morley. Like an intuitive and courteous butler: A proactive personal agent for task management. In Proceedings of 8th International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS’09), pages 337–344, 2009.
-  H. Yu, C.-W. Lin, and B. Kim. Automotive software certification: current status and challenges. SAE International Journal of Passenger Cars – Electronic and Electrical Systems, 9(2016-01-0050):74–80, 2016.
-  N. Zhang, J. Wang, and Y. Ma. Mining domain knowledge on service goals from textual service descriptions. IEEE Transactions on Services Computing, pages 1–1, 2018.
-  Y. Zhao and K. Y. Rozier. Formal specification and verification of a coordination protocol for an automated air traffic control system. Science of Computer Programming Journal, 96(3):337–353, December 2014.