The development of Autonomous Vehicles (AVs) has made significant progress, and it is expected that AVs will be mainstream by 2040 or earlier. An important aspect in the development of AVs is the safety assessment [3, 4, 5, 6, 7]. For legal and public acceptance of AVs, a clear definition of system performance is important, as are quantitative measures for the system quality. The more traditional methods [8, 9], used for the evaluation of driver assistance systems, are no longer sufficient for assessing the safety of higher-level AVs, as it is not feasible to complete the quantity of testing required by these methodologies. Therefore, the development of assessment methods is important in order not to delay the deployment of AVs.
One of the many challenges regarding the assessment of an AV is to agree on a procedure that results in a reliable evaluation of the AV, provided that:
the assessment is sufficiently tailored to the Operational Design Domain (ODD) and Dynamic Driving Task (DDT) of the AV,
the proprietary and confidential information regarding the development of the AV are respected, and
the resources are limited.
In this paper, we propose a procedure for the assessment of an AV that takes into account the aforementioned considerations. The procedure assumes a scenario-based approach for assessing the safety [6, 11, 12]. In our procedure, three stakeholders are considered: the applicant that provides the AV, the authority that decides on the approval of the AV for road testing, and the assessor that performs the independent safety assessment of the AV. Based on the requirements set by the authority, the applicant and the assessor need to come to an agreement on the set of tests for the safety assessment. Based on the test results from both the applicant and the assessor, the assessor evaluates whether the AV is ready for deployment on the road and, if so, under which conditions.
To the best of the authors’ knowledge, there is no literature that provides a procedure for the assessment of AVs while distinguishing between the different stakeholders that are involved. The proposed procedure could be used within a legal framework for the approval of AVs for road testing. Eventually, as technology improves, the procedure might be a good starting point for developing the legal framework for the type-approval of AVs. In the case of consumer testing performed by a New Car Assessment Programme (NCAP), the result is not an approval, but the proposed procedure could be used as well.
2 Problem definition
In this section, we first explain why many players in the automotive field support scenario-based testing for the assessment of performance aspects of automated and autonomous vehicles, such as safety. Next, in Section 2.2, we elaborate on the different types of stakeholders that are involved in the assessment. Given an assessment with these stakeholders, in Section 2.3, we describe the problem and the corresponding practical challenges that our procedure should address.
2.1 Scenario-based testing
Perhaps the most basic way of assessing the performance of an AV is to drive the AV in its intended area of operation. While this might provide useful data for further development of the AV, Kalra and Paddock [13] and Wachenfeld and Winner [7] show that the number of hours of driving required to demonstrate with enough certainty that the AV performs safely is infeasible. So, when it comes to demonstrating the reliability of an AV, another approach is necessary.
An advantage of scenario-based testing is that it allows for selecting those scenarios that are relevant for the safety evaluation. Therefore, a large repetition of scenarios that are relatively straightforward to deal with can be prevented. Furthermore, because virtual simulations can potentially be used for performing scenario-based tests, the number of physical tests can be reduced [14], ultimately resulting in a less expensive assessment. However, one of the main challenges of scenario-based testing is the selection of the scenarios themselves [15].
Although there are challenges to be resolved for scenario-based testing, it is used in large research projects and by many players in the automotive field. For example, in European projects such as AdaptIVe [16], ENABLE-S3 [17], and HEADSTART [18], scenario-based testing is proposed for assessing several performance aspects, such as safety and emissions. In Germany, a large project named PEGASUS was fully dedicated to the scenario-based assessment of automated driving functions [19]. Also, in Japan [20] and in Singapore [21], a scenario-based approach is adopted for the assessment of AVs. To support the scenario-based assessment, different initiatives have been started to create a database of scenarios and test cases, see, e.g., [22, 23].
Considering the adoption of, and the many resources dedicated to, scenario-based assessment, it appears to be a promising basis for a framework for the safety assessment of AVs.
2.2 Stakeholders for the assessment
We consider three types of stakeholders for the assessment of an AV regarding its readiness to be deployed in a certain operational area:
the applicant, who applies for the approval to deploy an AV for testing on the road; in practice, the applicant can be, e.g., the operator or the developer of the vehicle,
the assessor, who performs the independent safety assessment of the AV, and
the authority, who eventually decides whether the AV can be deployed; this can be, e.g., the local vehicle authority.
It is a prerequisite for a proper process that the assessor is independent of both the applicant and the authority. Each stakeholder has different responsibilities, see Table 1. The applicant provides the AV to be assessed and is responsible for providing an AV that meets the applicable safety requirements. The assessor is responsible for performing the tests in the assessment, not for deciding whether the AV is approved. Instead, the assessor advises the authority on whether the AV can be approved and, if necessary, under which conditions. It is then up to the authority to make the final decision. Another responsibility of the authority is to organize a legal framework for the safety assessment of AVs and to set and communicate the requirements for the AV.
Table 1: Responsibilities of the stakeholders.

| Stakeholder | Responsibility |
| --- | --- |
| Applicant | Apply for approval to deploy the AV. |
| | Prepare the AV to meet the applicable safety requirements. |
| | Provide the AV to be assessed. |
| Assessor | Perform an independent safety assessment of the AV. |
| | Report results and advise the authority for AV approval. |
| Authority | Organize a legal framework for safety assessment of AVs. |
| | Specify needs and set realistic AV safety requirements. |
| | Decide on the approval of the AV. |
Within the terminology of the United Nations Economic Commission for Europe (UNECE), the assessor is called “technical service”.
The presented stakeholders are particularly applicable to the non-US approach. Currently, in the US, the roles of the applicant and the assessor are often fulfilled by a single stakeholder. Nevertheless, the presented procedure might still be applicable if a distinction is made between these two roles within the organization of that stakeholder.
2.3 Problem statement
Scenario-based assessment comes with many challenges. For example, questions like “How to generate the test cases?” and “How to validate the fidelity of the virtual simulations?” are discussed extensively in the literature. In this paper, however, we specifically address the challenges that arise when considering the different stakeholders and each of their responsibilities and capabilities. Therefore, the problem we address in this paper can be formulated as follows:
What procedure could be used for the safety assessment of an Autonomous Vehicle (AV) by an independent assessor?
In the framework presented in this paper, the following challenges are addressed:
The tests need to be tailored to the ODD and DDT description of the AV. However, it is expected to be infeasible for the assessor to go through a rigorous analysis to define all relevant tests for each applicant.
It is assumed that the applicant does not want to disclose details of sensor and system implementation or even detailed test results because of proprietary or confidential information contained in these results. As a result, the assessor does not have access to these detailed test results. The challenge is that the assessor still needs enough confidence in the safe operation of the AV without having access to the detailed test results carried out by the applicant.
Due to the complex ODD and DDT, it is expected that many tests are required to obtain enough confidence in the assessment of the AV. Also, it is assumed that the assessor’s resources are too limited to conduct all tests physically. The challenge is that the procedure should still provide the assessor with enough evidence to conclude whether or not the AV is safe.
3 Procedure for the safety assessment
This paper assumes that many of the relevant tests for the safety assessment are performed in a virtual simulation environment that is controlled by the applicant. The proposed procedure intends to consider all results, both from virtual simulations and from physical tests. Because the assessor does not have access to the required models of the AV under test, the assessor has the capability to perform physical tests on the AV. How the different results are balanced in the assessment, considering the virtual and physical test results of the applicant and the physical test results of the assessor, is schematically presented in Figure 1. Each rectangular block represents an action. The procedure distinguishes between actions for which the applicant is responsible and actions for which the assessor is responsible. The procedure consists of the following actions:
The first action is to derive which system-level tests need to be performed, with reference to the ODD and DDT of the AV under test. Here, “system-level” is mentioned explicitly because it is assumed that, also in case of a failure of any of the subsystems, the AV would fail the system-level tests. Note, however, that it is advised that the applicant ensures that each of the subsystems has undergone a rigorous assessment before applying for the AV assessment.
If the derived tests are acceptable, the next action is to select the tests for the assessment. Here, a distinction is made between tests for which the applicant is fully responsible and physical tests that are conducted by the assessor. The latter will focus more on spot checking.
Once the tests are selected, these tests need to be conducted. The results of these tests are described using prescribed metrics. Note, however, that these metrics may not reveal too much information, as it is assumed that the applicant does not want to disclose details of sensor and system implementation, or even detailed test results, because of the proprietary or confidential information contained in these results.
The final step is to assess the results from the tests and to formulate an advice for the authority on whether the AV is ready for deployment and under which conditions.
In the following sections, each of the actions is further detailed. We end this section with a short note on monitored deployment in case of a successful completion of the assessment.
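The flow of these actions, including the restart when the derived tests are not acceptable, can be sketched as follows. This is a minimal sketch of our own; each action is represented by a hypothetical callable, and the names are illustrative rather than part of the procedure.

```python
def assessment_procedure(derive_tests, check_tests, select_tests,
                         conduct_tests, assess_results, max_rounds=10):
    """Run the actions of Figure 1; restart while the derived tests are not OK."""
    for _ in range(max_rounds):
        descriptions = derive_tests()          # action 1: derive system-level tests
        if not check_tests(descriptions):      # "Not OK": restart the process
            continue
        selected = select_tests(descriptions)  # action 2: select tests for the assessment
        results = conduct_tests(selected)      # action 3: conduct tests, report via metrics
        return assess_results(results)         # action 4: assess results, advise authority
    raise RuntimeError("test descriptions remained incomplete")

# Hypothetical usage with stub actions:
advice = assessment_procedure(
    derive_tests=lambda: ["test 1.1", "test 1.2"],
    check_tests=lambda tests: len(tests) > 0,
    select_tests=lambda tests: tests,
    conduct_tests=lambda tests: {t: "good" for t in tests},
    assess_results=lambda results: "positive advice",
)
print(advice)  # positive advice
```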
3.1 Deriving test descriptions
Based on the ODD and the DDT of the AV, the tests are derived. Following the same reasoning as in [4], a test is an evaluation of:
a statement on the system-under-test (test criteria; what are we going to evaluate using the test);
under a set of specified conditions (test case; how are we going to evaluate the test criteria);
using quantitative measures (metrics; how to express the outcome of the test quantitatively);
and a reference of what would be the acceptable outcome (reference; when is the outcome acceptable).
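To illustrate this taxonomy, a test description can be captured in a small data structure. This is our own sketch: the field names and the example test below are hypothetical and not prescribed by [4].

```python
from dataclasses import dataclass

@dataclass
class TestDescription:
    """One system-level test, following the four elements listed above."""
    criterion: str    # what we evaluate (statement on the system-under-test)
    test_case: str    # under which specified conditions we evaluate it
    metric: str       # how the outcome is expressed quantitatively
    reference: float  # threshold defining an acceptable outcome

# Hypothetical example of a derived test description:
stop_test = TestDescription(
    criterion="AV stops for a pedestrian crossing ahead",
    test_case="Pedestrian enters the lane at 30 m while the AV drives 50 km/h",
    metric="minimum distance to pedestrian [m]",
    reference=1.0,  # outcome is acceptable if at least 1.0 m remains
)
```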
Since the applicant has designed and developed the AV, it is expected that the applicant has a clear notion of the tests that are required for a complete assessment of the AV and that the AV should appropriately handle. Conversely, if the set of relevant test descriptions is not complete during the development of the AV, it is conceivable that the AV will not operate safely in all circumstances possible within the ODD.
Although it is expected that the applicant provides all relevant test descriptions, it is important that the applicant and the assessor discuss these test descriptions, and that a check is made whether or not the test descriptions are complete and cover the ODD sufficiently. If an important test description is missing, it is conceivable that the AV is not specifically designed to pass the corresponding test. In order to assess the completeness of the test descriptions provided by the applicant, the assessor needs to define the test domain for the relevant system-level test descriptions and use these to investigate if any important test descriptions are missing. Here, the so-called test domain refers to a more high-level description of the range of tests that are expected, rather than an enumeration of the large number of relevant tests.
If it turns out that the test descriptions that are provided by the applicant are not complete, the process needs to be restarted, as indicated by the “Not OK” line in Figure 1. On the other hand, if the test descriptions are deemed to be complete enough, the assessment proceeds to the next step: selecting tests for the assessment.
3.2 Selecting tests for the assessment
In principle, the applicant is expected to provide results for all tests. In the next section, we explain what these results may look like. Based on these results, tests are selected for the physical testing performed by the assessor. This is indicated by the arrow pointing from “results of tests” of the applicant to “select tests for assessment” in Figure 1.
A test is selected for physical testing by the assessor if any of the following three statements are true:
The applicant does not provide a result. Although the applicant is expected to provide results for most tests, it might be possible that there are some tests for which the applicant does not have the resources to perform the tests reliably, for example if specific tooling is required. Note, however, that if the applicant does not provide results for too many tests, the assessment automatically results in a fail.
The result seems inconsistent. If there is sufficient reason for the assessor not to trust the result provided by the applicant, the test can be performed by the assessor to check this result.
The test is selected for spot checking. The main reason to perform spot checking is to assess the fidelity of the results provided by the applicant.
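These three selection rules can be expressed as a simple predicate. The sketch below uses our own naming; in practice, the set of spot checks would be drawn by the assessor, e.g., at random.

```python
def select_for_physical_test(test_id, applicant_results,
                             inconsistent_ids, spot_check_ids):
    """Return True if the assessor should physically perform this test.

    applicant_results: dict mapping test IDs to reported ratings (None if absent)
    inconsistent_ids:  test IDs whose reported result the assessor distrusts
    spot_check_ids:    test IDs drawn for spot checking the applicant's fidelity
    """
    no_result = applicant_results.get(test_id) is None
    return no_result or test_id in inconsistent_ids or test_id in spot_check_ids

# Hypothetical usage:
reported = {"1.1": "good", "1.2": "fair", "1.3": None}
print(select_for_physical_test("1.3", reported, set(), set()))    # True: no result
print(select_for_physical_test("1.2", reported, set(), {"1.2"}))  # True: spot check
print(select_for_physical_test("1.1", reported, set(), set()))    # False
```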
The process of test selection is summarized in Figure 2.
3.3 Conducting the tests
As explained in the previous section, the applicant is expected to provide results for most tests. However, it is assumed that the applicant does not want to disclose detailed test results. Therefore, a rating scheme is proposed. Using a specific metric for each test, three references are defined: an acceptable result, a fair result, and a good result. If the result of the test is worse than the acceptable reference, a “fail” is reported. If the result passes the acceptable reference but not the fair reference, an “acceptable” is reported. Similarly, a “fair” is reported if the result is between the fair and the good reference. If the result is better than the good reference, a “good” is reported. This is schematically shown in Figure 3.
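For illustration, this rating scheme can be sketched as a small function. This is our own sketch, assuming a metric for which larger values are better; the three references, and the example thresholds below, are hypothetical and would be defined per test.

```python
def rate(value, acceptable, fair, good):
    """Map a metric value to the four-level rating of Figure 3.

    Assumes larger metric values are better, with
    acceptable < fair < good as the three test-specific references.
    """
    if value < acceptable:
        return "fail"
    if value < fair:
        return "acceptable"
    if value < good:
        return "fair"
    return "good"

# Hypothetical references for a minimum-distance metric [m]:
print(rate(0.8, 1.0, 2.0, 3.0))  # fail
print(rate(1.5, 1.0, 2.0, 3.0))  # acceptable
print(rate(2.5, 1.0, 2.0, 3.0))  # fair
print(rate(3.5, 1.0, 2.0, 3.0))  # good
```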
In principle, the applicant is free to choose any method to derive the results. However, considering the large number of tests, the use of virtual simulations seems inevitable. In practice, it is expected that a combination of virtual simulations, physical tests, and mixed forms, such as hardware-in-the-loop testing, will be used to determine the test results.
On the other hand, the tests by the assessor are performed physically. The main reason for this is that virtual simulations are ruled out, as these would require the applicant to provide a model of the AV, which is expected to be impossible for proprietary reasons.
3.4 Assess results
The following assessment results are distinguished per test:
In case the test results show that for the specific test the AV performs acceptably (i.e., “acceptable”, “fair”, or “good”, see Figure 3), the test is passed. If this is not the case, the specific test fails.
Inspired by IATF 16949 on automotive quality management [24], a passed test may result in a non-conformity. An “acceptable” result automatically leads to a non-conformity (NC). This means that the response of the AV deviates substantially from the response that is qualified as “good”, but the deviation is not severe. Since the AV meets the minimum requirement for this test and, consequently, safety is not compromised, there is no reason to fail the AV based on this test. Nevertheless, an NC is issued to indicate that the applicant is asked to consider improvements, e.g., for a next version of the system.
In case the test is also performed by the assessor and the corresponding result is worse than the reported result of the applicant, this also leads to an NC.
The assessment of a test result might come with an observation (OB) that needs consideration of the applicant.
If a test results in a fail, then either the assessment results in a negative advice of the assessor to the authority or it is advised to only allow for deployment of the AV under certain conditions. For example, if the only tests that are failed concern low-light conditions, the AV might be deployed under the condition that it operates only from sunrise till sunset.
NCs and OBs do not lead to an immediate fail of the assessment. However, it is likely that they lead to a fail in a future assessment, e.g., when test criteria become increasingly demanding, and the applicant does not appropriately consider such NCs or OBs. NCs provide information to the applicant on how requirements might develop in the future, which, consequently, gives direction and motivation on continuous improvement of AVs regarding safety. On the other hand, many NCs – the AV barely passes the test in many cases – might mean that safety is compromised and, therefore, it might also result in a negative advice of the assessor to the authority regarding the deployment of the AV.
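The rules above can be summarized in a small function. The encoding below is our own sketch: the effective result of a test is the worst available rating, an “acceptable” effective result yields an NC, and an assessor rating that is worse than the reported rating yields an NC for a failed fidelity check.

```python
RATING_ORDER = {"fail": 0, "acceptable": 1, "fair": 2, "good": 3}

def assess_test(applicant_rating, assessor_rating=None):
    """Return (passed, non_conformities) for a single test.

    assessor_rating is None when the test is not physically
    repeated by the assessor.
    """
    ratings = [applicant_rating]
    if assessor_rating is not None:
        ratings.append(assessor_rating)
    # The effective result of the test is the worst available rating.
    effective = min(ratings, key=RATING_ORDER.get)
    passed = effective != "fail"
    non_conformities = []
    if passed and effective == "acceptable":
        non_conformities.append("result barely meets the minimum requirement")
    if (assessor_rating is not None
            and RATING_ORDER[assessor_rating] < RATING_ORDER[applicant_rating]):
        non_conformities.append("fidelity check failed")
    return passed, non_conformities

# Hypothetical examples, mirroring the rules above:
print(assess_test("good", "good"))  # (True, [])
print(assess_test("acceptable"))    # (True, ['result barely meets the minimum requirement'])
print(assess_test("fail"))          # (False, [])
```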
Note that when many NCs are observed, the AV probably would not be able to pass all tests if all tests were performed physically by the assessor. Theoretically, however, this is still possible. To minimize the risk of having an AV that passes all tests, but with many NCs, a system using demerit points is introduced. The AV starts with, e.g., 100 points, and in the assessment, 1, 2, or 3 points are subtracted for each NC, depending on the severity of the NC. Once the number of points for the AV is reduced to 0, the AV is indicated to have failed the assessment because of an overrun of NCs. The numbers given here are merely provided as an example.
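This demerit-point system can be sketched as follows. The starting budget of 100 points and the per-NC deductions of 1 to 3 points are the example numbers from the text; in practice, they would be set by the authority.

```python
def demerit_check(nc_severities, start_points=100):
    """Subtract 1, 2, or 3 points per non-conformity; fail when reaching 0.

    nc_severities: iterable of NC severities (1 = minor ... 3 = severe).
    Returns (remaining_points, passed).
    """
    points = start_points
    for severity in nc_severities:
        if severity not in (1, 2, 3):
            raise ValueError("severity must be 1, 2, or 3")
        points -= severity
    points = max(points, 0)
    return points, points > 0

# Hypothetical assessment with 40 minor and 25 moderate NCs:
print(demerit_check([1] * 40 + [2] * 25))  # (10, True)
```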
3.5 Monitored deployment
A successful completion of the proposed assessment might result in the approval for the deployment of the AV under the condition that the behavior of the AV on the road is continuously monitored. We propose that during such a deployment phase, the applicant is required to upload detailed driving data to allow for monitoring the AV behavior. This is implemented for two reasons:
After completion of the assessment pipeline, road and/or vehicle authorities may require that safety is monitored continuously when driving on the public road.
The uploaded data may be used to improve the generation of tests and the selection of relevant test cases for a particular AV, as it is possible that some tests have been overlooked during the initial assessment process or that situations on the road gradually change with changes in traffic, e.g., because of the introduction of new mobility systems.
The feedback to the data acquisition element allows for ongoing learning and improvement of the standards and assessment systems, while being able to adapt to new types of transportation, such as personal mobility devices. For example, additional test cases could be identified and incorporated into future safety assessment procedures. A deployment might also cover new operational areas; the scenario database would then be extended with scenarios that potentially differ between such areas. Moreover, to obtain a scenario database that is ‘complete’, i.e., statistically accurate, operational data collection over an extended period is expected to be required, which most probably will not be realized before the deployment is operationalized. In other words, the imperfection of the assessment framework should not become a barrier to the introduction of new safe mobility solutions onto the market, provided these vehicles are tested to be safe for all currently known conditions. The assessment method, especially the step regarding monitored deployment, supports the continuous increase in knowledge on the state of the art of road safety and thereby prepares the safety assessment method to be sustainable for the future.
4 Hypothetical example
In this section, we present a hypothetical example to illustrate what the safety assessment discussed in the previous sections may look like. Table 2 lists the results of an assessment consisting of 14 tests. Note that in reality, the number of tests is likely to be much larger, but for the sake of the example, we keep the number of tests limited.
The applicant reports a good result for the first test (test ID 1.1). Because the assessor comes to the same conclusion, the fidelity check is passed, as is the test. The next three tests (1.2 to 1.4) are not performed by the assessor, so the reported results of the applicant are included in the assessment result. These three tests are all passed, but because the last of these tests (1.4) barely passes the minimum requirement (an “acceptable” is reported), an NC is issued.
The tests 2.1 to 2.4 are all passed. However, the applicant reports a better result for test 2.2 than the assessor. Therefore, the fidelity check fails, and an NC is issued. For test 2.4, an OB is made. This could be, e.g., because the AV showed erratic behavior during the test performed by the assessor, even though safety was not compromised.
For the next three tests (3.1 to 3.3), an NC is issued, and one test is failed. The NC is issued for two reasons: the assessor reports an acceptable result and the fidelity check has failed. The applicant reports a fail for test 3.3. There is no reason for the assessor to also perform this test, because regardless of that result, the applicant has failed the test. As mentioned in Section 3.4, this might lead to a failed assessment or to some restrictions for the deployment of the AV. Note that for test 3.2, the result of the assessor is better than the reported result of the applicant. This might be caused by the applicant reporting the worst-case outcome, e.g., the outcome with a relatively long detection delay of an object, while the detection delay is shorter during the test performed by the assessor. Because the result of the assessor is better, this does not lead to an NC.
For the last three tests, again an NC is issued, and a test is failed. The NC is issued because the applicant reports an acceptable result. Test 4.3 is failed even though the applicant has not reported to fail this test. This might lead to additional measures that need to be taken before the applicant can proceed with the deployment of the AV.
5 Conclusions
We proposed a procedure for the assessment of an Autonomous Vehicle (AV) in order to answer the question “What procedure could be used for the safety assessment of an AV by an independent assessor?” In the proposed procedure, a distinction is made between activities of the applicant and the independent assessor, while considering the limited resources and the proprietary and confidential information related to the development of the AV.
There are still some open questions. For example, how can it be assured that the set of tests covers the ODD and DDT sufficiently? Also, since virtual simulations seem inevitable, how can it be proven that the results of the virtual simulations, alongside the limited number of physical tests, provide a reliable evaluation of the AV?
Despite these open questions, the proposed procedure can be used as a starting point for discussions on the development of a legal framework for authorities for the safety assurance of AVs before deploying these vehicles on the public road.
1. A. M. Madni, “Autonomous system-of-systems,” in Transdisciplinary Systems Engineering. Springer, 2018, pp. 161–186.
2. K. Bimbraw, “Autonomous cars: Past, present and future. A review of the developments in the last century, the present scenario and the expected future of autonomous vehicle technology,” in 12th International Conference on Informatics in Control, Automation and Robotics (ICINCO), vol. 1, Jul. 2015, pp. 191–198.
3. K. Bengler, K. Dietmayer, B. Färber, M. Maurer, C. Stiller, and H. Winner, “Three decades of driver assistance systems: Review and future perspectives,” IEEE Intelligent Transportation Systems Magazine, vol. 6, no. 4, pp. 6–22, 2014.
4. J. E. Stellet, M. R. Zofka, J. Schumacher, T. Schamm, F. Niewels, and J. M. Zöllner, “Testing of advanced driver assistance towards automated driving: A survey and taxonomy on existing approaches and open questions,” in IEEE 18th International Conference on Intelligent Transportation Systems, Sep. 2015, pp. 1455–1462.
5. T. Helmer, K. Kompaß, L. Wang, T. Kühbeck, and R. Kates, Safety Performance Assessment of Assisted and Automated Driving in Traffic: Simulation as Knowledge Synthesis. Springer International Publishing, 2017, pp. 473–494.
6. A. Pütz, A. Zlocki, J. Bock, and L. Eckstein, “System validation of highly automated vehicles with a database of relevant traffic scenarios,” in 12th ITS European Congress, 2017, pp. 1–8.
7. W. Wachenfeld and H. Winner, “The release of autonomous vehicles,” in Autonomous Driving. Springer, 2016, pp. 425–449.
8. ISO 26262, ISO 26262: Road Vehicles – Functional Safety, International Organization for Standardization Std., 2018. [Online]. Available: https://www.iso.org/standard/68383.html
9. A. Knapp, M. Neumann, M. Brockmann, R. Walz, and T. Winkle, “Code of practice for the design and evaluation of ADAS,” RESPONSE III: a PReVENT Project, Aug. 2009.
10. P. Koopman and M. Wagner, “Challenges in autonomous vehicle testing and validation,” SAE International Journal of Transportation Safety, vol. 4, pp. 15–24, 2016.
11. H. Winner, W. Wachenfeld, and P. Junietz, “Safety assurance for highly automated driving: The PEGASUS approach,” in TRB Annual Meeting, 2017.
12. E. Thorn, S. Kimmel, and M. Chaka, “A framework for automated driving system testable cases and scenarios,” National Highway Traffic Safety Administration, Tech. Rep. DOT HS 812623, 2018.
13. N. Kalra and S. M. Paddock, “Driving to safety: How many miles of driving would it take to demonstrate autonomous vehicle reliability?” Transportation Research Part A: Policy and Practice, vol. 94, pp. 182–193, 2016.
14. J. Ploeg, E. de Gelder, M. Slavík, E. Querner, T. Webster, and N. de Boer, “Scenario-based safety assessment framework for automated vehicles,” in 16th ITS Asia-Pacific Forum, 2018, pp. 713–726.
15. S. Riedmaier, T. Ponn, D. Ludwig, B. Schick, and F. Diermeyer, “Survey on scenario-based safety assessment of automated vehicles,” IEEE Access, vol. 8, pp. 87 456–87 477, 2020.
16. C. Roesener, J. Sauerbier, A. Zlocki, F. Fahrenkrog, L. Wang, A. Várhelyi, E. de Gelder, J. Dufils, S. Breunig, P. Mejuto, F. Tango, and J. Lanati, “A comprehensive evaluation approach for highly automated driving,” in 25th International Technical Conference on the Enhanced Safety of Vehicles (ESV), 2017.
17. A. Leitner, D. Watzenig, and J. Ibanez-Guzma, Validation and Verification of Automated Systems. Springer, 2019, results of the ENABLE-S3 Project.
18. N. Wagener, J.-B. Coget, A. Ballis, P. Weißensteiner, G. Morandin, X. Sellar, M. Nieto, O. Bartels, and G. Feddes, “Common methodology for test, validation and certification,” IKA, Tech. Rep. HEADSTART D2.1, 2020.
19. PEGASUS, “PEGASUS method: An overview,” Tech. Rep., 2019, available at https://www.pegasusprojekt.de/en/. Accessed in October, 2020.
20. J. Antona-Makoshi, N. Uchida, K. Yamazaki, K. Ozawa, E. Kitahara, and S. Taniguchi, “Development of a safety assurance process for autonomous vehicles in Japan,” in 26th International Technical Conference on the Enhanced Safety of Vehicles (ESV), 2019.
21. CETRAN. Centre of Excellence for Testing & Research of Autonomous Vehicles – NTU. Accessed in October, 2020. [Online]. Available: https://cetran.sg/
22. H. Elrofai, J.-P. Paardekooper, E. de Gelder, S. Kalisvaart, and O. Op den Camp, “Scenario-based safety validation of connected and automated driving,” Netherlands Organization for Applied Scientific Research, TNO, Tech. Rep., 2018. [Online]. Available: http://publications.tno.nl/publication/34626550/AyT8Zc/TNO-2018-streetwise.pdf
23. R. Myers and Z. Saigol, “Pass-fail criteria for scenario-based testing of automated driving systems,” arXiv preprint arXiv:2005.09417, 2020.
24. IATF 16949:2016, “International standard for automotive quality management systems,” The International Automotive Task Force (IATF), Tech. Rep., 2016.