SafeBench: A Benchmarking Platform for Safety Evaluation of Autonomous Vehicles

As shown by recent studies, machine intelligence-enabled systems are vulnerable to test cases resulting from either adversarial manipulation or natural distribution shifts. This has raised great concerns about deploying machine learning algorithms for real-world applications, especially in the safety-critical domains such as autonomous driving (AD). On the other hand, traditional AD testing on naturalistic scenarios requires hundreds of millions of driving miles due to the high dimensionality and rareness of the safety-critical scenarios in the real world. As a result, several approaches for autonomous driving evaluation have been explored, which are usually, however, based on different simulation platforms, types of safety-critical scenarios, scenario generation algorithms, and driving route variations. Thus, despite a large amount of effort in autonomous driving testing, it is still challenging to compare and understand the effectiveness and efficiency of different testing scenario generation algorithms and testing mechanisms under similar conditions. In this paper, we aim to provide the first unified platform SafeBench to integrate different types of safety-critical testing scenarios, scenario generation algorithms, and other variations such as driving routes and environments. Meanwhile, we implement 4 deep reinforcement learning-based AD algorithms with 4 types of input (e.g., bird's-eye view, camera) to perform fair comparisons on SafeBench. We find our generated testing scenarios are indeed more challenging and observe the trade-off between the performance of AD agents under benign and safety-critical testing scenarios. We believe our unified platform SafeBench for large-scale and effective autonomous driving testing will motivate the development of new testing scenario generation and safe AD algorithms. SafeBench is available at https://safebench.github.io.


1 Introduction

Innovations driven by recent progress in machine learning (ML) have shown human-competitive performance in sensing (He et al., 2015), decision-making (Silver et al., 2018), and manipulation (Agostinelli et al., 2019). However, several studies have shown that when such powerful ML models are exposed to adversarial attacks, they can be fooled, evaded, and misled in ways that have profound security implications: image recognition, natural language processing, and audio recognition systems have all been attacked (Xiang et al., 2019; Huang et al., 2017a, 2011; Yuan et al., 2018). In addition, recent studies have shown that by putting printed stickers onto a stop sign, the camera perception system of an autonomous vehicle (AV) can be easily fooled into misrecognizing it as a speed limit sign (the adversarial target) (Evtimov et al., 2017); a generated adversarial 3D object can easily fool point-cloud-based models and systems such as LiDAR perception (Cao et al., 2020); and a carefully crafted adversarial object can even fool sophisticated sensor-fusion-based ML systems equipped on autonomous vehicles (Cao et al., 2021), among other attacks (Wallace et al., 2019; Carlini and Wagner, 2018). Such attacks usually lead to potentially severe consequences (e.g., car accidents) and have raised great security concerns (Szegedy et al., 2014; Goodfellow et al., 2015; Kurakin et al., 2016). As ML-based models and approaches expand to real-world safety-critical applications, such as Autonomous Driving (AD), safety is becoming the crux of the transition from theory to practice (Administration, 2017; Matheny et al., 2019), and it is vitally important to quantitatively and efficiently evaluate the robustness and safety of such applications before their mass production and deployment. As listed in the National Artificial Intelligence Research and Development Strategic Plan (Kratsios, 2019), developing effective evaluation methods for AI and ML is considered one of the top priorities; failing to meet this demand can cost lives, stifle innovation, and hurt the economy, among other societal consequences.

Challenges. Despite the great importance of safety evaluation for AD algorithms, it is challenging to comprehensively and quantitatively evaluate them due to both real-world data and evaluation design challenges. First, in practice, safety-critical driving scenarios are “rare” – one may be encountered only after a large number of driving miles (6) – which means that current AD testing requires driving millions of miles at large economic and environmental cost. In addition, such rarity requires evaluation methods with acceleration capabilities and probabilistic convergence guarantees to avoid being over-optimistic. Previous work (Zhao, 2016; O’Kelly et al., 2018) achieves this for simple abstract models using large-deviation techniques such as importance sampling (IS) and the cross-entropy (CE) method (Bucklew, 2013). However, these approaches reach bottlenecks when dealing with ML algorithms of increasing complexity. In fact, recent studies (Arief et al., 2020) have shown that these classical IS/CE-based approaches and tools may consistently underestimate the risk when dealing with complex systems. Moreover, such peril has been identified in different evaluation approaches (Zhao et al., 2016, 2018; Huang et al., 2018b, 2017b, c, a), which have already been adopted by industry (53) and test agencies (Peng, 2020) in the U.S. to assess the safety of AVs. Second, although several learning-based scenario generation approaches have been proposed to overcome the above challenge (Najm et al., 2007; Wang et al., 2021a; Calò et al., 2020; Wen et al., 2020), existing evaluation tools and platforms are usually based on their own designs, such as dataset selection, safety-critical scenario definition and generation, evaluation metrics, and input types. This makes it very challenging to fairly compare different AD algorithms or interpret different evaluation results.

In this paper, we focus on designing and developing SafeBench, the first unified robustness and safety evaluation platform for AD algorithms. In particular, we design SafeBench based on the open-source simulation platform Carla (Dosovitskiy et al., 2017). SafeBench consists of 4 modules: the Agent Node, Ego Vehicle, Scenario Node, and Evaluation Node. Based on our platform, we systematically evaluate AD algorithms on 8 types of generated safety-critical testing scenarios, such as Straight Obstacle and Lane Changing, together with corresponding benign scenarios. For each safety-critical scenario, we implement 4 scenario generation algorithms for comparison. In addition, for each scenario, we select 10 diverse driving routes to ensure the generalization of our evaluation results. We report the evaluation results based on 10 metrics, such as collision rate, frequency of running red lights, and average percentage of route completion. Finally, we develop 4 reinforcement learning-based AD algorithms with different perceptual capabilities on SafeBench. Specifically, we provide 4 input types, ranging from low-dimensional state representations to complicated visual inputs. Based on our comprehensive evaluation, we find that (1) there is a performance trade-off for different AD algorithms under benign and safety-critical scenarios, (2) some safety-critical scenarios have higher transferability across AD algorithms, (3) different scenario generation algorithms achieve different levels of effectiveness even when generating the same scenario, and (4) different AD algorithms achieve advantages over others under different metrics. Our findings suggest that testing AD algorithms on high-quality safety-critical scenarios is necessary and can largely improve testing efficiency, and that a combination of testing scenarios and generation algorithms should be considered for effective testing.

Contributions. In this work, we aim to provide the first unified evaluation platform for different AD algorithms by generating diverse safety-critical scenarios with different generation algorithms and evaluation metrics. Our evaluation platform SafeBench includes the following properties.


  • Unified benchmarking platform with modularized design. Our evaluation platform consists of 4 modules, including Ego vehicle, Agent, Scenario, and Evaluation. It is also flexible to replace, add, or delete modules for future functionalities and evaluations.

  • Comprehensive coverage of safety-critical scenario generation. In SafeBench, we have integrated testing scenarios covering 8 types of pre-crash safety-critical scenarios, which provide comprehensive coverage of known safety-critical scenarios in the real world, and it is flexible to add more testing scenarios by applying the generation methods to new template scenarios.

  • Comprehensive coverage of scenario generation algorithms. For each safety-critical testing scenario, we develop 4 generation algorithms, so that we are able to evaluate AD safety not only on the scenario level but also on the generation-algorithm level.

  • Diverse metrics on safety measurement of different AD algorithms. We report our evaluation based on 10 evaluation metrics organized into three levels: safety, functionality, and etiquette.

  • General leaderboard of safety evaluation and extensible findings. We provide a comprehensive leaderboard for the robustness and safety evaluation of AD algorithms, and we observe different performances of these AD algorithms under different controllable settings.

  • High flexibility and effectiveness. Our evaluation platform can be flexibly integrated with other simulation platforms and deployed on different devices. Once an AD algorithm is trained, it can be tested efficiently on our generated testing scenarios.

2 Related work

Existing AD algorithm evaluation approaches and platforms can be categorized into three types based on how the testing driving scenarios are generated. First, data-driven generation and testing approaches (Scanlon et al., 2021; Knies and Diermeyer, 2020; Ding et al., 2018, 2020b) focus on real-world data sampling and distribution density estimation. This line of research is able to model real-world driving conditions, but it requires a large amount of collected data to capture the “rare” safety-critical scenarios for testing. Second, adversary-based generation and testing approaches (Ding et al., 2021a; Zhang et al., 2022; Feng et al., 2021) model the surrounding agents (e.g., vehicles and pedestrians) as adversarial agents to generate safety-critical driving scenarios. Third, knowledge-based generation and testing approaches (Ding et al., 2021b; Wang et al., 2021b; Bagschik et al., 2018) integrate domain knowledge, such as traffic rules, as additional constraints to guide the testing scenario generation process. Recently, the latter two categories have shown efficient and effective evaluation results under specific driving environments and settings, and we therefore mainly focus on them in this work. However, existing driving scenario generation and testing approaches are developed on different platforms with different AD algorithms, sensor configurations, etc., making it challenging to directly compare the effectiveness of different testing scenarios, scenario generation algorithms, and the safety of AD algorithms. Thus, in this work we provide the first unified platform, SafeBench, to generate safety-critical scenarios with different algorithms across a range of environments and configurations for fair comparison based on a comprehensive set of evaluation metrics.

3 SafeBench: benchmarking platform for safety evaluation

In this section, we will first provide an overview of our platform SafeBench, followed by the details of our developed scenario generation algorithms and variants, as well as the evaluation metrics.

3.1 Platform structure

Figure 1: Left: Framework overview of SafeBench. Right: 8 safety-critical driving scenarios - (1) Straight Obstacle (2) Turning Obstacle (3) Lane Changing (4) Vehicle Passing (5) Red-light Running (6) Unprotected Left-turn (7) Right-turn (8) Crossing Negotiation.

Overview. In Figure 1, we show the structure of our SafeBench platform. The platform runs in a Docker container and is built upon the Carla simulator (Dosovitskiy et al., 2017). We use ROS for communication between the modules in the platform. In particular, SafeBench consists of 4 components (nodes), as introduced in the following.

Ego vehicle provides a virtual vehicle, including the configuration of sensors (e.g., the positions and parameters of LiDAR, camera, and radar), the global planner, and the appearance of the vehicle. The AD algorithms under test are deployed in this node to interact with the driving scenarios. Users can change the configuration of this node to satisfy the requirements of their algorithms.

Agent node is designed to train and manage AD algorithms for the ego and surrounding vehicles, taking as input the observation information from the testing scenarios and outputting the control signals. AD algorithms managed by this node can be trained on our platform.

Scenario node is the core part of SafeBench, which is responsible for organizing and generating testing scenarios. These scenarios control the behaviors of traffic participants (e.g., pedestrians and surrounding vehicles) and static driving environments (e.g., road layout and status of traffic lights).

Evaluation node is designed to provide comprehensive evaluations by testing different AD algorithms under diverse generated driving scenarios based on different metrics. The Evaluation Node collects all information during testing and provides an evaluation summary on different levels.
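To make the node interaction concrete, below is a minimal sketch of how an Agent node could exchange messages with the other nodes over ROS. The topic names, the flat message layout, and the policy call are assumptions made for illustration only, not SafeBench's actual interfaces.

```python
# Minimal sketch of an Agent node talking to the other SafeBench nodes over ROS.
# Topic names and the Float32MultiArray message layout are assumptions for
# illustration only, not SafeBench's actual interface.
import rospy
from std_msgs.msg import Float32MultiArray


class AgentNode:
    def __init__(self, policy):
        self.policy = policy  # trained RL policy: observation -> [acceleration, steering]
        rospy.init_node("agent_node")
        # Control commands consumed by the Ego vehicle node (hypothetical topic name).
        self.control_pub = rospy.Publisher("/safebench/ego_control",
                                           Float32MultiArray, queue_size=1)
        # Observations produced by the Scenario node (hypothetical topic name).
        rospy.Subscriber("/safebench/observation", Float32MultiArray, self.on_observation)

    def on_observation(self, msg):
        obs = list(msg.data)            # e.g., the 4D state described in Section 4.1
        action = self.policy(obs)       # [acceleration, steering]
        self.control_pub.publish(Float32MultiArray(data=action))


if __name__ == "__main__":
    AgentNode(policy=lambda obs: [0.0, 0.0])  # dummy policy, for the sketch only
    rospy.spin()                              # process callbacks until shutdown
```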

3.2 Safety-critical testing scenarios

Figure 2: Example route variants in the Turning Obstacle scenario, differing in the number of lanes (2-lane vs. 3-lane road) and the surrounding buildings.

In this section, we first define the safety-critical traffic testing scenarios considered in this work, which comprise the 8 most representative and challenging pre-crash driving scenarios (Najm et al., 2007) summarized by the National Highway Traffic Safety Administration (NHTSA).

In addition, for each scenario, we design ten diverse driving routes that vary in terms of surrounding environments, number of lanes, road signs, etc.

Pre-crash safety-critical scenarios.

We show the 8 pre-crash scenarios in the right part of Figure 1. In each scenario, the ego vehicle needs to drive along a pre-defined route and react to emergencies that occur on the road while driving. Throughout the process, the ego vehicle should follow the traffic rules and avoid potential car accidents. We provide more detailed scenario definitions and descriptions in Section A.2.

Driving routes.

In practice, a driving scenario may involve many variants. For instance, small changes in the vehicle location or in the surrounding environment may lead to big changes in vehicle decision-making. In order to provide a more comprehensive safety evaluation, we design 10 driving routes for each safety-critical scenario. Each driving route has a sequence of pre-defined waypoints. Different driving routes of the same scenario may have a different number of lanes, different scenes (e.g., intersections, T-junctions, bridges, etc.), or different road signs, which restrict the vehicle behaviors in different ways. We show example route variants of Turning Obstacle in Figure 2. We also provide more detailed examples of variants of other scenarios in Section A.2.

3.3 Safety-critical scenario generation algorithms

In this section, we detail how we collect and optimize safety-critical testing scenarios using different generation algorithms. Specifically, for each driving route mentioned above, we develop 4 algorithms to generate various testing samples. These generation algorithms mainly fall into two categories: adversary-based generation and knowledge-based generation.

3.3.1 Adversary-based generation

The state-of-the-art adversarial generation algorithms usually consist of two components: the scenario generator and the victim model (i.e., the ego vehicle or tested AD agent). Existing adversarial generation frameworks adopt different strategies to manipulate traffic scenarios, such as perturbing the position of surrounding vehicles (SVs) or forcing a cyclist to take an adversarial action, so that the victim model crashes into SVs and fails in the generated scenario. To examine the safety and robustness of the tested AD agent against such adversarial scenarios, we select two representative algorithms as follows. (1) Learning-to-collide (LC) (Ding et al., 2020a) is a black-box algorithm that optimizes the initial poses of a cyclist to attack the AD algorithm. Following the default setting, we formulate the traffic scenarios as a series of auto-regressive building blocks and obtain the generated scenarios by sampling from the joint distribution of these blocks. The policy gradient method REINFORCE (Williams, 1992) is used to solve the scenario optimization problem. In LC, the authors only focus on generating the Turning Obstacle scenario, so we adapt the method to all the scenarios and generate different initial conditions for all the driving routes. (2) AdvSim (AS) (Wang et al., 2021a) directly manipulates existing trajectories to perturb the driving paths of SVs, posing dangers to the tested AD agent. We follow the default setting and use the kinematic bicycle model (Polack et al., 2017) to represent and compute the full trajectories of SVs. Based on the results obtained by interacting with the driving environment, we optimize the trajectory parameters using the black-box search algorithm Bayesian optimization (Srinivas et al., 2010; Ru et al., 2019). Similarly, in our experiments, we generate adversarial trajectories for all the route variants.
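For reference, the kinematic bicycle model used by AS to roll out an SV trajectory from a control sequence can be sketched as follows; the wheelbase split and time step are illustrative values we assume here, not the exact AdvSim configuration.

```python
import math

def bicycle_rollout(x, y, yaw, v, controls, lf=1.3, lr=1.4, dt=0.1):
    """Roll out the kinematic bicycle model (Polack et al., 2017) from an initial
    state (x, y, yaw, v) given a sequence of (acceleration, steering) controls.
    lf/lr (distances from the center of gravity to the axles) and dt are
    illustrative values, not the exact AdvSim configuration."""
    trajectory = [(x, y, yaw, v)]
    for acc, steer in controls:
        beta = math.atan(lr / (lf + lr) * math.tan(steer))  # slip angle at the CoG
        x += v * math.cos(yaw + beta) * dt
        y += v * math.sin(yaw + beta) * dt
        yaw += v / lr * math.sin(beta) * dt
        v += acc * dt
        trajectory.append((x, y, yaw, v))
    return trajectory

# Example: a surrounding vehicle accelerating gently while steering slightly left.
waypoints = bicycle_rollout(0.0, 0.0, 0.0, 5.0, [(1.0, 0.05)] * 20)
```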

3.3.2 Knowledge-based generation

In the physical world, driving scenarios need to satisfy traffic rules and physical laws. Scenarios generated by adversarial algorithms, however, sometimes violate these rules. Therefore, we consider generation algorithms that integrate domain knowledge into the generation process, and we select two representative algorithms as follows. (1) Carla Scenario Generator (CS) (Contributors, 2019) is a module built on the Carla simulator (Dosovitskiy et al., 2017) that uses rule-based methods to construct testing scenarios. Following the standard process, we adopt the rules and use grid search to generate safety-critical scenario parameters for all the traffic scenarios. (2) Adversarial Trajectory Optimization (AT) (Zhang et al., 2022) uses explicit knowledge as constraints to guide the scenario optimization process. We adopt the same constraints that need to be satisfied and use the default PSO-based (Poli et al., 2007) black-box optimization to generate all kinds of testing scenarios in SafeBench.

3.4 Evaluation metrics

In this section, we introduce the evaluation metrics used in SafeBench. Specifically, we evaluate the performance of AD algorithms on 3 levels: Safety level, Functionality level, and Etiquette level. Within each level, we design several metrics focusing on different aspects. Finally, an overall score is calculated as a weighted sum of all the evaluation metrics introduced below.

Safety level

To evaluate the safety of given AD algorithms, we consider evaluation metrics focusing on serious violations of traffic rules: collision rate (CR), frequency of running red lights (RR), frequency of running stop signs (SS), and average distance driven out of road (OR). Formally, we denote a scenario trajectory by $\tau$, sampled from a scenario distribution $P$; the collisions that happen in one scenario when testing the AD algorithm are denoted $N_{\mathrm{col}}(\tau)$. Similarly, we obtain the number of running red lights $N_{\mathrm{red}}(\tau)$, running stop signs $N_{\mathrm{stop}}(\tau)$, and the distance driven out of road $d_{\mathrm{out}}(\tau)$. The metrics are concretely calculated as $\mathrm{CR}=\mathbb{E}_{\tau\sim P}[N_{\mathrm{col}}(\tau)]$, $\mathrm{RR}=\mathbb{E}_{\tau\sim P}[N_{\mathrm{red}}(\tau)]$, $\mathrm{SS}=\mathbb{E}_{\tau\sim P}[N_{\mathrm{stop}}(\tau)]$, and $\mathrm{OR}=\mathbb{E}_{\tau\sim P}[d_{\mathrm{out}}(\tau)]$.
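In practice these expectations reduce to empirical averages over the set of testing scenarios. The snippet below illustrates this for the safety-level metrics; the per-scenario record fields are hypothetical names, not SafeBench's internal data format.

```python
# Sketch: safety-level metrics as empirical averages over testing scenarios.
# The per-scenario record fields are hypothetical names used for illustration.
from statistics import mean

records = [  # one entry per tested scenario (dummy values)
    {"collisions": 1, "red_lights": 0, "stop_signs": 0, "out_of_road_m": 0.0},
    {"collisions": 0, "red_lights": 1, "stop_signs": 0, "out_of_road_m": 3.2},
    {"collisions": 0, "red_lights": 0, "stop_signs": 1, "out_of_road_m": 0.0},
]

CR = mean(r["collisions"] for r in records)     # collision rate
RR = mean(r["red_lights"] for r in records)     # frequency of running red lights
SS = mean(r["stop_signs"] for r in records)     # frequency of running stop signs
OR = mean(r["out_of_road_m"] for r in records)  # average distance driven out of road
```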

Functionality level

In each testing scenario, the AD agent is expected to follow and complete a specific route. This level of evaluation metrics is used to measure the functional ability of AD agents to finish such a task. We develop metrics as follows: route following stability (RF), average percentage of route completion (Comp), and average time spent to complete the route (TS). To calculate RF, we use the average deviation $d_{\mathrm{dev}}(\tau)$ between the ego vehicle and the reference route during each test $\tau$. Then we calculate $\mathrm{RF}=\mathbb{E}_{\tau\sim P}\big[1-d_{\mathrm{dev}}(\tau)/D_{\max}\big]$, where $D_{\max}$ is a constant indicating the maximum deviation distance. More details of the settings for this constant and other parameters can be found in Section A.3. Comp is calculated as $\mathrm{Comp}=\mathbb{E}_{\tau\sim P}[r(\tau)]$, where $r(\tau)$ is the percentage of route completion of each testing scenario. TS is the average time spent to complete the routes successfully: $\mathrm{TS}=\mathbb{E}_{\tau\sim P}[t(\tau)]$, where $t(\tau)$ denotes the time cost of each testing scenario.

Etiquette level

In practice, driver etiquette is an indicator of the driving skills of AD algorithms. Here we consider 3 metrics accordingly: average acceleration (ACC), average yaw velocity (YV), and frequency of lane invasion (LI). Similarly, these metrics are calculated as expectations over all testing scenarios: $\mathrm{ACC}=\mathbb{E}_{\tau\sim P}[a(\tau)]$, $\mathrm{YV}=\mathbb{E}_{\tau\sim P}[v_{\mathrm{yaw}}(\tau)]$, and $\mathrm{LI}=\mathbb{E}_{\tau\sim P}[N_{\mathrm{lane}}(\tau)]$, where $a(\tau)$, $v_{\mathrm{yaw}}(\tau)$, and $N_{\mathrm{lane}}(\tau)$ denote the average acceleration, average yaw velocity, and number of lane invasions in scenario $\tau$, respectively.

Overall score

To obtain an evaluation overview of the quality of AD algorithms, we aggregate all the metrics and report an overall score (OS), which is a weighted sum of the metrics introduced above. Specifically, the overall score is calculated as $\mathrm{OS}=\sum_i w_i\,\hat{m}_i$, where $m_i$ is the $i$-th metric, $w_i$ is the corresponding weight, and the normalized metric $\hat{m}_i$ is defined as

$$\hat{m}_i = \begin{cases} m_i / c_i^{\max}, & \text{if a higher } m_i \text{ is better},\\ 1 - \min\!\big(m_i / c_i^{\max},\, 1\big), & \text{otherwise}, \end{cases} \qquad (1)$$

where $c_i^{\max}$ is a constant indicating the maximum allowed value of $m_i$. More details of the constant and weight selection are in Section A.3.
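As a minimal sketch of the aggregation, the code below normalizes each raw metric by its maximum allowed value from Table 5 and forms the weighted sum; the normalization follows our reading of Eq. (1) (clip at the maximum value and invert metrics for which lower is better) and should be treated as an assumption rather than the exact SafeBench implementation.

```python
# Sketch of the overall-score aggregation; the normalization mirrors our reading
# of Eq. (1) and is an assumption, not the exact SafeBench implementation.
MAX_VALUE = {"CR": 1, "RR": 1, "SS": 1, "OR": 50, "RF": 1,
             "Comp": 1, "TS": 60, "ACC": 8, "YV": 3, "LI": 20}        # Table 5
WEIGHT = {"CR": 0.495, "RR": 0.099, "SS": 0.099, "OR": 0.099, "RF": 0.050,
          "Comp": 0.050, "TS": 0.050, "ACC": 0.020, "YV": 0.020, "LI": 0.020}
HIGHER_IS_BETTER = {"RF", "Comp"}  # all other metrics measure violations/costs

def normalize(name, value):
    scaled = min(value / MAX_VALUE[name], 1.0)         # clip at the max allowed value
    return scaled if name in HIGHER_IS_BETTER else 1.0 - scaled

def overall_score(metrics):
    """metrics: dict mapping metric name -> raw value, e.g. {"CR": 0.6, "RF": 0.9, ...}."""
    return sum(WEIGHT[name] * normalize(name, value) for name, value in metrics.items())
```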

4 Benchmark evaluation on SafeBench

In this section, we will first introduce the AD algorithms to be tested, which are based on different input state types, then illustrate our testing scenario generation and selection details, followed by our comprehensive benchmark results and the corresponding observations and findings.

4.1 AD algorithms tested on SafeBench

We test various types of algorithms on the safety-critical scenarios in SafeBench. We particularly focus on reinforcement learning-based self-driving methods, since they require minimal domain knowledge of the overall system and driving scenarios (Sallab et al., 2017; Chen et al., 2019, 2021; Kiran et al., 2021). One only needs to specify the reward function, action space, and state space, train the agent by interacting with the scenarios, and finally obtain a self-driving agent with reasonable performance. The reward function is given by a linear combination of the route following bonus, the collision penalty, the speeding penalty, and the energy consumption penalty. The action space is specified by the steering and throttle of the vehicle. See details in Section A.4.

We select 4 representative deep RL methods for evaluation, including a stochastic on-policy algorithm – Proximal Policy Optimization (PPO) (Schulman et al., 2017), a stochastic off-policy method – Soft Actor Critic (SAC) (Haarnoja et al., 2018), and two deterministic off-policy approaches – Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2015) and Twin Delayed DDPG (TD3) (Fujimoto et al., 2018). To encourage the diversity of evaluation agents, we vary the state space to equip them with different perceptual capabilities. We design 4 state spaces based on previous works (Chen et al., 2019, 2021), as follows; the detailed model architecture and hyperparameters are presented in Section A.4.


  • 4D. The basic observation type contains only 4 dimensions of observation: the distance to the waypoint, the longitudinal speed, the angular speed, and a front-vehicle detection signal (a possible encoding of this state is sketched after this list).

  • 4D+Dir. For a more complex observation type, we add additional dimensions of observation: a command (turn left, turn right, or go straight) and vectors that represent the directions of the ego vehicle, the current waypoint, and the target waypoint.

  • 4D+BEV. We render the ego vehicle’s local semantic map as a bird’s-eye view (BEV) image, where vehicles are represented by boxes, and lanes and routes are represented by line segments. We incorporate the BEV image together with the 4-dimensional state to form this observation type.

  • 4D+Cam. This observation type combines the 4D state with an image captured by the front camera.
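For concreteness, one possible layout of the basic 4D observation is sketched below; the field names and ordering are assumptions for illustration, not the exact SafeBench encoding.

```python
# Sketch of assembling the basic 4D observation; field names and ordering are
# assumptions for illustration, not the exact SafeBench encoding.
def build_4d_observation(ego):
    return [
        ego["dist_to_waypoint"],                         # distance to the waypoint (m)
        ego["longitudinal_speed"],                       # longitudinal speed (m/s)
        ego["angular_speed"],                            # angular (yaw) speed (rad/s)
        1.0 if ego["front_vehicle_detected"] else 0.0,   # front-vehicle detection signal
    ]

obs = build_4d_observation({"dist_to_waypoint": 4.2, "longitudinal_speed": 6.5,
                            "angular_speed": 0.01, "front_vehicle_detected": True})
```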

4.2 Driving scenarios for testing

Scenario generation. We apply the 4 safety-critical scenario generation algorithms to the 8 template scenarios, each of which contains 10 diverse driving routes. For each generation algorithm, we keep the generated testing scenarios based on their quality (more detailed statistics in Section A.1). Thus, in total, we generate 3,140 testing scenarios for evaluation. We note that some scenario generation algorithms require a surrogate model to search for effective safety-critical configurations. For instance, we follow the setup of LC (Ding et al., 2020a) to train a surrogate SAC model on random benign scenarios. More details about the surrogate model can be found in Section A.5.

Figure 3: Collision statistics of generated scenarios before scenario selection. Red bars represent the selected ones with high collision rate. Green bars represent the unselected scenarios with low collision rate.

Scenario selection. After collecting the raw testing scenarios, we select scenarios with the desired properties. Specifically, we test all the generated scenarios on the AD algorithms with the basic observation type and select the scenarios that cause the most collisions. In Figure 3, we show a histogram of the collision distribution. We only keep scenarios that cause collisions for a minimum number of the tested algorithms, shown in red in Figure 3. The selected testing scenarios have high transferability across AD algorithms and high risk levels, which further improves both the effectiveness and efficiency of AD evaluation. After the selection, we obtain 2,352 testing scenarios in total. More details can be found in Section A.1.
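In code, this selection step amounts to a simple filter over per-scenario collision counts, as sketched below; the collision-count threshold and the run_scenario helper are placeholders, since the exact criterion is configuration-dependent.

```python
# Sketch of the scenario-selection step: keep scenarios that cause collisions for
# at least `min_agents` of the tested AD algorithms. The threshold and the
# run_scenario helper are placeholders for illustration.
def select_scenarios(scenarios, agents, run_scenario, min_agents=2):
    selected = []
    for scenario in scenarios:
        # run_scenario(agent, scenario) -> True if a collision occurred during the test
        collisions = sum(run_scenario(agent, scenario) for agent in agents)
        if collisions >= min_agents:
            selected.append(scenario)
    return selected
```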

Metric Algo. Traffic Scenarios Avg.
Straight Obstacle Turning Obstacle Lane Changing Vehicle Passing Red-light Running Unprotected Left-turn Right- turn Crossing Negotiation
CR LC 0.320 0.140 0.560 0.920 0.410 0.630 0.458 0.470 0.489
AS 0.570 0.350 0.650 0.900 0.600 0.820 0.520 0.550 0.620
CS 0.610 0.630 0.322 0.900 0.767 0.756 0.667 0.711 0.670
AT 0.680 0.310 0.700 0.930 1.000 0.850 0.500 0.900 0.734
S-CR LC 0.756 0.923 0.560 0.919 0.833 0.870 0.661 0.793 0.789
AS 0.794 0.595 0.650 0.900 0.833 0.930 0.792 0.797 0.787
CS 0.967 0.684 0.322 0.900 0.932 0.870 0.711 0.797 0.773
AT 0.847 0.485 0.697 0.930 1.000 0.966 0.562 1.000 0.811
OS LC 0.765 0.825 0.613 0.451 0.755 0.632 0.630 0.646 0.665
AS 0.654 0.718 0.577 0.465 0.659 0.544 0.599 0.606 0.603
CS 0.629 0.577 0.738 0.464 0.569 0.571 0.520 0.522 0.574
AT 0.600 0.737 0.557 0.455 0.460 0.526 0.607 0.423 0.546
S-OS LC 0.565 0.461 0.613 0.451 0.533 0.518 0.528 0.476 0.518
AS 0.548 0.600 0.577 0.465 0.535 0.492 0.451 0.480 0.518
CS 0.465 0.550 0.738 0.464 0.483 0.519 0.496 0.473 0.524
AT 0.523 0.654 0.558 0.455 0.460 0.471 0.574 0.372 0.508
SR LC 0.410 0.130 1.000 0.990 0.420 0.690 0.590 0.580 0.601
AS 0.680 0.420 1.000 1.000 0.720 0.860 0.530 0.640 0.731
CS 0.600 0.760 1.000 1.000 0.822 0.856 0.922 0.878 0.855
AT 0.590 0.330 0.990 1.000 1.000 0.870 0.890 0.900 0.821
Table 1: Statistics of scenario generation/selection. We report collision rate (CR) before and after scenario selection (S-CR) to measure the effectiveness of different scenario generation algorithms. The overall score (OS) before and after scenario selection (S-OS) are used to demonstrate the safety-critical scenario generation capability of different algorithms. The selection rate (SR) is reported to evaluate the transferability of generation algorithms across AD agents. The last column shows the average over all the scenarios, with bold numbers indicating the best performance among the generation algorithms. LC: Learning-to-collide, AS: AdvSim, CS: Carla Scenario Generator, AT: Adversarial Trajectory Optimization, ↑/↓: higher/lower the better.

Analysis of generation algorithms and testing scenarios. We analyze the properties of the scenario generation algorithms based on a range of metrics, including the collision rate (CR), overall score (OS), and the overall selection rate (SR) for each scenario before and after selection (full statistics in Section A.6). As shown in Table 1, first, the scenario selection process indeed helps to improve the CR of the testing scenarios and induce more safety-critical ones, with the largest improvement (from 0.489 to 0.789) observed for LC. Second, AT is the most effective algorithm for causing both high CR and low OS: 73.4% of the scenarios generated by AT cause collisions for the surrogate model, and this increases to 81.1% after scenario selection; the scenarios generated by AT achieve an OS of 0.546, which further decreases to 0.508 after scenario selection, indicating their testing effectiveness. Third, regarding the overall SR of different algorithms, scenarios generated by CS achieve the highest SR, which means CS is the best algorithm in terms of transferability across different AD algorithms. Specifically, 85.5% of the scenarios generated by CS successfully cause collisions for other unseen AD agents. Finally, among the different scenarios, Vehicle Passing is the most difficult, with the highest CR and lowest OS.

4.3 Benchmark results

Model Traffic Scenarios Avg. Avg.
Straight Obstacle Turning Obstacle Lane Changing Vehicle Passing Red-light Running Unprotected Left-turn Right- turn Crossing Negotiation Benign Safety- critical
DDPG (4D) 0.545 0.526 0.440 0.501 0.611 0.444 0.411 0.507 0.603 0.498
SAC (4D) 0.533 0.474 0.577 0.471 0.482 0.501 0.503 0.432 0.833 0.497
TD3 (4D) 0.479 0.596 0.477 0.592 0.532 0.525 0.459 0.482 0.830 0.518
PPO (4D) 0.761 0.611 0.426 0.432 0.755 0.728 0.605 0.655 0.819 0.622
Table 2: The performance of AD algorithms on SafeBench. We report the average overall score (OS) on testing scenarios generated by all the scenario generation algorithms with driving route variations. Benign indicates the performance of AD algorithms tested on normal driving scenarios. The last two columns show the OS averaged over all benign and safety-critical traffic scenarios.

We train our AD algorithms on random benign scenarios and evaluate them on SafeBench. We present the training details in Section A.5 and we provide important findings in the following.

Performance of AD on benign and safety-critical scenarios. The benchmark results of AD algorithms based on 4D inputs are summarized in Table 2. We put more details of the benign scenarios in Section A.2. From Table 2, we observe a large performance gap for AD algorithms tested on benign versus safety-critical scenarios in SafeBench. For example, although TD3 achieves an overall score of 0.830 on benign scenarios, it only achieves 0.518 when tested on safety-critical scenarios. In general, agents that perform well in benign scenarios usually fail on the safety-critical ones, indicating a trade-off between performance under benign and safety-critical testing scenarios. For instance, PPO obtains the highest overall score on safety-critical scenarios, while its benign performance is worse than both SAC and TD3. On the other hand, although SAC achieves the highest overall score on benign testing scenarios, its performance under safety-critical ones is the worst. More results on algorithms with other types of input observations can be found in Section A.7.

Comprehensive diagnostic report of AD algorithms in all scenarios. In order to provide a comprehensive understanding of the performance of AD algorithms, we provide a detailed diagnostic report for each tested algorithm from different perspectives. In particular, we consider three levels of evaluation metrics: Safety, Functionality, and Etiquette, as shown in Table 3 for the 4D-based AD agents. Comprehensive reports for all AD agents are in Section A.8. We observe that different AD algorithms outperform others under different metrics. For instance, on the Safety level, PPO achieves the lowest CR and OR, which means it has a high level of safety and a low accident rate, while its performance on the Etiquette level is relatively low. On the Functionality level, TD3 achieves the highest route following stability, demonstrating its ability to complete given tasks without deviating from the route. On the Etiquette level, SAC and DDPG achieve the lowest ACC and YV, respectively, which measure the driving quality. Based on the overall score (OS), PPO is shown to be the best AD algorithm given the weighted average over all metrics.

We also notice a trade-off between functionality-level and safety-level metrics. From Table 3, we can observe that an agent with strong functionality performance may not be safe with respect to the safety-level metrics. For instance, the SAC agent achieves the best TS score, which means that it can finish the routes in the shortest time, but its collision rate (CR) is also the highest among all the agents. Similarly, the PPO agent that achieves the best route completion (Comp) score presents, however, the highest RR and SS scores, which means that it runs red lights and stop signs most frequently. This observation suggests an inherent contradiction between some safety metrics and functionality metrics, which has also been observed in previous studies (Ray et al., 2019; Liu et al., 2020, 2022).

Model Safety Level Functionality Level Etiquette Level OS
CR RR SS OR RF Comp TS ACC YV LI
DDPG (4D) 0.780 0.089 0.087 12.619 0.504 0.466 20.860 2.488 0.405 5.764 0.489
SAC (4D) 0.829 0.216 0.146 3.115 0.882 0.648 16.827 1.830 0.704 2.580 0.499
TD3 (4D) 0.783 0.231 0.141 2.535 0.903 0.670 17.644 2.680 1.493 2.545 0.516
PPO (4D) 0.603 0.287 0.150 0.099 0.901 0.751 18.021 2.461 1.506 3.528 0.606
Table 3: Diagnostic report. We test every AD algorithm on all selected testing scenarios and report the evaluation results on three different levels. CR: collision rate, RR: frequency of running red lights, SS: frequency of running stop signs, OR: average distance driven out of road, RF: route following stability, Comp: average percentage of route completion, TS: average time spent to complete the route, ACC: average acceleration, YV: average yaw velocity, LI: frequency of lane invasion, OS: overall score, ↑/↓: higher/lower the better.

5 Conclusion

In this paper, we introduce SafeBench, the first unified platform to automatically evaluate and analyze the performance of AD algorithms in multiple aspects using various safety-critical driving scenarios generated by different generation algorithms. We incorporate safety-critical scenarios and evaluation metrics from different levels to provide a detailed diagnostic report for each AD agent. AD algorithms tested on SafeBench have a large performance drop compared to evaluations on benign scenarios, suggesting the deficiencies of each algorithm and the effectiveness of our testing platform. We hope our platform and findings will serve as a reliable and comprehensive benchmark to help researchers and practitioners to identify weaknesses in existing AD systems and further develop safe AD algorithms as well as more effective testing scenario generation algorithms.

References

  • N. H. T. S. Administration (2017) Automated driving systems 2.0: a vision for safety. Washington, DC: US Department of Transportation, DOT HS 812, pp. 442. Cited by: §1.
  • F. Agostinelli, S. McAleer, A. Shmakov, and P. Baldi (2019) Solving the rubik’s cube with deep reinforcement learning and search. Nature Machine Intelligence 1 (8), pp. 356–363. Cited by: §1.
  • M. Arief, Z. Huang, G. K. S. Kumar, Y. Bai, S. He, W. Ding, H. Lam, and D. Zhao (2020) Deep probabilistic accelerated evaluation: a certifiable rare-event simulation methodology for black-box autonomy. arXiv preprint arXiv:2006.15722. Cited by: §1.
  • G. Bagschik, T. Menzel, and M. Maurer (2018) Ontology based scene creation for the development of automated vehicles. In 2018 IEEE Intelligent Vehicles Symposium (IV), pp. 1813–1820. Cited by: §2.
  • J. Bucklew (2013) Introduction to rare event simulation. Springer Science & Business Media. Cited by: §1.
  • [6] (2022) California Department of Motor Vehicle Disengagement Report. Note: https://www.dmv.ca.gov/portal/vehicle-industry-services/autonomous-vehicles/disengagement-reports/[Online] Cited by: §1.
  • A. Calò, P. Arcaini, S. Ali, F. Hauer, and F. Ishikawa (2020) Generating avoidable collision scenarios for testing autonomous driving systems. In 2020 IEEE 13th International Conference on Software Testing, Validation and Verification (ICST), pp. 375–386. Cited by: §1.
  • Y. Cao, N. Wang, C. Xiao, D. Yang, J. Fang, R. Yang, Q. Chen, M. Liu, and B. Li (2021) Invisible for both camera and lidar: security of multi-sensor fusion based perception in autonomous driving under physical-world attacks. In 2021 IEEE Symposium on Security and Privacy (SP), Vol. , Los Alamitos, CA, USA, pp. 1302–1320. External Links: ISSN 2375-1207, Document, Link Cited by: §1.
  • Y. Cao, N. Wang, C. Xiao, D. Yang, J. Fang, Q. A. Chen, and B. Li (2020) 3D adversarial object against msf-based perception in autonomous driving. MLSys. Cited by: §1.
  • N. Carlini and D. Wagner (2018) Audio adversarial examples: targeted attacks on speech-to-text. In 2018 IEEE Security and Privacy Workshops (SPW), pp. 1–7. Cited by: §1.
  • J. Chen, S. E. Li, and M. Tomizuka (2021) Interpretable end-to-end urban autonomous driving with latent deep reinforcement learning. IEEE Transactions on Intelligent Transportation Systems. Cited by: §4.1, §4.1.
  • J. Chen, B. Yuan, and M. Tomizuka (2019) Model-free deep reinforcement learning for urban autonomous driving. In 2019 IEEE intelligent transportation systems conference (ITSC), pp. 2765–2771. Cited by: §4.1, §4.1.
  • S. R. Contributors (2019) Carla Scenario Runner. GitHub. Note: https://github.com/carla-simulator/scenario_runner Cited by: §3.3.2.
  • W. Ding, B. Chen, B. Li, K. J. Eun, and D. Zhao (2021a) Multimodal safety-critical scenarios generation for decision-making algorithms evaluation. IEEE Robotics and Automation Letters 6 (2), pp. 1551–1558. Cited by: §2.
  • W. Ding, B. Chen, M. Xu, and D. Zhao (2020a) Learning to collide: an adaptive safety-critical scenarios generating method. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2243–2250. Cited by: §3.3.1, §4.2.
  • W. Ding, B. Li, K. J. Eun, and D. Zhao (2021b) Semantically controllable scene generation with guidance of explicit knowledge. arXiv preprint arXiv:2106.04066. Cited by: §2.
  • W. Ding, W. Wang, and D. Zhao (2018) A new multi-vehicle trajectory generator to simulate vehicle-to-vehicle encounters. arXiv preprint arXiv:1809.05680. Cited by: §2.
  • W. Ding, M. Xu, and D. Zhao (2020b) Cmts: a conditional multiple trajectory synthesizer for generating safety-critical driving scenarios. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 4314–4321. Cited by: §2.
  • A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun (2017) CARLA: an open urban driving simulator. In Conference on robot learning, pp. 1–16. Cited by: §1, §3.1, §3.3.2.
  • I. Evtimov, K. Eykholt, E. Fernandes, T. Kohno, B. Li, A. Prakash, A. Rahmati, and D. Song (2017) Robust physical-world attacks on machine learning models. arXiv preprint arXiv:1707.08945. Cited by: §1.
  • S. Feng, X. Yan, H. Sun, Y. Feng, and H. X. Liu (2021) Intelligent driving intelligence test for autonomous vehicles with naturalistic and adversarial environment. Nature communications 12 (1), pp. 1–14. Cited by: §2.
  • S. Fujimoto, H. Hoof, and D. Meger (2018) Addressing function approximation error in actor-critic methods. In International conference on machine learning, pp. 1587–1596. Cited by: §4.1.
  • I. J. Goodfellow, J. Shlens, and C. Szegedy (2015) Explaining and harnessing adversarial examples. In International Conference on Learning Representations (ICLR), Cited by: §1.
  • T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pp. 1861–1870. Cited by: §4.1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2015) Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pp. 1026–1034. Cited by: §1.
  • L. Huang, A. D. Joseph, B. Nelson, B. I. Rubinstein, and J. Tygar (2011) Adversarial machine learning. In Proceedings of the 4th ACM workshop on Security and Artificial Intelligence, pp. 43–58. Cited by: §1.
  • S. Huang, N. Papernot, I. Goodfellow, Y. Duan, and P. Abbeel (2017a) Adversarial attacks on neural network policies. arXiv preprint arXiv:1702.02284. Cited by: §1.
  • Z. Huang, M. Arief, H. Lam, and D. Zhao (2018a) Synthesis of different autonomous vehicles test approaches. In 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pp. 2000–2005. Cited by: §1.
  • Z. Huang, Y. Guo, M. Arief, H. Lam, and D. Zhao (2018b) A versatile approach to evaluating and testing automated vehicles based on kernel methods. In 2018 Annual American Control Conference (ACC), pp. 4796–4802. Cited by: §1.
  • Z. Huang, H. Lam, and D. Zhao (2017b) Sequential experimentation to efficiently test automated vehicles. In Proceedings of the 2017 Winter Simulation Conference (WSC), pp. 3078–3089. Cited by: §1.
  • Z. Huang, H. Lam, and D. Zhao (2018c) Rare-event simulation without structural information: a learning-based approach. In 2018 Winter Simulation Conference (WSC), pp. 1826–1837. Cited by: §1.
  • B. R. Kiran, I. Sobh, V. Talpaert, P. Mannion, A. A. Al Sallab, S. Yogamani, and P. Pérez (2021) Deep reinforcement learning for autonomous driving: a survey. IEEE Transactions on Intelligent Transportation Systems. Cited by: §4.1.
  • C. Knies and F. Diermeyer (2020) Data-driven test scenario generation for cooperative maneuver planning on highways. Applied Sciences 10 (22), pp. 8154. Cited by: §2.
  • M. Kratsios (2019) The national artificial intelligence research and development strategic plan: 2019 update. National Science and Technology Council (US). Cited by: §1.
  • A. Kurakin, I. Goodfellow, and S. Bengio (2016) Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533. Cited by: §1.
  • T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: §4.1.
  • Z. Liu, Z. Guo, Z. Cen, H. Zhang, J. Tan, B. Li, and D. Zhao (2022) On the robustness of safe reinforcement learning under observational perturbations. arXiv preprint arXiv:2205.14691. Cited by: §4.3.
  • Z. Liu, H. Zhou, B. Chen, S. Zhong, M. Hebert, and D. Zhao (2020) Constrained model-based reinforcement learning with robust cross-entropy method. arXiv preprint arXiv:2010.07968. Cited by: §4.3.
  • M. Matheny, S. T. Israni, M. Ahmed, and D. Whicher (2019) Artificial intelligence in health care: the hope, the hype, the promise, the peril. Washington, DC: National Academy of Medicine. Cited by: §1.
  • W. G. Najm, J. D. Smith, M. Yanagisawa, et al. (2007) Pre-crash scenario typology for crash avoidance research. Technical report United States. National Highway Traffic Safety Administration. Cited by: §1, §3.2.
  • M. O’Kelly, A. Sinha, H. Namkoong, R. Tedrake, and J. C. Duchi (2018) Scalable end-to-end autonomous vehicle testing via rare-event simulation. Advances in neural information processing systems 31. Cited by: §1.
  • H. Peng (2020) Conducting the Mcity ABC Test: A Testing Method for Highly Automated Vehicles. Technical report Mcity. Cited by: §1.
  • P. Polack, F. Altché, B. d’Andréa-Novel, and A. de La Fortelle (2017) The kinematic bicycle model: a consistent model for planning feasible trajectories for autonomous vehicles?. In 2017 IEEE intelligent vehicles symposium (IV), pp. 812–818. Cited by: §3.3.1.
  • R. Poli, J. Kennedy, and T. Blackwell (2007) Particle swarm optimization. Swarm intelligence 1 (1), pp. 33–57. Cited by: §3.3.2.
  • A. Ray, J. Achiam, and D. Amodei (2019) Benchmarking safe exploration in deep reinforcement learning. arXiv preprint arXiv:1910.01708 7, pp. 1. Cited by: §4.3.
  • B. Ru, A. Cobb, A. Blaas, and Y. Gal (2019) Bayesopt adversarial attack. In International Conference on Learning Representations, Cited by: §3.3.1.
  • A. E. Sallab, M. Abdou, E. Perot, and S. Yogamani (2017) Deep reinforcement learning framework for autonomous driving. Electronic Imaging 2017 (19), pp. 70–76. Cited by: §4.1.
  • J. M. Scanlon, K. D. Kusano, T. Daniel, C. Alderson, A. Ogle, and T. Victor (2021) Waymo simulated driving behavior in reconstructed fatal crashes within an autonomous vehicle operating domain. Waymo. Cited by: §2.
  • J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §4.1.
  • D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis (2018) A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science 362 (6419), pp. 1140–1144. External Links: Document, ISSN 0036-8075, Link, https://science.sciencemag.org/content/362/6419/1140.full.pdf Cited by: §1.
  • N. Srinivas, A. Krause, S. Kakade, and M. Seeger (2010) Gaussian process optimization in the bandit setting: no regret and experimental design. In Proceedings of the 27th International Conference on International Conference on Machine Learning, pp. 1015–1022. Cited by: §3.3.1.
  • C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2014) Intriguing properties of neural networks. In International Conference on Learning Representations, Cited by: §1.
  • [53] (2020) Uber ATG enters two new collaborations with leading US research institutions. Medium. External Links: Link Cited by: §1.
  • E. Wallace, S. Feng, N. Kandpal, M. Gardner, and S. Singh (2019) Universal adversarial triggers for attacking and analyzing nlp. arXiv preprint arXiv:1908.07125. Cited by: §1.
  • J. Wang, A. Pun, J. Tu, S. Manivasagam, A. Sadat, S. Casas, M. Ren, and R. Urtasun (2021a) AdvSim: generating safety-critical scenarios for self-driving vehicles. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9909–9918. Cited by: §1, §3.3.1.
  • X. Wang, H. Krasowski, and M. Althoff (2021b) CommonRoad-rl: a configurable reinforcement learning environment for motion planning of autonomous vehicles. In 2021 IEEE International Intelligent Transportation Systems Conference (ITSC), pp. 466–472. Cited by: §2.
  • M. Wen, J. Park, and K. Cho (2020) A scenario generation pipeline for autonomous vehicle simulators. Human-centric Computing and Information Sciences 10 (1), pp. 1–15. Cited by: §1.
  • R. J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8 (3), pp. 229–256. Cited by: §3.3.1.
  • C. Xiang, C. R. Qi, and B. Li (2019) Generating 3d adversarial point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9136–9144. Cited by: §1.
  • X. Yuan, Y. Chen, Y. Zhao, Y. Long, X. Liu, K. Chen, S. Zhang, H. Huang, X. Wang, and C. A. Gunter (2018) Commandersong: a systematic approach for practical adversarial voice recognition. In 27th USENIX Security Symposium (USENIX Security 18), pp. 49–64. Cited by: §1.
  • Q. Zhang, S. Hu, J. Sun, Q. A. Chen, and Z. M. Mao (2022) On adversarial robustness of trajectory prediction for autonomous vehicles. arXiv preprint arXiv:2201.05057. Cited by: §2, §3.3.2.
  • D. Zhao, X. Huang, H. Peng, H. Lam, and D. J. LeBlanc (2018) Accelerated evaluation of automated vehicles in car-following maneuvers. IEEE Transactions on Intelligent Transportation Systems 19 (3), pp. 733–744. Cited by: §1.
  • D. Zhao, H. Lam, H. Peng, S. Bao, D. J. LeBlanc, K. Nobukawa, and C. S. Pan (2016) Accelerated evaluation of automated vehicles safety in lane-change scenarios based on importance sampling techniques. IEEE transactions on intelligent transportation systems 18 (3), pp. 595–607. Cited by: §1.
  • D. Zhao (2016) Accelerated Evaluation of Automated Vehicles. Ph.D. Thesis, University of Michigan, Ann Arbor. Cited by: §1.

Appendix A Appendix

A.1 SafeBench statistics

We present the statistics of the testing scenarios generated by each generation algorithm in Table 4. For each algorithm, we report the statistics both before and after scenario selection, where we only keep scenarios that have high transferability across AD algorithms. By applying the 4 generation algorithms, we obtain 3,140 testing scenarios in total, from which we select 2,352 testing scenarios for AD evaluation.

A.2 Definition of scenarios and examples of route variants

We first give detailed definitions of the 8 traffic scenarios considered in SafeBench, together with screenshots of them in Figure 4.

Straight Obstacle

The ego vehicle encounters an unexpected cyclist or pedestrian on the road and must perform an emergency brake or an avoidance maneuver. As shown in Figure 4(a), the vision of the ego vehicle is usually blocked by an obstacle, which is safety-critical since the reaction time left for the ego vehicle is very short.

Turning Obstacle

As shown in Figure 4(b), while turning at an intersection, the ego vehicle finds an unexpected cyclist or pedestrian on the road and must perform an emergency brake or an avoidance maneuver.

Lane Changing

In this scenario, the ego vehicle should perform a lane change to evade a leading vehicle, which is moving too slowly. In addition, there is another leading vehicle in the adjacent lane, which is traveling at a normal speed. The ego vehicle needs to avoid hitting both cars when overtaking. See Figure 4(c) for more details.

Vehicle Passing

The ego vehicle must go around a blocking object using the opposite lane, dealing with oncoming traffic. The ego vehicle should avoid colliding with both cars and also avoid driving outside the lane. We provide an example in Figure 4(d).

Red-light Running

When the ego vehicle is going straight at an intersection, a crossing vehicle runs a red light. The ego vehicle is forced to take actions to avoid potential collisions, as shown in Figure 4(e).

Unprotected Left-turn

As shown in Figure 4(f), the ego vehicle is performing an unprotected left turn at an intersection while there is a vehicle going straight in the opposite lane.

Right-turn

In this scenario, the ego vehicle is performing a right turn at an intersection, with a crossing vehicle in front. Collision avoidance actions must be taken to keep safe. We present an example in Figure 4(g).

Algo. Scenario Selection Traffic Scenarios Total
Straight Obstacle Turning Obstacle Lane Changing Vehicle Passing Red-light Running Unprotected Left-turn Right-turn Crossing Negotiation
LC Before 100 100 100 100 100 100 100 100 800
After 41 13 100 99 42 69 59 58 481
AS Before 100 100 100 100 100 100 100 100 800
After 68 42 100 100 72 86 53 64 585
CS Before 100 100 90 90 90 90 90 90 740
After 60 76 90 90 74 77 83 79 629
AT Before 100 100 100 100 100 100 100 100 800
After 59 33 99 100 100 87 89 90 657
Total Before 400 400 390 390 390 390 390 390 3140
After 228 164 389 389 288 319 284 291 2352
Table 4: Statistics of SafeBench testing scenarios.
Figure 4: Pre-crash scenarios: (a) Straight Obstacle, (b) Turning Obstacle, (c) Lane Changing, (d) Vehicle Passing, (e) Red-light Running, (f) Unprotected Left-turn, (g) Right-turn, (h) Crossing Negotiation.
Figure 5: Example route variants of scenario 3 (Lane Changing): (a) two-lane highway, (b) three-lane bridge, (c) three-lane bridge with a speed limit sign.
Figure 6: Example route variants of scenario 6 (Unprotected Left-turn): (a) single-lane T-junction without surrounding buildings, (b) single-lane T-junction with surrounding buildings, (c) two-lane intersection.
Figure 7: Example route variants of scenario 7 (Right-turn): (a) single-lane T-junction, (b) two-lane intersection, (c) two-lane T-junction.
Crossing Negotiation

In this scenario, the ego vehicle meets another crossing vehicle when passing an intersection with no traffic lights. As shown in Figure 4(h), the ego vehicle should negotiate with the other vehicle to cross the unsignalized intersection in an orderly and safe manner.

We also develop benign scenarios based on these safety-critical scenarios. In benign situations, everything is the same except that the other vehicles are auto-piloted. As a result, we have 8 kinds of benign scenarios, and we can compare the benign performance with the safety-critical one.

We show more examples of route variants incorporated in our evaluation platform in Figures 5, 6 and 7.

A.3 Evaluation metrics

Symbol Safety Level Functionality Level Etiquette Level
 CR RR SS OR RF Comp TS ACC YV LI
Max allowed value $c_i^{\max}$ 1 1 1 50 1 1 60 8 3 20
Weight $w_i$ 0.495 0.099 0.099 0.099 0.050 0.050 0.050 0.020 0.020 0.020
Table 5: Constants and weights used in SafeBench evaluation metrics.

We follow the equations introduced in Section 3.4 to calculate the evaluation metrics. Specifically, for route following stability, we first set the maximum deviation distance $D_{\max}$ and then calculate the expectation. For the other metrics, we directly calculate the expectation of each variable over the scenario distribution $P$. When calculating the overall score, we follow the maximum allowed values $c_i^{\max}$ and weights $w_i$ for each metric given in Table 5. The weight of each metric depends on its evaluation level. Metrics at the Safety Level are assigned the highest weights since they concern serious violations of traffic rules. Among the safety-level metrics, the weight of CR is 5 times larger than the others. The weights of metrics at the Functionality Level are one-half of the weights at the Safety Level, while the weights at the Etiquette Level are only one-fifth of them. Such a weight setup first emphasizes safety and then encourages the ego vehicle to complete the given tasks in a comfortable way.

A.4 Implementation details of AD algorithms

Reward function

During training, all RL algorithms share the same reward function. The reward is a weighted sum of several terms: weighted contributions from the longitudinal speed, the lateral acceleration, and the steering magnitude; a penalty if the ego vehicle encounters a collision or drives out of its lane; a penalty if the speed of the ego vehicle exceeds a threshold; and a constant reward term.
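A minimal sketch of such a reward is given below. Since the numeric weights, penalty values, and the speed threshold are not recoverable here, they appear as named placeholders, and the sign convention (speed rewarded, lateral acceleration and steering penalized) is our assumption.

```python
# Sketch of the shared training reward as a weighted sum of the terms listed above.
# All numeric weights/thresholds are placeholders, and the sign convention is an
# assumption, not SafeBench's actual values.
W_SPEED, W_LAT_ACC, W_STEER = 1.0, 1.0, 1.0   # placeholder weights
P_COLLISION, P_SPEEDING = -10.0, -1.0         # placeholder penalties
SPEED_THRESHOLD, R_CONST = 10.0, 0.1          # placeholder threshold (m/s) / constant

def reward(lon_speed, lat_acc, steer, collided, out_of_lane):
    r = W_SPEED * lon_speed - W_LAT_ACC * abs(lat_acc) - W_STEER * abs(steer) + R_CONST
    if collided or out_of_lane:
        r += P_COLLISION          # collision / out-of-lane penalty
    if lon_speed > SPEED_THRESHOLD:
        r += P_SPEEDING           # speeding penalty
    return r
```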

Action space

Similarly, the action space of every RL model is the same, consisting of an acceleration and a steering value, where the maximum and minimum allowed acceleration and the limit on the absolute value of the steering are fixed constants. After obtaining the acceleration and steering, we convert these values into Carla's vehicle control format, for which we compute the throttle and brake of the ego vehicle as

$$\text{throttle} = \max(a, 0), \qquad \text{brake} = \max(-a, 0), \tag{2}$$

where $a$ denotes the acceleration given by the RL model. Both throttle and brake are then clipped to the interval $[0, 1]$.
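The conversion can be written in a few lines. The sketch below follows the positive/negative-part convention of Eq. (2) as reconstructed above, with the steering limit left as a placeholder; the exact scaling used in SafeBench may differ.

```python
# Sketch: convert the RL action (acceleration, steering) into Carla-style control.
# Follows the positive/negative-part reading of Eq. (2); the steering limit is a
# placeholder and the exact scaling used in SafeBench may differ.
def to_carla_control(acceleration, steering, max_steer=0.3):
    throttle = min(max(acceleration, 0.0), 1.0)         # positive part, clipped to [0, 1]
    brake = min(max(-acceleration, 0.0), 1.0)           # negative part, clipped to [0, 1]
    steer = max(-max_steer, min(steering, max_steer))   # |steering| limited
    return {"throttle": throttle, "brake": brake, "steer": steer}
```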

Model architecture

The model used for the deep RL methods is a simple multi-layer perceptron with hidden layers of size [256, 256]. When adding bird's-eye view images or camera images to the input, we use a separate image encoder to extract image features. The encoder is trained end-to-end with the actor network of the RL models. We provide more details about the architecture of the image encoder in Table 6.

Layer Input Channels Output Channels Kernel Size Stride Padding
Convolution Layer 1 3 32 3 2 1
Convolution Layer 2 32 64 3 2 1
Max Pooling Layer 1 64 64 3 3 0
Convolution Layer 3 64 128 3 2 1
Convolution Layer 4 128 256 3 2 1
Max Pooling Layer 2 256 256 3 2 0
Fully Connected Layer 1 1024 512 - - -
Fully Connected Layer 2 512 256 - - -
Fully Connected Layer 3 256 128 - - -
Table 6: Model architecture of image encoder.
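For reference, the layer specification in Table 6 corresponds to an encoder along the lines of the following PyTorch sketch. The 256x256 input resolution (which yields the 1024-dimensional flattened feature expected by the first fully connected layer) and the ReLU activations are our assumptions.

```python
import torch
import torch.nn as nn


class ImageEncoder(nn.Module):
    """Image encoder following the layer specification in Table 6."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),    # Convolution Layer 1
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),   # Convolution Layer 2
            nn.MaxPool2d(kernel_size=3, stride=3),                              # Max Pooling Layer 1
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),  # Convolution Layer 3
            nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1), nn.ReLU(), # Convolution Layer 4
            nn.MaxPool2d(kernel_size=3, stride=2),                              # Max Pooling Layer 2
        )
        self.fc = nn.Sequential(
            nn.Linear(1024, 512), nn.ReLU(),  # Fully Connected Layer 1
            nn.Linear(512, 256), nn.ReLU(),   # Fully Connected Layer 2
            nn.Linear(256, 128),              # Fully Connected Layer 3
        )

    def forward(self, x):
        # x: (batch, 3, 256, 256) image tensor (assumed resolution)
        h = self.conv(x)
        return self.fc(torch.flatten(h, start_dim=1))
```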
DDPG hyperparameters

The policy learning rate is 0.0003 and the Q-value learning rate is 0.001. The standard deviation for Gaussian exploration noise added to the policy at training time is 0.1. The discount factor is 0.99. The number of models in the Q-ensemble critic is 1.

SAC hyperparameters

The policy learning rate and Q-value learning rate are set to be 0.001. The entropy regularization coefficient, which is equivalent to the inverse of the reward scale in the original SAC paper, is 0.1. The discount factor equals 0.99, and the number of models in the Q-ensemble critic is 2.

TD3 hyperparameters

The policy learning rate and Q-value learning rate are set to 0.001. The standard deviation of the Gaussian exploration noise added to the policy at training time is 0.1. The standard deviation of the target policy smoothing noise is 0.2, and the absolute value of the smoothing noise is limited to 0.5. The policy update delay is 2. The discount factor is 0.99. The number of models in the Q-ensemble critic is 2.

PPO hyperparameters

The policy learning rate is 0.0003 and the Q-value learning rate is 0.001. The clip ratio of the policy objective is 0.2. The target KL divergence is 0.01. We set both the actor and critic training iterations to 80. The discount factor is 0.99, and the number of interaction steps is 1000.
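For convenience, the hyperparameters listed above can be collected in a single configuration dictionary; the values are taken from the text, while the key names are ours.

```python
# Hyperparameters gathered from the paragraphs above; key names are ours.
RL_HYPERPARAMS = {
    "DDPG": {"policy_lr": 3e-4, "q_lr": 1e-3, "explore_noise_std": 0.1,
             "gamma": 0.99, "q_ensemble": 1},
    "SAC":  {"policy_lr": 1e-3, "q_lr": 1e-3, "entropy_coef": 0.1,
             "gamma": 0.99, "q_ensemble": 2},
    "TD3":  {"policy_lr": 1e-3, "q_lr": 1e-3, "explore_noise_std": 0.1,
             "smooth_noise_std": 0.2, "smooth_noise_clip": 0.5,
             "policy_delay": 2, "gamma": 0.99, "q_ensemble": 2},
    "PPO":  {"policy_lr": 3e-4, "q_lr": 1e-3, "clip_ratio": 0.2,
             "target_kl": 0.01, "train_iters": 80, "gamma": 0.99,
             "interaction_steps": 1000},
}
```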

A.5 Training details of AD algorithms

All four deep RL algorithms are trained in Carla Town03, which, according to Carla's official documentation, is the most complex town, with a 5-lane junction, a roundabout, uneven terrain, a tunnel, and more. The number of warm-up steps for the off-policy methods is 600. The interpolation factor in Polyak averaging for the target networks is 0.995. The number of training epochs differs across algorithms and input states (for example, SAC with the 4D+Cam input is trained for a different number of epochs than DDPG with the 4D input). We train our RL models on NVIDIA GeForce RTX 3090 GPUs, and training usually takes about 1 day. Each trained model reaches a stable episode reward of around 1500.

During scenario generation, we also train a SAC model with the 4D input state space as a surrogate model. The training process is the same as for the other models, except that we use a different random seed, which produces a different trained model.

A.6 Detailed scenario generation results

Metric  Algo.  Straight Obstacle  Turning Obstacle  Lane Changing  Vehicle Passing  Red-light Running  Unprotected Left-turn  Right-turn  Crossing Negotiation  Avg.
CR LC 0.320 0.140 0.560 0.920 0.410 0.630 0.458 0.470 0.489
AS 0.570 0.350 0.650 0.900 0.600 0.820 0.520 0.550 0.620
CS 0.610 0.630 0.322 0.900 0.767 0.756 0.667 0.711 0.670
AT 0.680 0.310 0.700 0.930 1.000 0.850 0.500 0.900 0.734
S-CR LC 0.756 0.923 0.560 0.919 0.833 0.870 0.661 0.793 0.789
AS 0.794 0.595 0.650 0.900 0.833 0.930 0.792 0.797 0.787
CS 0.967 0.684 0.322 0.900 0.932 0.870 0.711 0.797 0.773
AT 0.847 0.485 0.697 0.930 1.000 0.966 0.562 1.000 0.811
Comp LC 0.842 0.934 0.704 0.680 0.805 0.744 0.843 0.780 0.792
AS 0.713 0.928 0.649 0.673 0.740 0.646 0.827 0.762 0.742
CS 0.693 0.874 0.886 0.674 0.656 0.666 0.760 0.680 0.736
AT 0.681 0.938 0.595 0.652 0.535 0.644 0.817 0.583 0.681
S-Comp LC 0.631 0.559 0.704 0.679 0.601 0.647 0.771 0.631 0.653
AS 0.600 0.884 0.649 0.673 0.639 0.595 0.725 0.655 0.678
CS 0.521 0.866 0.886 0.674 0.582 0.614 0.740 0.640 0.690
AT 0.576 0.905 0.596 0.652 0.535 0.594 0.794 0.536 0.649
OS LC 0.765 0.825 0.613 0.451 0.755 0.632 0.630 0.646 0.665
AS 0.654 0.718 0.577 0.465 0.659 0.544 0.599 0.606 0.603
CS 0.629 0.577 0.738 0.464 0.569 0.571 0.520 0.522 0.574
AT 0.600 0.737 0.557 0.455 0.460 0.526 0.607 0.423 0.546
S-OS LC 0.565 0.461 0.613 0.451 0.533 0.518 0.528 0.476 0.518
AS 0.548 0.600 0.577 0.465 0.535 0.492 0.451 0.480 0.518
CS 0.465 0.550 0.738 0.464 0.483 0.519 0.496 0.473 0.524
AT 0.523 0.654 0.558 0.455 0.460 0.471 0.574 0.372 0.508
SR LC 0.410 0.130 1.000 0.990 0.420 0.690 0.590 0.580 0.601
AS 0.680 0.420 1.000 1.000 0.720 0.860 0.530 0.640 0.731
CS 0.600 0.760 1.000 1.000 0.822 0.856 0.922 0.878 0.855
AT 0.590 0.330 0.990 1.000 1.000 0.870 0.890 0.900 0.821
Table 7: Full statistics of scenario generation and selection.

We show the full scenario generation and selection statistics in Table 7. We note that we do not use any personal information since our experiments are based on the Carla simulation. In addition to the collision rate (CR), overall score (OS), and overall selection rate (SR), we also report the average percentage of route completion (Comp) for each scenario before and after selection to measure different algorithms' ability to influence task performance. We find that AT achieves the lowest Comp and S-Comp, which demonstrates its effectiveness in attacking the AD system's functionality.

A.7 Full benchmark results

We report the performance of all AD algorithms tested on SafeBench in Table 8. We train AD models with different input state spaces and evaluate their performance on both benign and safety-critical scenarios. Specifically, we provide the 4D input to all the AD algorithms, the 4D+Dir input to SAC, TD3, and PPO, and we additionally equip SAC and PPO with the 4D+BEV and 4D+Cam state spaces. As shown in the table, we first notice that a large performance gap between evaluation results on benign and safety-critical scenarios always exists regardless of the input information provided to the AD algorithm, which demonstrates that our testing scenarios generalize to algorithms with different inputs. Besides, similar to the results of algorithms with 4D input, we also observe the trade-off between performance on benign and safety-critical scenarios for the 4D+BEV and 4D+Cam input state spaces. For instance, when using 4D+Cam as the input state space, SAC obtains a better score on benign scenarios while PPO gets a higher score on safety-critical scenarios. Finally, among the different agents, PPO with 4D+BEV input achieves the best OS on SafeBench testing scenarios, which suggests promising directions for researchers to design their own model architectures and input state spaces.

State Space  Algo.  Straight Obstacle  Turning Obstacle  Lane Changing  Vehicle Passing  Red-light Running  Unprotected Left-turn  Right-turn  Crossing Negotiation  Avg. (Benign)  Avg. (Safety-critical)
4D DDPG 0.545 0.526 0.440 0.501 0.611 0.444 0.411 0.507 0.603 0.498
SAC 0.533 0.474 0.577 0.471 0.482 0.501 0.503 0.432 0.833 0.497
TD3 0.479 0.596 0.477 0.592 0.532 0.525 0.459 0.482 0.830 0.518
PPO 0.761 0.611 0.426 0.432 0.755 0.728 0.605 0.655 0.819 0.622
Dir SAC 0.608 0.591 0.670 0.435 0.624 0.548 0.552 0.522 0.752 0.569
TD3 0.728 0.543 0.499 0.451 0.665 0.595 0.645 0.590 0.848 0.590
PPO 0.506 0.526 0.601 0.428 0.558 0.474 0.487 0.568 0.628 0.518
BEV SAC 0.501 0.567 0.647 0.446 0.486 0.521 0.449 0.434 0.840 0.506
PPO 0.818 0.632 0.555 0.393 0.918 0.664 0.729 0.847 0.731 0.694
Cam SAC 0.634 0.570 0.436 0.427 0.481 0.529 0.527 0.425 0.812 0.504
PPO 0.542 0.503 0.407 0.425 0.928 0.519 0.579 0.808 0.613 0.589
Table 8: The performance of all AD algorithms tested on SafeBench. We evaluate algorithms using different state spaces. We report the average overall score (OS) on testing scenarios generated by all the scenario generation algorithms with driving route variations. Benign indicates the performance of AD algorithms tested on normal driving scenarios. The last two columns show the OS averaged over all benign and safety-critical scenarios. Dir: 4D+Dir, BEV: 4D+BEV, Cam: 4D+Cam.

A.8 Full diagnostic report

State Space Algo. Safety Level Functionality Level Etiquette Level OS
CR RR SS OR RF Comp TS ACC YV LI
4D DDPG 0.780 0.089 0.087 12.619 0.504 0.466 20.860 2.488 0.405 5.764 0.489
SAC 0.829 0.216 0.146 3.115 0.882 0.648 16.827 1.830 0.704 2.580 0.499
TD3 0.783 0.231 0.141 2.535 0.903 0.670 17.644 2.680 1.493 2.545 0.516
PPO 0.603 0.287 0.150 0.099 0.901 0.751 18.021 2.461 1.506 3.528 0.606
Dir SAC 0.676 0.209 0.152 5.658 0.740 0.705 23.386 1.892 0.640 4.565 0.558
TD3 0.655 0.270 0.144 0.885 0.887 0.718 18.899 2.417 1.187 4.694 0.579
PPO 0.739 0.045 0.077 17.607 0.685 0.534 21.336 2.911 0.893 4.875 0.513
BEV SAC 0.782 0.229 0.141 6.057 0.883 0.674 17.863 2.952 1.566 4.448 0.506
PPO 0.416 0.262 0.151 2.180 0.782 0.756 30.651 2.592 1.290 7.319 0.679
Cam SAC 0.829 0.261 0.149 0.014 0.926 0.637 15.480 4.354 1.885 6.139 0.485
PPO 0.600 0.050 0.127 15.101 0.708 0.599 31.914 2.631 0.827 6.327 0.576
Table 9: Diagnostic report of all AD algorithms tested on SafeBench. We test AD algorithms with different state spaces on all selected testing scenarios and report the evaluation results on three different levels. Dir: 4D+Dir, BEV: 4D+BEV, Cam: 4D+Cam.

In this section, we provide the diagnostic report of all AD algorithms tested on SafeBench. We evaluate different combinations of input state spaces and RL algorithms at the different levels of evaluation metrics. Results are shown in Table 9. We find that PPO achieves the highest OS for most input state spaces, with the highest score of 0.679 using the 4D+BEV state space. In addition, regarding the collision rate, by comparing agents with different input state spaces, we notice that AD algorithms with 4D input have the highest CR, while algorithms with 4D+BEV input obtain the lowest CR, which indicates that BEV is the most helpful information for AD systems to drive safely. Finally, we also observe a trade-off between functionality level metrics and safety level metrics for state spaces other than 4D: agents that perform well at the functionality level may not be safe regarding the safety level metrics. For example, with 4D+BEV input, PPO achieves a lower CR than SAC, while its route following stability (RF) is lower than SAC's. A similar phenomenon can also be found with the 4D+Cam input state space.