Efficient and Effective Generation of Test Cases for Pedestrian Detection – Search-based Software Testing of Baidu Apollo in SVL

by Hamid Ebadi, et al.

With the growing capabilities of autonomous vehicles, there is a higher demand for sophisticated and pragmatic quality assurance approaches for machine learning-enabled systems in the automotive AI context. Simulation-based prototyping platforms make early-stage testing possible, enabling inexpensive testing and the ability to capture critical corner-case test scenarios. Simulation-based testing thus complements conventional on-road testing. However, due to the large space of test input parameters in these systems, the efficient generation of effective test scenarios that unveil failures is a challenge. This paper presents a study on testing the pedestrian detection and emergency braking system of the Baidu Apollo autonomous driving platform within the SVL simulator. We propose an evolutionary automated test generation technique that generates failure-revealing scenarios for Apollo in the SVL environment. Our approach models the input space using a generic and flexible data structure and benefits from a multi-criteria safety-based heuristic as the objective function targeted for optimization. This paper presents the results of our proposed test generation technique in the 2021 IEEE Autonomous Driving AI Test Challenge. In order to demonstrate the efficiency and effectiveness of our approach, we also report the results from a baseline random generation technique. Our evaluation shows that the proposed evolutionary test case generator is more effective at generating failure-revealing test cases and provides higher diversity between the generated failures than the random baseline.




I Introduction

The capabilities of autonomous vehicles have increased remarkably in recent years. A self-driving car is arguably the most tangible example of what the European Commission (EC) defines as an Artificial Intelligence (AI) system. From an AI perspective, the automotive industry has successfully harnessed the disruptive potential of machine learning over the last decade. Driven by the availability of big data and computing power, deep neural networks (DNNs) have enabled new levels of vehicular perception. However, performing effective quality assurance of systems that rely on DNNs requires a paradigm shift [borg2019safely]. No longer do human engineers explicitly express all logic of the system in source code. Instead, DNNs are trained using enormous quantities of manually annotated data and perform actions probabilistically based on patterns observed in that data. The research community has put substantial effort into making DNN-based systems trustworthy in the automotive AI context, spurring major R&D projects and global safety standardization efforts.

The concept of Trustworthy AI receives particular attention in the EC’s AI Strategy [ec_ai_strategy]. The EC defines AI systems as “software (and possibly also hardware) systems designed by humans that, given a complex goal, act in the physical or digital dimension by perceiving their environment through data acquisition, interpreting the collected structured or unstructured data, reasoning on the knowledge, or processing the information, derived from this data and deciding the best action(s) to take to achieve the given goal” [ec_ai_def]. Novel ways to test AI systems, including autonomous vehicles, are urgently needed, and the research community has taken up the challenge [zhang2020machine, riccio2020testing].

The use of virtual prototyping platforms for automotive software engineering has rapidly grown in recent years [bock2019status]. The use of virtual methods allows testing and validation at early development stages, which leads to fewer development cycles and faster time-to-market. Simulation-based testing is required to complement conventional on-road testing due to severe drawbacks in the use of on-road testing [koopman2016challenges], i.e., system testing on public roads is costly and does not scale to the quantity of scenarios needed—in addition, it can be dangerous to provoke a critical situation on the road. Testing autonomous vehicles in simulators is fundamental to quality assurance in the automotive sector—as indicated in the evolving standard ISO 21448 Safety of the Intended Functionality [international_organization_for_standardization_road_2019].

Efficient and effective testing in simulated environments requires sophisticated approaches to automatically generating test cases. Several authors have demonstrated that search-based software test generation (SBST) [mcminn2011search] is a feasible approach to generate critical test scenarios in the automotive context [ben_abdessalem_testing_2016, ben_abdessalem_testing_2018, gambi_generating_2019, borg2021digital, Moghadam2021Deeper], i.e., test scenarios that lead to the violation of safety requirements. SBST formulates test input selection as a search problem, where optimization algorithms attempt to systematically identify the test inputs that meet goals of interest. Given a scoring function denoting closeness to the attainment of those goals, called the objective function, optimization algorithms can sample from a large and complex set of test inputs as guided by a chosen sampling strategy (a metaheuristic; in our case, a genetic algorithm).


In the 2021 IEEE Autonomous Driving AI Test Challenge competition, our contribution, ScenarioGenerator, uses SBST to generate test scenarios that cause the Baidu Apollo autonomous driving platform to fail. While different scenarios can be tested using ScenarioGenerator, for the purpose of this work, we assume a scenario with a pedestrian crossing a street with the following high-level safety goal: “The ego car shall not crash into pedestrians on collision course.” We refer to any crash between an ego car and a pedestrian as a safety violation or failure.

Our work relies on a test strategy involving the following steps of simulation-based automotive testing using SBST. We:

  1. Build a scene in the virtual environment.

  2. Define the parameters involved in creating a varied set of test cases.

  3. Define ranges for each parameter, representing the test input space to explore.

  4. Define an objective function that measures the quality of a generated test case, in terms of its potential to demonstrate a safety violation. In our case, lower scores indicate more dangerous scenarios.

  5. Apply a genetic algorithm to generate test cases that minimize the objective function, leading to safety-critical scenarios.

To accomplish this, we first import a pre-existing map into the SVL Visual Scenario Editor and create an initial movement path for a pedestrian using fixed waypoints—a set of coordinates (points) showing the initial path of the pedestrian’s movements. Then, during the simulation, in the designed scene, the ego car moves forward toward a target and a pedestrian crosses the road from the right.

The proposed evolutionary test case generation formulates the search space using a generic noise vector data structure and minimizes a multi-criteria objective function that combines (1) distances between the ego car and other road agents, (2) the distance of the journey taken by the ego car towards the target, and (3) the number of accidents detected. Using the noise vector as a generic and flexible structure for representing the search space of the problem facilitates the use of a wide variety of search algorithms. This paper presents the results of our proposed test case generation technique in the 2021 IEEE Autonomous Driving AI Test Challenge. To provide comparative results and demonstrate the efficiency and effectiveness of our evolutionary test case solution, we also compare our results to random generation of test scenarios.

The rest of the paper is organized as follows: Section II presents the details of the proposed search-based test case generation approach. Section III describes the implementation and empirical evaluation, including the test scenario execution and experiment setup. Section IV presents the results and discusses threats to their validity. Section V presents an overview of related work, and Section VI summarizes our findings in light of the importance of simulation-based testing of autonomous vehicles and outlines potential research directions for future work.

II Search-based Test Case Generation

This section presents how we use an evolutionary search-based technique to generate test cases. Since each scenario takes a few seconds to execute, it is not feasible to try all possible test scenarios. Our approach adopts a generic data structure, a data vector called a “noise vector”, to represent the test input domain for producing test scenarios. Each element of this vector represents a parameter that defines a test scenario, e.g., waypoints, illumination, and weather. Because the values of these parameters do not lie within the same range, the input representation scales the concrete real values into a single common range. The values in the noise vector are manipulated by the search algorithm to produce test cases. We use a genetic algorithm to explore the search space and produce test cases that an objective function, based on potential pedestrian collisions, judges as more valuable.

II-A Scenario Creation and Manipulation

We use SVL Visual Scenario Editor as the first step to create a basic scheme of the test scenarios that are going to be executed by SVL simulator. SVL Visual Scenario Editor is a GUI application that can be used to create basic scenarios specifying where agents (pedestrians, vehicles, ego vehicle, etc.) are positioned in a map and the basic scheme of the path that they should take through the map, which is specified in the form of waypoints.

The basic scenario is created and exported from SVL Visual Scenario Editor to the SVL simulator. This scenario is then manipulated by ScenarioGenerator to produce new test scenarios. In ScenarioGenerator, a derived test scenario is specified by a vector of real numbers, the noise vector, whose values all lie within the same fixed range.

II-B Scenario Specification

A test scenario is defined as a set of parameters used for test scenario generation, i.e., modeling the test inputs, as follows:

$$TS = \{(p_i, ub_i, lb_i) \mid 1 \le i \le k\}$$

where $TS$ indicates a test scenario and $p_i$ denotes a test input parameter. The values of the test input parameters often vary over different ranges; $ub_i$ and $lb_i$ represent the upper and lower boundaries of the value range for parameter $p_i$.

For example, the scenario may define a variable representing the time of day. In the base scenario, the time of day may be defined as 12:00. The upper and lower boundaries are used to limit the change in this value in a generated test scenario (e.g., boundaries of five hours in either direction would allow the time to vary from 7:00 to 17:00). The parameters representing the positions of the agents would have different ranges, e.g., the position points in a path that the vehicle takes may change by up to 2 meters.

II-C Noise Vector

The proposed representation for a test case is a vector, which is defined as follows:

$$NV = [n_1, n_2, \ldots, n_k]$$

where each element $n_i$ corresponds to a test input parameter $p_i$, and the values of the components of the noise vector are scaled, using a linear scaling function, to values within the bounds $[lb_i, ub_i]$ of the corresponding parameter to create a test scenario $TS$.

This transformation allows the use of a generic representation that can be uniformly manipulated by the test generator without detailed knowledge of each input parameter. All elements of the noise vector fall within the same fixed range and are scaled appropriately using $ub_i$ and $lb_i$ for that $p_i$.

For example, a noise vector value of 0.5 for the entry representing the time of day would result in a concrete time of 14:00 in the test case.
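The linear scaling step can be illustrated with a minimal Python sketch. It is not the ScenarioGenerator implementation: the `ParamSpec` type, the parameter names, the choice to treat the bounds as offsets from the base value, and the noise range of [-1, 1] are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple


@dataclass(frozen=True)
class ParamSpec:
    """One test input parameter (hypothetical type, for illustration only)."""
    name: str
    base: float  # value in the base scenario
    lb: float    # lower bound on the change, as an offset from base
    ub: float    # upper bound on the change, as an offset from base


def scale(n: float, lb: float, ub: float,
          noise_range: Tuple[float, float] = (-1.0, 1.0)) -> float:
    """Linearly map n from noise_range (ASSUMED to be [-1, 1]) onto [lb, ub]."""
    lo, hi = noise_range
    if not lo <= n <= hi:
        raise ValueError(f"noise value {n} is outside {noise_range}")
    return lb + (n - lo) / (hi - lo) * (ub - lb)


def to_scenario(noise_vector: Sequence[float],
                specs: Sequence[ParamSpec]) -> Dict[str, float]:
    """Turn a noise vector into concrete parameter values, one per spec."""
    if len(noise_vector) != len(specs):
        raise ValueError("one noise value is needed per parameter")
    return {s.name: s.base + scale(n, s.lb, s.ub)
            for n, s in zip(noise_vector, specs)}


# Hypothetical parameters: a waypoint coordinate that may move by up to
# 2 meters and an agent speed that may change by up to 0.5 m/s.
specs = [
    ParamSpec("waypoint_x", base=10.0, lb=-2.0, ub=2.0),
    ParamSpec("agent_speed", base=1.5, lb=-0.5, ub=0.5),
]
scenario = to_scenario([0.5, -1.0], specs)
print(scenario)  # {'waypoint_x': 11.0, 'agent_speed': 1.0}
```

Because the search algorithm only ever sees values in the common noise range, the same mutation operator can be applied to every element regardless of the unit or magnitude of the underlying parameter.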

II-D Objective Function

In order to generate valuable test scenarios, we must identify scenarios that are more likely to lead to safety violations. Safety violations can occur when the ego car moves toward its target at a reasonable speed. Specifically, the objectives to be optimized are as follows:

  • The total Euclidean distance of the ego vehicle from other non-ego traffic during scenario execution. This objective should be minimized, as we want to examine ego vehicle behavior in potentially dangerous scenarios.

  • The total distance of the journey. This should be maximized, as longer journeys are preferred.

  • The number of accidents. This should also be maximized, as we seek failures in ego vehicle behavior.

Since the aforementioned objectives do not conflict with each other, we merge them to form a single objective function. This function is minimized, i.e., lower scores are preferred. In the merged function, a larger total distance from other traffic raises the score, while a longer journey and a larger number of accidents lower it.

We weight the number of accidents heavily, as we are interested in generating test scenarios leading to crashes.
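One way to merge the three objectives into a single score to be minimized is a weighted sum. The sketch below is an assumption made for illustration, not the formula used in the paper: the function name, the additive form, and both weights (`w_journey`, `w_accident`) are hypothetical. It only preserves the directions stated above.

```python
def merged_objective(total_agent_distance: float,
                     journey_distance: float,
                     num_accidents: int,
                     w_journey: float = 1.0,
                     w_accident: float = 1000.0) -> float:
    """Return a score where lower means a more safety-critical scenario.

    ASSUMPTION: a weighted sum with hypothetical weights. The distance to
    other traffic is added (to be minimized); the journey distance and the
    accident count are subtracted (to be maximized). The large accident
    weight makes any scenario with a crash score better than one without.
    """
    return (total_agent_distance
            - w_journey * journey_distance
            - w_accident * num_accidents)


# Two hypothetical executions with the same distances: the one that ends
# in a collision receives the lower (preferred) score.
no_crash = merged_objective(50.0, 30.0, 0)
crash = merged_objective(50.0, 30.0, 1)
print(no_crash, crash)  # 20.0 -980.0
```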

II-E Search Algorithm

It is not possible to execute every possible test scenario that can be defined by an instance of the noise vector. Instead, we seek a systematic means to sample from the space of possible scenarios in search of those that could lead to safety violations. This can be done by using an optimization algorithm to sample the space, as guided by the objective function.

The optimization algorithm used to minimize the objective function is a Genetic Algorithm (GA). Genetic Algorithms are modeled on the evolution of a population over time. Initially, a random population of solutions (noise_vector instances) is generated. Then, at each generation, a new population is formed based on the best solutions resulting from the previous generations of evolution. This population is formed by:

  • Identifying good solutions using tournament selection, where a subset of the population is selected at random and the best member of the subset is identified.

  • Breeding “child” solutions by combining elements of “parent” solutions through crossover, where the child solutions are formed by selecting genes (elements) from each parent solution.

  • Introducing mutations into the population by making small, random adjustments to solutions.

Tournament selection is performed to identify parent solutions, then crossover and mutation are performed at user-set probabilities. Either, or both, may be applied to transform the identified solutions. Finally, the resulting solutions are added to the new population. This process continues until a new population is formed. The objective function is calculated for each member of this population, and the score is stored for that solution. This process is performed each generation, until a user-set number of generations has been exhausted. At the end, the best solutions are returned.

In our case, we have three objectives (the distance from other traffic, the journey distance, and the number of accidents), which have been merged into a single formula. Tournament selection picks the best solution among the solutions in each tournament. The number of individuals participating in each tournament denotes the size of the tournament. In our approach, we omit the crossover operation, as the noise vector contains the values for the parameters of the test scenarios in a certain order, and crossover could violate this ordering. Instead, we apply mutation with a high probability. We use Polynomial Bounded mutation, as proposed and implemented in NSGA-II [deb2002fast]. It is a bounded mutation operation for real-valued parameters and uses a polynomial function for the probability distribution. It uses a parameter, $\eta$, indicating the crowding degree of the mutation, which is used to encourage diversity in the resulting population. A high $\eta$ yields a mutant resembling the original solution, while a small value for $\eta$ produces a solution more divergent from the original. The GA used for generating test scenarios is configured as presented in Algorithm 1.

Initialize population with solutions from random seeds;
Evaluate the population;
repeat
      1. Select offspring using Tournament Selection with replacement;
      2. Mutate the resulting offspring using Polynomial Bounded mutation operation with a certain probability (mutation rate = 0.95);
      3. Evaluate the offspring using the objective function.
until meeting the stopping criteria (reaching the maximum number of generations or other limitations specified in the test budget);
Algorithm 1 GA for Test Scenario Generation
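Algorithm 1 can be sketched in plain Python as follows. This is a minimal illustration using only the standard library, not the ScenarioGenerator code. The mutation rate of 0.95 comes from Algorithm 1; the noise bounds of [-1, 1], the tournament size, the crowding degree `eta`, the per-gene probability `indpb`, and the toy `sphere` objective (standing in for a simulator run) are assumptions.

```python
import random
from typing import Callable, List, Sequence, Tuple

LOW, UP = -1.0, 1.0  # ASSUMED bounds of every noise-vector element


def mutate_polynomial_bounded(ind: List[float], eta: float, low: float,
                              up: float, indpb: float,
                              rng: random.Random) -> None:
    """Polynomial bounded mutation, applied in place to each gene with
    probability indpb. A high eta keeps the mutant close to the original."""
    for i, x in enumerate(ind):
        if rng.random() > indpb:
            continue  # this gene is left unchanged
        delta_1 = (x - low) / (up - low)
        delta_2 = (up - x) / (up - low)
        u = rng.random()
        mut_pow = 1.0 / (eta + 1.0)
        if u < 0.5:
            val = 2.0 * u + (1.0 - 2.0 * u) * (1.0 - delta_1) ** (eta + 1.0)
            delta_q = val ** mut_pow - 1.0
        else:
            val = (2.0 * (1.0 - u)
                   + 2.0 * (u - 0.5) * (1.0 - delta_2) ** (eta + 1.0))
            delta_q = 1.0 - val ** mut_pow
        ind[i] = min(max(x + delta_q * (up - low), low), up)


def tournament(pop: Sequence[List[float]], fits: Sequence[float],
               size: int, rng: random.Random) -> List[float]:
    """Pick `size` members at random (with replacement); copy the best."""
    picks = [rng.randrange(len(pop)) for _ in range(size)]
    return list(pop[min(picks, key=lambda i: fits[i])])


def run_ga(objective: Callable[[Sequence[float]], float], dim: int,
           pop_size: int = 10, generations: int = 19, tourn_size: int = 3,
           mut_rate: float = 0.95, eta: float = 20.0, indpb: float = 0.5,
           seed: int = 0) -> Tuple[List[float], float, int]:
    """Minimize `objective`; return (best vector, best score, evaluations)."""
    rng = random.Random(seed)
    pop = [[rng.uniform(LOW, UP) for _ in range(dim)] for _ in range(pop_size)]
    fits = [objective(ind) for ind in pop]
    evaluations = pop_size
    best = min(zip(fits, pop))
    for _ in range(generations):
        offspring = [tournament(pop, fits, tourn_size, rng)
                     for _ in range(pop_size)]
        for child in offspring:
            if rng.random() < mut_rate:  # no crossover, mutation only
                mutate_polynomial_bounded(child, eta, LOW, UP, indpb, rng)
        pop = offspring
        fits = [objective(ind) for ind in pop]
        evaluations += pop_size
        best = min(best, min(zip(fits, pop)))
    return best[1], best[0], evaluations


def sphere(v: Sequence[float]) -> float:
    """Toy objective standing in for one simulated scenario execution."""
    return sum(x * x for x in v)


best_vec, best_score, evaluations = run_ga(sphere, dim=5)
print(evaluations)  # 200
```

With the defaults above, the initial population plus 19 further generations of 10 individuals uses exactly 200 objective evaluations. In a real setup, `objective` would execute the scenario in the simulator and compute the merged score from the recorded distances and collisions.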
Fig. 1: Overview of the experimental setup.

III Implementation and Empirical Evaluation

We perform an empirical evaluation of the proposed test case generation technique, ScenarioGenerator (available from https://github.com/ebadi/ScenarioGenerator), by running experiments on a desktop PC with the following specifications:

  • Ubuntu version 18.04

  • Intel Core i7-10700K CPU @ 3.80GHz × 16

  • 32GB RAM

  • GeForce RTX 2070 SUPER/PCIe/SSE2

  • SVL simulator 2021.1 (linux64) with modular testing setup (3D Ground Truth sensor and Signal sensor publish ground truth perception data to Apollo via CyberRT bridge)

  • Baidu Apollo (r6.0.0 branch)

The experiments are simulations that are controlled by a Python scenario runner which uses our test case generation technique for generating the scenarios in the simulation environment. Baidu Apollo is the autonomous driving software platform that controls the ego vehicle. It connects to the simulator through its customized bridge and drives the ego vehicle (Fig. 1).

We design a set of experiments to assess the efficiency and effectiveness of the proposed test case generation for testing Apollo in the SVL simulation environment. Detecting pedestrians and responding appropriately is the target use case of Apollo in our experiments. For a comparative analysis, we also report results from a random testing technique as a baseline approach. In random testing, the test cases are generated randomly, which means that the set of noise vector instances is generated by setting the test input parameters to random values within the allowed range. The target is to generate the highest number of diverse valid test cases leading to failures, i.e., collisions between the ego vehicle and pedestrians. We use the following quality criteria for evaluating the proposed test case generation technique:

  • Detected Failures: The number of test cases that lead to a collision.

  • Failure Diversity: The dissimilarity between the generated test cases leading to failures. We are interested in generating diverse test cases, as triggering similar failures leads to waste of the test budget, e.g., computation resources. To measure failure diversity, we use the Euclidean distance between failure-revealing noise vectors.
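The random baseline and the Detected Failures criterion can be sketched as follows. This is an illustrative sketch only: the noise bounds of [-1, 1], the vector length, and the `causes_collision` predicate (which stands in for executing a scenario in the simulator and checking for a crash) are hypothetical.

```python
import random
from typing import Callable, List, Sequence


def random_noise_vectors(count: int, dim: int, low: float = -1.0,
                         up: float = 1.0, seed: int = 0) -> List[List[float]]:
    """Draw `count` noise vectors uniformly at random within [low, up].

    ASSUMPTION: [-1, 1] is used as the allowed noise range.
    """
    rng = random.Random(seed)
    return [[rng.uniform(low, up) for _ in range(dim)] for _ in range(count)]


def failure_revealing(vectors: Sequence[Sequence[float]],
                      causes_collision: Callable[[Sequence[float]], bool]
                      ) -> List[Sequence[float]]:
    """Keep only the test cases whose execution ends in a collision."""
    return [v for v in vectors if causes_collision(v)]


# Hypothetical stand-in for a simulator run: pretend a collision happens
# whenever the first noise value is large.
def fake_collision_check(v: Sequence[float]) -> bool:
    return v[0] > 0.9


tests = random_noise_vectors(count=200, dim=8)
failures = failure_revealing(tests, fake_collision_check)
print(len(tests), len(failures))
```

Using the same number of executions for the baseline as for the genetic algorithm keeps the comparison of detected failures within the same test budget.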

III-A Test Scenario Execution

The testing budget (including, e.g., execution time) is a limited resource. While not as expensive to perform as on-road testing, running test scenarios in simulators also takes time. In our experiments, each scenario takes about 10 seconds to execute and evaluate. Therefore, for the purpose of this competition, we set the limit for the number of simulation executions to 200 in the Genetic Algorithm. This would correspond, for example, to 20 generations with a population size of 10.

(a) Number of detected failures.

(b) Objective values for the average journey distance during failure-revealing test cases.

(c) Objective values for average distance from ego car during failure-revealing test cases.
Fig. 2: Comparisons between GA and random generation.
Fig. 3: Collision between pedestrian and the ego vehicle on a rainy night.

In ScenarioGenerator, the user-controllable parameters for test scenario creation and manipulation are as follows:

  • Initial JSON file created by SVL Visual Scenario Editor.

  • Test case generation strategy, which is used for scenario generation. Currently, Differential Evolution, Powell Optimization, Genetic Algorithm, and random generation strategies are supported. Replaying a scenario is also supported by passing the JSON file and setting the action to replay. A specific noise vector can be supplied together with the replay action; in this mode, in addition to all the previous parameters, the given noise vector is the one that is played.

  • The ego vehicle destination.

  • Acceptable range of changes in the values for the position of each waypoint.

  • Acceptable range of changes in the color of each agent (r, g, b).

  • Acceptable range of changes in the weather in the simulation (e.g., rain, fog, wetness, cloudiness, road damage).

  • Acceptable range of changes in the time of day.

  • Acceptable range of changes in the speed of each agent.

In a test case, the generated noise vector is used to impose changes to the position of each waypoint, the color of each agent, the weather, the time of day, and the speed of each agent. The base scenario defines a value for each of these parameters. The user-controllable parameters are used to constrain the range of changes made by the noise vector between minimum and maximum values, as discussed in Section II.

IV Results and Discussion

This section presents the experimental results and assesses the proposed test case generation compared to the random testing with regard to the quality criteria.

Detected Failures: Fig. 2(a) shows the number of detected failures (test cases leading to collisions) by the GA-based test case generation and random testing. The proposed GA-based technique triggers twice as many failures as random testing on the same configuration and test budget and is therefore more effective in this regard. Fig. 3 shows a sample of a generated test scenario leading to a collision between the pedestrian and the ego vehicle.

In order to investigate the characteristics of the detected failures, we examine the values of two of the objectives in the objective function: the journey distance and the distance from the ego car. Fig. 2(b) and (c) show the average values of the two objectives in failure-revealing test cases for both techniques. These average values do not differ significantly between the two approaches. This indicates that the GA reveals more failures, but the failures revealed by the two techniques fall in similar objective ranges. However, both distances are somewhat higher for the GA, i.e., the GA generates tests with slightly longer journey distances and a slightly higher distance from the ego car. These tests may be somewhat more interesting for revealing errors in the ego car functionality since, for example, a longer distance between the ego car and a pedestrian should offer more time to make corrections. In future work, we will examine failing scenarios more closely and discuss them with domain experts.

Failure Diversity: We use the pairwise Euclidean distance between the noise vectors to show diversity between the failure-revealing test cases. Figs. 4 and 5 show the average pairwise Euclidean distance for each of the failure-revealing test cases generated by the GA and random testing, respectively. The average pairwise Euclidean distance of a test case is the mean distance between its noise vector and the noise vectors of the other failure-revealing test cases. Table I shows the range of the average pairwise Euclidean distance for the failure-revealing test cases from the GA and random testing. In this regard, the GA technique also promotes more diversity between generated failure-revealing test cases than random testing.
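The diversity measure described above can be computed with a few lines of Python. This is a minimal sketch with made-up two-dimensional vectors, not the evaluation script used for the reported results; the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def average_pairwise_distances(
        vectors: Sequence[Sequence[float]]) -> List[float]:
    """For each vector, return its mean Euclidean distance to all others."""
    if len(vectors) < 2:
        return [0.0] * len(vectors)  # no other test case to compare against
    averages = []
    for i, v in enumerate(vectors):
        dists = [math.dist(v, w) for j, w in enumerate(vectors) if j != i]
        averages.append(sum(dists) / len(dists))
    return averages


def diversity_range(vectors: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Return the (min, max) of the average pairwise distances."""
    averages = average_pairwise_distances(vectors)
    return min(averages), max(averages)


# Made-up failure-revealing noise vectors: two identical ones and one
# that is 5 units away from both.
failing = [[0.0, 0.0], [3.0, 4.0], [0.0, 0.0]]
print(average_pairwise_distances(failing))  # [2.5, 5.0, 2.5]
print(diversity_range(failing))             # (2.5, 5.0)
```

A set of near-identical failures produces averages close to zero, so a wider and higher range indicates that the failures were found in more distinct regions of the input space.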

Fig. 4: Diversity of failure-revealing test cases generated by the GA.
Fig. 5: Diversity of failure-revealing test cases generated by random testing.
Genetic Algorithm Random
Range of Euclidean Distances
TABLE I: Failure diversity, shown as the range in the average pairwise Euclidean distance for test cases.

IV-A Threats to Validity

Some of the main sources of threats to validity of the experimental results are as follows:

Internal Validity: During the experiment, we noticed that many of the failures that are captured are not completely reproducible. In fact, the simulation execution often does not produce identical results given identical input parameters and configuration setup. One of the main reasons is that Apollo does not function in a deterministic manner. We tried to mitigate the effects of this by reporting average values from the experiments and by conducting the experiments in a controlled manner, i.e., using the same experimental setup and keeping the user-controllable parameters fixed between executions. Another threat is that, as the simulator runs a large number of test cases, the simulations become slower and less responsive, probably due to performance bottlenecks.

External Validity: We have focused on a single scenario. As we have used a generic data structure consisting of variables scaled to a fixed range, i.e., the noise vector, we believe that the representation model and test case generation approach could be used for simulation-based testing of more complex scenes and other use cases. However, the variables in the noise vector might need to be modified (e.g., extended) for different use cases.

V Related Work

Simulators, as a form of digital twin, play a key role in testing and verification, control and monitoring, and improvement of cyber-physical systems (CPS). For ADAS and autonomous cars, this is even more significant, and there is a higher demand for high-fidelity simulators. Simulation-based testing is one of the most effective approaches for system-level testing of ADAS and acts as a suitable complementary solution to on-road testing, since it provides the possibility for early-stage testing, capturing critical corner-case test scenarios, and enabling inexpensive testing. Field testing of such systems is expensive, inefficient, and, in some cases, even dangerous. Recently, various simulators, such as those using physics-based models, e.g., SVL simulator [SVLSimulator], PreScan [PreScan], and Pro-SiVIC [ProSivic_belbachir2012simulation], or those relying on game engines, e.g., BeamNG [BeamNG] and CARLA [dosovitskiy2017carla], have been developed to meet the need for realistic simulation of the functions in autonomous driving.

Accordingly, various system-level testing approaches relying on simulators have been proposed in recent years. A common purpose in those studies is generating critical test cases (scenarios) that lead the system to fail. This is a challenging problem due to the large search space of input parameters in these systems. Covering all possible simulation test scenarios is not feasible in practice. Therefore, SBST techniques have been widely used to generate effective simulation test scenarios for those systems. In recent studies, multi-objective search algorithms like NSGA-II [ben_abdessalem_testing_2016], many-objective algorithms like MOSA [panichella2015reformulating] using a combination of different objectives based on branch coverage and failure-based heuristics [abdessalem2018testing], and learnable evolutionary algorithms [abdessalem2018testinglearnable] have been used to generate critical test cases leading to violations of safety requirements in autonomous cars. Moreover, there have also been studies focusing on the role of simulators and the type of test data. In [haq2020comparing], testing of DNN-based ADAS using real-world data is compared with testing using simulator-generated data, and it is also shown how online and offline testing of these systems can differ and complement each other. Borg et al. studied the consistency between the results obtained from two different simulators and investigated whether the obtained results could be reproduced in both simulators [borg2021digital].

VI Conclusion and Future Work

Efficient and effective test case generation for use in virtual environments is essential for testing AI-based automotive systems. In this paper, we presented an SBST approach to generate test scenarios that lead to detection of failures and safety violations of the Baidu Apollo pedestrian emergency braking system. We have made three primary observations. First, our results show that the proposed GA-based test case generation is more effective than random testing, i.e., it generates more failure-revealing test cases and provides higher diversity between the generated test cases. Second, unfortunately, many of the captured failures could not be reproduced given the same configuration and user-controlled parameters due to the non-deterministic nature of Apollo. Third, we see great potential in simulation-based testing of different functions of autonomous driving systems using the SVL simulator and Baidu Apollo. In future work, we will broaden the scope of the research into additional safety scenarios. We will also extend SBST approaches with machine learning-based techniques (e.g., reinforcement learning) for test case generation in system-level testing of ADAS. We are also interested in the use of Generative Adversarial Networks (GANs) as a technique for enabling the discovery of failure-revealing test cases.


This project has received funding from the ECSEL Joint Undertaking (JU) under grant agreement No 876852 (VALU3S). Furthermore, this work received support from the ITEA3 European IVVES project (https://itea3.org/project/ivves.html), the SMILE III project financed by Vinnova, FFI, Fordonsstrategisk forskning och innovation under grant number 2019-05871, and the AIQ Meta-Testbed project funded by Kompetensfonden at Campus Helsingborg, Lund University, Sweden. Additional support was provided under Vetenskapsrådet grant 2019-05275. The authors would like to thank INFOTIV AB for their support and cooperation.