Deep Neural Networks (DNN) [5, 3, 6] have made unprecedented progress largely fueled by increasing availability of data and computing powers. DNNs have been able to automate challenging real-world tasks such as image classification 23] and speech recognition , making them key enablers of smart and autonomous systems such as automated-driving vehicles. As DNNs are increasingly used in safety critical autonomous systems, the challenge of ensuring safety and reliability of DNN-based systems emerges as a difficult and fundamental software verification problem.
Many DNN testing approaches have been proposed recently [18, 25, 31, 15, 32]. Among these, we distinguish two high-level, distinct approaches to DNN testing: (1) Testing DNNs as stand-alone components, and (2) testing DNNs embedded into a specific application (e.g., an automated-driving system) and in interaction with the application environment. We refer to the former as offline testing and to the latter as online testing. Specifically, in offline testing, DNNs are tested as a unit in an open-loop mode. They are fed with test inputs generated independently from the DNN under test, either manually or automatically (e.g., using image generative methods ). The outputs of DNNs are then typically evaluated by assessing their prediction error, which is the difference between the expected test outputs (i.e., test oracles) and the outputs generated by the DNN under test. In online testing, however, DNNs are tested within an application environment in a closed-loop mode. They receive test inputs generated by the environment, and their outputs are, then, directly fed back into the environment. Online testing evaluates DNNs by monitoring the requirements violations they trigger, for example related to safety.
There have been several offline and online DNN testing approaches in the literature 
. However, comparatively, offline testing has been far more studied to date. This is partly because offline testing does not require the DNN to be embedded into an application environment and can be readily carried out with either manually generated or automatically generated test data. Given the increasing availability of open-source data, a large part of offline testing research uses open-source, manually-generated real-life test data. Online testing, on the other hand, necessitates embedding a DNN into an application environment, either real or simulated. Given the safety critical nature of many systems relying on DNN (e.g., self-driving cars), most online testing approaches rely on simulators, as testing DNNs embedded into real and operational environment is expensive, time consuming and can be dangerous in some cases.
While both offline and online testing approaches have shown to be promising, there is limited insight as to how these two approaches compare with one another. While, at a high-level, we expect offline testing to be faster and less expensive than online testing, we do not know how they compare with respect to their ability to reveal faults, for example leading to safety violations. Further, we would like to know if large prediction errors identified by offline testing always lead to safety violations detectable by online testing? or if the safety violations identified by online testing translate into large prediction errors? Answers to these questions enable us to better know the relationships and the limitations of the two testing approaches. We can then know which approach is to be recommended in practice for testing autonomous systems, or if the two are somehow complementary and should be best combined.
In this paper, though the investigated questions are generally relevant to all autonomous systems, we perform an empirical study to compare DNN offline and online testing in the context of automated driving systems. In particular, our study aims to ultimately answer the following research question: RQ1: How do offline and online testing results differ and complement each other? To answer this question, we use open-source DNN models developed to automate steering functions of self-driving vehicles . To enable online testing of these DNNs, we integrate them into a powerful, high-fidelity physics-based simulator of self-driving cars . The simulator allows us to specify and execute scenarios capturing various road traffic situations, different pedestrian-to-vehicle and vehicle-to-vehicle interactions, and different road-topologies, weather conditions and infrastructures. As a result, in our study offline and online testing approaches are compared with respect to the data generated automatically using a simulator. To ensure that this aspect does not impact the validity of our comparison, we investigate the following research question as a pre-requisite of the above question: RQ0: Can we use simulator-generated data as a reliable substitute to real-world data for the purpose of DNN testing?
To summarize, the main contribution of this paper is that we provide, for the first time, an empirical study comparing offline and online testing of DNNs. Our study investigates two research questions RQ0 and RQ1 (described above) in the context of an automated-driving system. Specifically,
RQ0: Our results show that simulator-generated datasets are able to yield DNN prediction errors that are similar to those obtained by testing DNNs with real-life datasets. Hence, simulator-generated data can be used in lieu of real-life datasets for testing DNNs in our application context.
RQ1: We found that offline testing is more optimistic than online testing because the accumulation of prediction errors over time is not observed in offline testing. Specifically, many safety violations identified by online testing could not be identified by offline testing as they did not cause large prediction errors. However, all the large prediction errors generated by offline testing led to severe safety violations detectable by online testing.
To facilitate the replication of our study, we have made all the experimental materials, including simulator-generated data, publicly available .
The rest of the paper is organized as follows. Section II provides background on DNNs for autonomous vehicles, introduces offline and online testing, describes our proposed domain model that is used to configure simulation scenarios for automated driving systems, and formalizes the main concepts in offline and online testing used in our experiments. Section III reports the empirical evaluation. Section IV surveys the existing research on online and offline testing for automated driving system. Section V concludes the paper.
Ii Offline and Online Testing frameworks
This section provides the basic concepts that will be used throughout the paper.
Ii-a DNNs in ADS
Depending on the ADS design, DNNs may be used in two ways to automate the driving task of a vehicle: One design approach is to incorporate DNNs into the perception layer of ADS primarily to do semantic segmentation 
, i.e., to classify and label each and every pixel in a given image. The software controller of ADS then decides what commands should be issued to the vehicle’s actuators based on the classification results produced by the DNN. An alternative design approach is to use DNNs to perform the end-to-end control of a vehicle  (e.g., Figure 1). In this case, DNNs directly generate the commands to be sent to the vehicles’ actuators after processing images received from cameras. Our approach to compare offline and online testing of DNNs for ADS is applicable to both ADS designs. In the comparison provided in this paper, however, we use DNN models automating the end-to-end control of steering function of ADS since these models are publicly available online and have been extensively used in recent papers on DNN testing [25, 31, 15, 14]. In particular, we use the DNN models from the Udacity self-driving challenge as our study subjects . We refer to this class of DNNs as ADS-DNNs in the remainder of the paper. Specifically, ADS-DNN receives inputs from a camera, and generates a steering angle command.
Ii-B Offline Testing
Figure 2 represents an overview of offline testing of DNN in the context of ADS. In general, a dataset used to test a DNN (or any ML model for that matter) is expected to be realistic to be able to provide an unbiased evaluation of the DNN under test. As shown in Figure 2, we identify two sources for generating test data for the offline mode: (1) datasets captured from real-life driving, and (2) datasets generated by simulators. For our ADS-DNN models, a real-life dataset is a video or a sequence of images captured by a camera mounted on a (ego) vehicle’s dashboard while the vehicle is being driven by a human driver. The steering angle of the vehicle applied by the human driver is recorded for the duration of the video and each image (frame) of the video in this sequence is labelled by its corresponding steering angle. This yields a sequence of manually labelled images to be used for testing DNNs. There are, however, some drawbacks with test datasets captured from real-life. Specifically, data generation is expensive, time consuming and lacks diversity. The latter issue is particularly critical since driving scenes, driving habits, as well as objects, infrastructures and roads in driving scenes, can vary widely across countries, continents, climates, seasons, day times, and even drivers.
As shown in Figure 2, another source of test data generation for DNN offline testing is to use simulators to automatically generate videos capturing various driving scenarios. There are increasingly more high-fidelity and advanced physics-based simulators for self-driving vehicles fostered by the needs of the automotive industry which increasingly relies on simulators to improve their testing and verification practices. There are several examples of commercial ADS simulators (e.g., PreScan  and Pro-SiVIC ) and a number of open source ones (e.g., CARLA  and Apollo ). These simulators incorporate dynamic models of vehicles (including vehicles’ actuators, sensors and cameras) and humans as well as various environment aspects (e.g., weather conditions, different road types, different infrastructures). The simulators are highly configurable and can be used to generate desired driving scenarios. In our work, we use the PreScan simulator to generate test datas for ADS-DNNs. PreScan is a widely-used, high-fidelity commercial ADS simulator in the automotive domain and has been used by our industrial partner. In Section II-D, we present the domain model we define to configure the simulator, and describe how we automatically generate scenarios that can be used to test ADS-DNNs. Similar to real-life videos, the videos generated by our simulator are sequences of labelled images such that each image is labelled by a steering angle. In contrast to real-life videos, the steering angles generated by the simulator are automatically computed based on the road trajectory as opposed to being generated by a human driver.
The simulator-generated test datasets are cheaper and faster to produce compared to real-life ones. In addition, depending on how advanced and comprehensive the simulator is, we can achieve a higher-level of diversity in the simulator-generated datasets by controlling and varying the objects, roads, weather, day time and infrastructures and their various features. However, it is not yet clear whether simulator-generated images can be used in lieu of real images since real images may have higher resolution, showing more natural texture and look more realistic. In this paper, we conduct an empirical study in Section III to investigate if we can use simulator-generated images as a reliable alternative to real images for testing ADS-DNNs.
Ii-C Online Testing
Figure 3 provides an overview of online testing of DNNs in the context of ADS. In contrast to offline testing, DNNs are embedded into a simulator, they receive images generated by the simulator, and their outputs are directly sent to the (ego) vehicle models of the simulator. In this paper, we embed the ADS-DNN into PreScan by providing the former with the outputs from the camera model in input and connecting the steering angle output of the ADS-DNN as input command to the vehicle dynamic model. With online testing, we can evaluate how predictions generated by an ADS-DNN for an image generated at time in a scenario impacts the images to be generated at the time steps after . Specifically, if the ADS-DNN orders the ego vehicle to turn with an angle at time during a simulation, the camera’s field of view will be shifted by within a small time duration , and hence, the image captured at time will account for the modified camera’s field of view. Note that is the time required by the vehicle to actually perform a command and is computed by the dynamic model in the simulator. With online testing, in addition to the steering angle outputs directly generated by the ADS-DNN, we obtain the trajectory outputs of the ego vehicle which enable us to determine whether the car is able to stay in its lane.
Note that one could perform online testing with a real car and collect real-life data. However, this is expensive, very dangerous, in particular for end-to-end DNNs such as ADS-DNN, and can only be done under very restricted conditions on some specific public roads.
We conduct an empirical study in Section III to investigate How offline and online testing results differ and complement each other for ADS-DNNs.
Ii-D Domain Model
Figure 4 shows a fragment of the domain model capturing the test input space of ADS-DNN. To develop the domain model, we relied on the features that we observed in the real-world test datasets for ADS-DNN (i.e., the Udacity testing datasets ) as well as the configurable features of our simulator. The domain model includes different types of road topologies (e.g., straight, curved, with entry or exit lane), different weather conditions (e.g., sunny, foggy, rainy, snowy), infrastructure (e.g., buildings and overhead hangings), nature elements (e.g., trees and mountains), an ego vehicle, secondary vehicles and pedestrians. Each entity has multiple variables. For example, an ego vehicle has the following variables: a speed, a number (id) identifying the lane in which it is driving, a Boolean variable indicating if its fog lights are on or off, and many others. In addition to entities and variables, our domain model includes some constraints describing valid value assignments to the variables. These constraints mostly capture the physical limitations and traffic rules. For example, the vehicle speed cannot be higher than some limit on steep curved roads. We have specified these constraints in the Object Constraint Language (OCL) . The complete domain model, together with the OCL constraints, are available in the supporting materials .
To produce a simulation scenario (or test scenario) for ADS-DNN, we develop an initial configuration
based on our domain model. An initial configuration is a vector of values assigned to the variables in the domain model and satisfying the OCL constraints. The simulator generates for each of the mobile objects defined in a scenario, namely the ego vehicle and secondary vehicles and pedestrians, a vector of the trajectory path of that object (i.e., a vector of values indicating the positions and speeds of the mobile object). The length of the vector is determined by the duration of the simulation. The position values are computed based the characteristics of the static objects, specified by the initial configuration such as roads and sidewalks, as well as the speed of the mobile objects.
|Criterion||Offline testing||Online testing|
|Definition||Test DNNs using historical data already generated manually or automatically||Test DNNs by embedding them into an application environment (virtual or real)|
|Test input||Sequences of images from a camera (real-life) or a camera model (in a simulator)||An initial configuration for a simulator to guide the simulator into generating a specific driving scenario|
|Test output||Prediction error||Safety violation|
Table I summarizes the comparison between offline and online testing as detailed in Sections II-B and II-C. Briefly, offline testing verifies the DNN using historical data consisting of sequences of images captured from real-life camera or based on a camera model of a simulator. In either case, the images are labelled with the steering angles. Offline testing measures the prediction errors of the DNN to evaluate test results. In contrast, online testing verifies the DNN embedded into an application environment in a closed-loop mode. The test inputs for online testing are initial configurations of the simulator, generated based on our domain model (see Section II-D), that guide the generation of specific scenarios. The output of online testing is whether, or not, for a given simulation scenario, a safety violation has happened. In our context, a safety violation happens when the ego car strays out of its lane such that it may risk an accident. Since offline testing relies on historical data, it has a low execution time. However, the time required to perform online testing is relatively high because it encompasses the time required for the DNN-based ADS to execute and interact with its environment. Note that the execution time in Table I only refers to the time required to perform testing and not the time or cost of generating test inputs.
In the remainder of this section, we formalize inputs and outputs for offline and online testing. We denote a real-life test dataset by a sequence of tuples. For , each tuple of consists of an image and a steering angle label. A DNN , when provided with a sequence of the images of , returns a sequence of predicted steering angles. The prediction error of for is, then, computed using two well-known metrics, Mean Absolute Error (MAE) and Root Mean Square Error (RMSE), defined below:
To generate a test dataset using a simulator, we provide the simulator with an initial configuration of a scenario as defined in Section II-D. We denote the test dataset generated by a simulator for a scenario for offline testing by .
For online testing, we embed a DNN into a simulator and run the simulator. For each (initial configuration of a) scenario, we execute the simulator for a time duration . The simulator generates outputs as well as images at regular time steps , generating outputs as vectors of size . Each simulator output and image takes an index between to . We refer to the indices as simulation time steps. At each time step , the simulator generates an image to be sent to as input, and generates a predicted steering angle which is sent to the simulator. The status of the ego car is then updated in the next time step (i.e., the time duration it takes to update the car is ) before the next image is generated. In addition to images, the simulator generates the position of the ego car over time. Since our DNN is for an automated lane keeping function, we use the Maximum Distance from Center of Lane (MDCL) metric for the ego car to determine if a safety violation has occurred. The value of MDCL is computed at the end of the simulation when we have the position vector of the ego car over time steps, which was guided by our DNN. We cap the value of MDCL at , indicating that when MDCL is or larger, the ego car has already departed its lane and a safety violation has occurred. To have a range between 0 and 1, MDCL is normalized by dividing 1.5 by the actual distance in meter.
RQ0: Can we use simulator-generated data as a reliable alternative source to real-world data? Recall that in Figure 2, we described two sources for generating test data for offline testing. As discussed there, simulator-generated test data is cheaper and faster to produce and is more amenable to input diversification compared to real-life test data. On the other hand, the texture and resolution of real-life data look more natural and realistic compared to the simulator-generated data. With RQ0, we aim to investigate whether, or not, such differences lead to significant inaccuracies in predictions of the DNN under test. To do so, we configure the simulator to generate a dataset (i.e., a sequence of labelled images) that closely resembles the characteristics of a given real-life dataset. We then compare the offline testing results for these datasets. The answer to this question, which serves as a pre-requisite of our next question, will determine if we can rely on simulator-generated data for testing DNNs in either offline or online testing modes.
RQ1: How do offline and online testing results differ and complement each other? RQ1 is the main research question we want to answer in this paper. It is important to know how the results obtained by testing a DNN irrespective of a particular application compare with test results obtained by embedding a DNN into a specific application environment. The answer will guide engineers and researchers to better understand the applications and limitations of each testing mode.
Iii-a Experimental Subjects
We use two publicly-available pre-trained DNN-based steering angle prediction models, i.e., Autumn  and Chauffeur , that have been widely used in previous work to evaluate various DNN testing approaches [25, 31, 14].
Autumn consists of an image preprocessing module implemented using OpenCV to compute the optical flow of raw images, and a Convolutional Neural Network (CNN) implemented using Tensorflow and Keras to predict steering angles. Chauffeur consists of one CNN that extracts the features of input images and a Recurrent Neural Network (RNN) that predicts steering angles from the previous 100 consecutive images with the aid of a LSTM (Long Short-Term Memory) module. Chauffeur is also implemented by Tensorflow and Keras.
The models are developed using the Udacity dataset , which contains 33808 images for training and 5614 images for testing. The images are sequences of frames of two separate videos, one for training and one for testing, recorded by a dashboard camera with 20 Frame-Per-Second (FPS). The dataset also provides, for each image, the actual steering angle produced by a human driver while the videos were recorded. A positive (+) steering angle represents turning right, a negative (-) steering angle represents turning left, and a zero angle represents staying on a straight line. The steering angle values are normalized (i.e., they are between and ) where a steering angle value indicates , and a steering angle value indicates 111This is how Tian et al.  have interpreted the steering angle values provided along with the Udacity dataset, and we follow their interpretation. We were not able to find any explicit information about the measurement unit of these values anywhere else.. Figure 5 shows the actual steering angle values for the sequence of 5614 images in the test dataset. We note that the order of images in the training and test datasets matters and is accounted for when applying the DNN models. As shown in the figure, the steering angles issued by the driver vary considerably over time. The large steering angle values (more than ) indicate actual road curves, while the smaller fluctuations are due to the natural behavior of the human driver even when the car drives on a straight road.
Table II shows the RMSE and MAE values obtained by applying the two models to the Udacity test dataset. Note that we were not able to exactly replicate the RMSE value reported on the Udacity self-driving challenge website , as the value in Table II
is slightly different from those provided by Udacity. Reproducibility is known to be a challenge for state-of-the-art deep learning methods since they involve many parameters and details whose variations may lead to different results. To enable replication of our work, we have made our detailed configurations (e.g., python and auxiliary library versions), together with supporting materials, available online .
|Model||Reported RMSE||Our RMSE||Our MAE|
While MAE and RMSE are two of the most common metrics used to measure prediction errors for learning models with continuous variable outputs, we mainly use MAE throughout this paper because, in contrast to RMSE, the MAE values can be directly compared with individual steering angle values. For example, means that the average prediction error of for the images in is (). Since MAE is a more intuitive metric for our purpose, we will only report MAE values in the remainder of our paper.
Iii-B RQ0: Comparing Offline Testing Results for Real-life Data and Simulator-generated Data
We aim to generate simulator-generated datasets closely mimicking the Udacity real-life test dataset and verify whether the prediction errors obtained by applying DNNs to the simulator-generate datasets are comparable with those obtained for their corresponding real-life ones. As explained in Section III-A, our real-life test dataset is a sequence of 5614 images labelled by their corresponding actual steering angles. If we could precisely extract the properties of the environment and the dynamics of the ego vehicle from the real-life datasets in terms of initial configuration parameters of the simulator, we could perhaps generate simulated data resembling the real-life videos with high accuracy. However, extracting information from real-life video images in a way that the information can be used as inputs of a simulator is not possible.
Instead, we propose a two-step heuristic approach to replicate the real-life dataset using our simulator. Basically, we steer the simulator to generate a sequence of images similar to the images in the real-life dataset such that the steering angles generated by the simulator are also close to the steering angle labels in the real-life dataset.
In the first step, we observe the test dataset and manually identify the information in the images that correspond to some configurable parameter values in our domain model described in Section II-D. We then create a restricted domain model by fixing the parameters in our domain model to the values we identified by observing the images in the Udacity test dataset. This enables us to steer the simulator to resemble the characteristics of the images in the test dataset to the extent possible. Our restricted domain model includes the entities and attributes that are neither gray-colored nor bold in Figure 4. For example, the restricted domain model does not include weather conditions other than sunny because the test dataset has only sunny images. This guarantees that the simulator-generated images based on the restricted domain model represent sunny scenes only. Using the restricted domain model, we randomly generate a large number of scenarios yielding a large number of simulator-generated datasets.
In the second step, we aim to ensure that the datasets generated by the simulator have similar steering angle labels as the labels in the real-life dataset. To ensure this, we match the simulator-generated datasets with (sub)sequences of the Udacity test dataset such that the similarities between their steering angles are maximized. Note that steering angle is not a configurable variable in our domain model, and hence, we could not force the simulator to generate data with specific steering angle values as those in the test dataset by restricting our domain model. Hence, we minimize the differences by selecting the closest simulator-generated datasets from a large pool of randomly generated ones. To do this, we define, below, the notion of “comparability” between a real-life dataset and a simulator-generated dataset in terms of steering angles.
Let be a set of randomly generated scenarios using the restricted domain model, and let be the Udacity test dataset where . We denote by a subsequence of with length starting from index where . For a given simulator-generated dataset corresponding to a scenario , we compute using the following three conditions:
where returns222If has multiple points of the minima, one of them is randomly returned. minimizing , and is a small threshold on the average steering angle difference between and . We say datasets and are comparable if and only if satisfies the three above conditions (i.e., 1, 2 and 3).
Given the above formalization, our approach to replicate the real-life dataset using our simulator can be summarized as follows: In the first step, we randomly generate a set of many scenarios based on the reduced domain model. In the second step, for every scenario , we identify a subsequence such that and are comparable.
If is too large, we may find whose steering angles are too different from those in . On the other hand, if is too small, we may not able to find that is comparable to for many randomly generated scenarios in the first step. In our experiments, we select () since, based on our preliminary evaluations, we can achieve an optimal balance with this threshold.
For each comparable pair and , we measure and compare the prediction errors, i.e., and of a DNN . Recall that offline testing results for a given DNN are measured based on prediction errors in terms of MAE. If (meaning of average prediction error across all images), we say that and yields consistent offline testing results for .
Among the 100 randomly generated scenarios (i.e., ), we identified 92 scenarios that could match subsequences of the Udacity real-life test dataset. Figure 5(a) shows the steering angles for an example comparable pair and in our experiment, and Figure 5(b) and 5(c) show two example matching frames from (i.e., real dataset) and (i.e., simulator-generated dataset), respectively. As shown in the steering angle graph in Figure 5(a), the simulator-generated dataset and its comparable real dataset subsequence have very similar steering angles. Note that the actual steering angles issued by a human driver have natural fluctuations while the steering angles generated by the simulator are very smooth. Also, the example matching images in Figure 5(b) and 5(c) look quite similar.
Figure 7 shows the distributions of the differences between the prediction errors obtained for the real datasets (subsequences) and the simulator-generated datasets for each of our DNNs, Autumn and Chauffeur. For Autumn, the average prediction error difference for the real datasets and the simulator-generated datasets is . Further, 96.7% of the comparable pairs show a prediction error difference below 0.1 (). This means that the (offline) testing results obtained for the simulator-generated datasets are consistent with those obtained using the real-world datasets for almost all comparable dataset pairs. On the other hand, for Chauffeur, 68.5% of the comparable pairs show a prediction error difference below . This means that the testing results between the real datasets and the simulator-generated datasets are inconsistent in 31.5% of the 92 comparable pairs. Specifically, for all of the inconsistent case, we observed that the MAE value for the simulator-generated dataset is greater than the MAE value for the real-world dataset. It is therefore clear that the prediction error of Chauffeur tends to be larger for the simulator-generated dataset than the real-world dataset. In other words, the simulator-generated datasets tend to be conservative for Chauffeur and report more false positives than for Autumn in terms of prediction errors. We also found that, in several cases, Chauffeur’s prediction errors are greater than while Autumn’s prediction errors are less than for the same simulator-generated dataset. One possible explanation is that Chauffeur is over-fitted to the texture of real images, while Autumn is not thanks to the image preprocessing module. Nevertheless, the average prediction error differences between the real datasets and the simulator-generated datasets is for Chauffeur, which is still less than . This implies that, although Chauffeur will lead to more false positives (incorrect safety violations) than Autumn, the number of false positives is still unlikely to be overwhelming.
We remark that the choice of simulator as well as the way we generate data using our selected simulator, based on carefully designed experiments such as the ones presented here, are of great importance. Selecting a suboptimal simulator may lead to many false positives (i.e., incorrectly identified prediction errors) rendering simulator-generated datasets ineffective.
Iii-C RQ1: Comparison between Offline and Online Testing Results
We aim to compare offline and online testing results in this research question. We randomly generate 50 scenarios and compare the offline and online testing results for each of the simulator-generated datasets.
For the scenario generation, we use the extended domain model (see Figure 4) to take advantage of all the feasible features provided by the simulator. Specifically, in Figure 4, the gray-colored entities and attributes in bold are additionally included in the extended domain model compared to the restricted domain model used for RQ0. For example, the (full) domain model contains various weather conditions, such as rain, snow, and fog, in addition to sunny.
Let be the set of randomly generated scenarios based on the (full) domain model. For each scenario , we prepare the simulator-generated dataset for offline testing and measure . For online testing, we measure .
Since MAE and MDCL are different metrics, we cannot directly compare MAE and MDCL values. To determine whether the offline and online testing results are consistent or not, we set threshold values for MAE and MDCL. If (meaning the average prediction error is less than ) then we interpret the offline testing result of for as acceptable. On the other hand, if (meaning that the departure from the centre of the lane observed during the simulation of is less than around one meter), then we interpret the online testing result of for as acceptable. If both offline and online testing results of are consistently (un)acceptable, we say that offline and online testing are in agreement regarding testing for .
Figure 8 shows the comparison between offline and online testing results in terms of MAE and MDCL values for all the randomly generated scenarios in where . The x-axis is MAE (offline testing) and the y-axis is MDCL (online testing). The dashed lines represent the thresholds, i.e., for MAE and for MDCL. Table III provides the number of scenarios classified by the offline and online testing results based on the thresholds. The results show that offline testing and online testing are not in agreement for 44% and 34% of the 50 randomly generated scenarios for Autumn and Chauffeur, respectively. Surprisingly, offline testing is always more optimistic than online testing for the disagreement scenarios. In other words, there is no case where the online testing result is acceptable while the offline testing result is not.
offline and online testing results
Figure 9 shows one of the scenarios on which offline and online testing disagreed. As shown in Figure 8(a), the prediction error of the DNN for each image is always less than . This means that the DNN appears to be accurate enough according to offline testing. However, based on the online testing result in Figure 8(b), the ego vehicle departs from the center of the lane in a critical way (i.e., more that ). This is because, over time, small prediction errors accumulate, eventually causing a critical lane departure. Such accumulation of errors over time is only observable in online testing, and this also explains why there is no case where the online testing result is acceptable while the offline testing result is not.
The experimental results imply that offline testing cannot properly reveal safety violations in ADS-DNNs, because it does not consider their closed-loop behavior. Having very acceptably small errors on single images does not guarantee that there will be no safety violations in the driving environment. Considering the fact that detecting safety violations in ADS is the ultimate goal of ADS-DNN testing, we conclude that online testing is preferable to offline testing for ADS-DNNs.
Iii-D Threats to Validity
We propose a two-step approach that builds simulator-generated datasets comparable to a given real-life dataset. While it achieves its objective, as shown in Section III-B2, the simulated images are still different from the real images. However, we confirmed that the prediction errors obtained by applying our subject DNNs to the simulator-generated datasets are comparable with those obtained for their corresponding real-life datasets. Thus, the experimental results that offline and online testing results often disagree with each other are valid.
We used a few thresholds that may change the experimental results quantitatively. To reduce the chances of misinterpreting the results, we selected intuitive and physically interpretable metrics directly to evaluate both offline and online test results (i.e, prediction errors and safety violations), and defined threshold values based on common sense and experience. Further, adopting different threshold values, as long as they are within a reasonable range, does not change our findings. For example, if we use as a threshold in offline testing results instead of , the numbers of scenarios in Table III change. However, it does not change the fact that we have many scenarios for which offline and online testing results disagree, nor does it change the conclusion that offline testing is more optimistic than online testing.
Though we focused, in our case study, on lane-keeping DNNs (steering prediction)—which have rather simple structures and do not support braking or acceleration, our findings are applicable to all DNNs in the context of ADS as long as the closed-loop behavior of ADS matters.
Iv Related Work
summarizes DNN testing approaches specifically proposed in the context of autonomous driving systems. Approaches to the general problem of testing machine learning systems are discussed in the recent survey by Zhang et al..
|Author(s)||Year||Testing mode||DNN’s role||Summary|
|Dreossi et al. ||2017||Offline||Object detection||Test image generation by arranging basic objects using greedy search|
|Pei et al. ||2017||Offline||Lane keeping||Coverage-based label-preserving test image generation using joint optimization with gradient ascent|
|Tian et al. ||2018||Offline||Lane keeping||Coverage-based label-preserving test image generation using greedy search with simple image transformations|
|Tuncali et al. ||2018||Online||Object detection||Test scenario generation using the combination of covering arrays and simulated annealing|
|Wicker et al. ||2018||Offline||Traffic sign recognition||
Adversarial image generation using feature extraction
|Zhang et al. ||2018||Offline||Lane keeping||Label-preserving test image generation using Generative Adversarial Networks (GANs)|
|Zhou et al. ||2018||Offline||Lane keeping||Adversarial billboard-image generation for digital and physical adversarial perturbation|
|Gambi et al. ||2019||Online||Lane keeping||Automatic virtual road network generation using search-based Procedural Content Generation (PCG)|
|Kim et al. ||2019||Offline||Lane keeping||Improving the accuracy of DNNs against adversarial examples using surprise adequacy|
|Majumdar et al. ||2019||Online||Object detection,
|Test scenario description language and simulation-based test scenario generation to cover parameterized environments|
|Zhou et al. ||2019||Offline||Object detection||Combination of Metamorphic Testing (MT) and fuzzing for 3-dimensional point cloud data|
|This paper||2019||Offline and online||Lane keeping||Comparison between offline and online testing results|
In Table IV, online testing approaches are highlighted in grey. As indicated in Table I, offline testing approaches focus on DNNs as individual units without accounting for the closed-loop behavior of a DNN-based ADS. Most of them aim to generate test data (either images or 3-dimensional point clouds) that lead to DNN prediction errors. Dreossi et al.  synthesized images for driving scenes by arranging basic objects (e.g., road backgrounds and vehicles) and tuning image parameters (e.g., brightness, contrast, and saturation). Pei et al.  proposed DeepXplore
, an approach that synthesizes images by solving a joint optimization problem that maximizes both neuron coverage (i.e., the rate of activated neurons) and differential behaviors of multiple DNNs for the synthesized images. Tian et al. presented DeepTest, an approach that generates label-preserving images from training data using greedy search for combining simple image transformations (e.g., rotate, scale, and for and rain effects) to increase neuron coverage. Wicker et al.  generated adversarial examples, i.e., small perturbations that are almost imperceptible by humans but causing DNN misclassifications, using feature extraction from images. Zhang et al.  presented DeepRoad, an approach that produces various driving scenes and weather conditions by applying Generative Adversarial Networks (GANs) along with corresponding real-world weather scenes. Zhou et al.  combined Metamorphic Testing (MT) and Fuzzing for 3-dimensional point cloud data generated by a LiDAR sensor to reveal erroneous behaviors of an object detection DNN. Zhou et al.  proposed DeepBillboard, an approach that produces both digital and physical adversarial billboard images to continuously mislead the DNN across dashboard camera frames. While this work is different from the other offline testing studies as it introduces adversarial attacks through sequences of frames, its goal was still the generation of test images to reveal DNN prediction errors. In contrast, Kim et al.  defined a coverage criterion, called surprise adequacy, based on the behavior of DNN-based systems with respect to their training data. Images generated by DeepTest were sampled to improve such coverage and used to increase the accuracy of the DNN against adversarial examples.
Online testing studies exercise the ADS closed-loop behavior and generate test driving scenarios that cause safety violations, such as unintended lane departure or collision with pedestrians. Tuncali et al.  were the first to raise the problem that previous works mostly focused on the DNNs, without accounting for the closed-loop behavior of the system. Gambi et al.  also pointed out that testing DNNs for ADS using only single frames cannot be used to evaluate closed-loop properties of ADS. They presented AsFault, a tool that generates virtual roads which cause self-driving cars to depart from their lane. Majumdar et al.  presented a language for describing test driving scenarios in a parametric way and provided Paracosm, a simulation-based testing tool that generates a set of test parameters in such a way as to achieve diversity. We should note that all the online testing studies rely on virtual (simulated) environments, since, as mentioned before, testing DNNs for ADS in real traffic is very dangerous and expensive. Further, there is a growing body of evidence demonstrating that simulation-based testing is effective at finding violations. For example, recent studies for robotic applications show that simulation-based testing of robot function models not only reveals most bugs identified during outdoor robot testing, but that it can additionally reveal several bugs that could not have been detected by outdoor testing .
In summary, even though online testing has received more attention recently, most existing approaches to testing DNN in the context of ADS focus on offline testing. We note that none of the existing techniques compare offline and online testing results, and neither do they demonstrate relative effectiveness of test datasets obtained from simulators compared to those captured from real-life.
In this paper, we distinguish two general modes of testing, namely offline testing and online testing, for DNNs developed in the context of Advanced Driving Systems (ADS). Offline testing search for DNN prediction errors based on test datasets obtained independently from the DNNs under test, while online testing focuses on detecting safety violations of a DNN-based ADS in a closed-loop mode by testing it in interaction with its real or simulated application environment. Offline testing is less expensive and faster than online testing but may not be effective at finding significant errors in DNNs. Online testing is more easily performed and safer with a simulator but we have no guarantees that the results are representative of real driving environments.
To address the above concerns, we conducted a case study to compare the offline and online testing of DNNs for the end-to-end control of a vehicle. We also investigated if we can use simulator-generated datasets as a reliable substitute to real-world datasets for DNN testing. The experimental results show that simulator-generated datasets yield DNN prediction errors that are similar to those obtained by testing DNNs with real-world datasets. Also, offline testing appears to be more optimistic than online testing as many safety violations identified by online testing were not suggested by offline testing prediction errors. Furthermore, large prediction errors generated by offline testing always led to severe safety violations detectable by online testing. Such results have important practical implications for DNN testing in the context of ADS.
As part of future work, we plan to develop an approach that effectively combines both offline and online testing to automatically identify critical safety violations. We also plan to investigate how to improve the performance of DNN-based ADS using the identified prediction errors and safety violations for further learning.
We gratefully acknowledge funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 694277) and from IEE S.A. Luxembourg.
-  (2016) Autumn model. Note: https://github.com/udacity/self-driving-car/tree/master/steering-models/community-models/autumnAccessed: 2019-10-11 Cited by: §III-A.
-  Apollo open platform. Note: https://www.esi-group.com/software-solutions/virtual-environment/virtual-systems-controls/esi-pro-sivictm-3d-simulations-environments-and-sensorsAccessed: 2019-10-11 Cited by: §II-B.
-  (2016) End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316. Cited by: §I.
-  (2016) Chauffeur model. Note: https://github.com/udacity/self-driving-car/tree/master/steering-models/community-models/chauffeurAccessed: 2019-10-11 Cited by: §III-A.
Deepdriving: learning affordance for direct perception in autonomous driving.
Proceedings of the IEEE International Conference on Computer Vision, pp. 2722–2730. Cited by: §I.
-  (2017) Deep steering: learning end-to-end driving model from spatial and temporal visual cues. arXiv preprint arXiv:1708.03798. Cited by: §I.
-  (2012) Multi-column deep neural networks for image classification. CoRR abs/1202.2745. External Links: Cited by: §I.
-  (2013-05) New types of deep neural network learning for speech recognition and related applications: an overview. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vol. , pp. 8599–8603. External Links: Cited by: §I.
-  CARLA: An open urban driving simulator. In Proceedings of the 1st Annual Conference on Robot Learning, pp. 1–16. Cited by: §II-B.
-  (2017) Systematic testing of convolutional neural networks for autonomous driving. External Links: Cited by: TABLE IV, §IV.
-  ESI pro-sivic - 3d simulations of environments and sensors. Note: https://github.com/ApolloAuto/apolloAccessed: 2019-10-11 Cited by: §II-B.
-  (2019) Automatically testing self-driving cars with search-based procedural content generation. In Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2019, New York, NY, USA, pp. 318–328. External Links: Cited by: TABLE IV, §IV.
Are we ready for autonomous driving? the kitti vision benchmark suite.
2012 IEEE Conference on Computer Vision and Pattern Recognition, Vol. , pp. 3354–3361. External Links: Cited by: §II-A.
-  (2019) Guiding deep learning system testing using surprise adequacy. In Proceedings of the 41st International Conference on Software Engineering, ICSE ’19, Piscataway, NJ, USA, pp. 1039–1049. External Links: Cited by: §II-A, §III-A, TABLE IV, §IV.
-  (2018) DeepGauge: multi-granularity testing criteria for deep learning systems. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, ASE 2018, New York, NY, USA, pp. 120–131. External Links: Cited by: §I, §II-A.
-  (2019) Paracosm: a language and tool for testing autonomous driving systems. External Links: Cited by: TABLE IV, §IV.
-  Object constraint language specification. Note: https://www.omg.org/spec/OCL/Accessed: 2019-10-11 Cited by: §II-D.
-  (2017) DeepXplore: automated whitebox testing of deep learning systems. In Proceedings of the 26th Symposium on Operating Systems Principles, SOSP ’17, New York, NY, USA, pp. 1–18. External Links: Cited by: §I, TABLE IV, §IV.
-  (2019-05) ICSE 2019 keynote: building reproducible, reusable, and robust machine learning software. Note: https://2019.icse-conferences.org/details/icse-2019-Plenary-Sessions/20/Building-Reproducible-Reusable-and-Robust-Machine-Learning-SoftwareAccessed: 2019-10-11 Cited by: §III-A.
-  (1989) Alvinn: an autonomous land vehicle in a neural network. In Advances in neural information processing systems, pp. 305–313. Cited by: §II-A.
-  (2017-07) Can robot navigation bugs be found in simulation? an exploratory study. In 2017 IEEE International Conference on Software Quality, Reliability and Security (QRS), Vol. , pp. 150–159. External Links: Cited by: §IV.
-  Supporting materials (temporal link for the double-blind review). Note: https://tinyurl.com/ICST-2020Accessed: 2019-10-14 Cited by: §I, §II-D, §III-A.
-  (2014) Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.), pp. 3104–3112. Cited by: §I.
-  PreScan: simulation of adas and active safety. Note: https://tass.plm.automation.siemens.comAccessed: 2019-10-11 Cited by: §I, §II-B.
-  (2018) DeepTest: automated testing of deep-neural-network-driven autonomous cars. In Proceedings of the 40th International Conference on Software Engineering, ICSE ’18, New York, NY, USA, pp. 303–314. External Links: Cited by: §I, §II-A, §III-A, TABLE IV, §IV, footnote 1.
-  (2018-06) Simulation-based adversarial test generation for autonomous vehicles with machine learning components. In 2018 IEEE Intelligent Vehicles Symposium (IV), Vol. , pp. 1555–1562. External Links: Cited by: TABLE IV, §IV.
-  (2016) Udacity self-driving challenge 2, ch2-001 (testing) and ch2-002 (training). Note: https://github.com/udacity/self-driving-car/tree/master/datasets/CH2Accessed: 2019-10-11 Cited by: §II-D, §III-A.
-  (2016) Udacity self-driving challenge 2. Note: https://github.com/udacity/self-driving-car/tree/master/challenges/challenge-2Accessed: 2019-10-11 Cited by: §I, §II-A, §III-A.
-  (2018) Feature-guided black-box safety testing of deep neural networks. In Tools and Algorithms for the Construction and Analysis of Systems, D. Beyer and M. Huisman (Eds.), Cham, pp. 408–426. External Links: Cited by: TABLE IV, §IV.
-  (2019) Machine learning testing: survey, landscapes and horizons. External Links: Cited by: §I, §IV.
-  (2018) DeepRoad: gan-based metamorphic testing and input validation framework for autonomous driving systems. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, ASE 2018, New York, NY, USA, pp. 132–142. External Links: Cited by: §I, §II-A, §III-A, TABLE IV, §IV.
-  (2018) DeepBillboard: systematic physical-world testing of autonomous driving systems. External Links: Cited by: §I, TABLE IV, §IV.
-  (2019-02) Metamorphic testing of driverless cars. Commun. ACM 62 (3), pp. 61–67. External Links: Cited by: TABLE IV, §IV.