Recently, a pedestrian was killed by a self-driving car in a crash in Arizona [1]. This accident has sparked a heated debate on whether it is safe to test autonomous vehicles (AVs) on public roads. Some argue that testing self-driving cars in public areas is irresponsible because of the safety risk. On the other hand, AVs are designed to operate on public roads, so a claim that an AV is highly safe is not fully convincing without on-road tests.
Besides the safety issue, on-road tests are also time-consuming and expensive. Alternatives to on-road tests have already been considered. For instance, Waymo developed a computer simulation platform for training and testing self-driving cars [2]. Studies also consider how to make smarter use of test tracks, e.g., by combining them with Augmented Reality (AR) techniques that generate virtual cars and pedestrians on a test track.
These test approaches provide plenty of choices; however, it is hard to claim that any one of them is the “best” option. Their pros and cons make it unclear which test should be implemented if only one can be chosen. On-road tests are expensive and risky, but they are the most credible. AR on-track tests are less risky, but they might overlook factors in the naturalistic driving environment and are still relatively time-consuming and expensive. Computer simulation is cheap and quick; however, it is less credible than physical tests. Moreover, test results from different sources cannot be naively combined because of their different credibility.
The nature of autonomous and intelligent systems also makes it difficult to evaluate AVs from historical performance. AVs are based on algorithms that may be updated over time, which makes it invalid to directly use historical test results and data collected before an update. Since these data still contain some information about the current version of the algorithms and are generally plentiful, the evaluation procedure would be more efficient if we had a way to link them to the current algorithm.
In this paper, we target the following problem in AV testing: how to synthesize “independent” test results from different test approaches and historical data from outdated models into the safety evaluation of an AV. We discuss an AV evaluation approach that is capable of synthesizing and integrating test results from different sources and historical data. This approach is based on co-Kriging models, in which we treat each type of test as a model with an assigned fidelity level (we treat historical data as a type of test). The model provides a response surface for the test performance function of interest and allows us to analyze the performance of a test AV using a combination of different types of test results (and historical test data). Furthermore, the proposed model can potentially be used to design new experiments, i.e., to determine which test to implement next, based on current information, in order to improve the evaluation accuracy of the model.
To link the proposed model with AV testing, we follow the test framework of the accelerated evaluation method, which was first proposed in [3]. We decompose the naturalistic driving environment into different test scenarios and use statistical models to represent the environment in each scenario. For each test scenario, we use the probability of safety-critical events to evaluate the safety level of an AV. [4] discussed an approach that uses a response surface model in the accelerated evaluation context with a single type of test results. This paper extends the use of the response surface model in [4]. Besides the results of on-road tests, the proposed approach also obtains information from other test sources, which improves the accuracy of the response surface.
The framework we propose is related to multi-fidelity models (for a review, see [5]). The co-Kriging model we adopt was originally considered in [6], where it was proposed in an optimization context. This model uses a Kriging model, or Gaussian process, to construct the response surface. We also discuss the extension of the design of experiments scheme in [4] to multi-source tests. The scheme helps us select design points smartly and therefore avoid unnecessary experiments when constructing the model. We further illustrate the difference between these two approaches using several numerical examples. We use the proposed methods to study the lane change scenario, which has been studied in [3, 7, 8].
This paper is structured as follows: Section II introduces the basics of Kriging, co-Kriging and the multi-fidelity model. Section III discusses the properties of the proposed multi-fidelity model, the design of experiments scheme and the application to AV testing. Section IV presents numerical experiments using the proposed method.
II Kriging-based Surrogate Models
In this section, we introduce the Kriging-based surrogate models that we propose to use for AV testing. We first review the basics of the Kriging model. Then we introduce the idea of co-Kriging and show how to turn this idea into a multi-fidelity model. The model is extended from [6].
II-A Basics of Kriging Model
Kriging, a model named after its developer Krige, was originally used in geostatistics [9]. The model has been used extensively in engineering since it was studied and introduced in the design of experiments context in [10]. For a wider view of Kriging applications, one can refer to [11].
Suppose we want to study a performance function on the design space . The performance function is only available through experiments (or observations). The Kriging model allows us to construct a response surface based on experiment results, for , where is the number of experiments we have collected.
Here we consider a Bayesian view of the Kriging model. The key idea of Kriging is to consider the response surface of as a posterior of a Gaussian random field (or Gaussian process) [12, 13]. A Gaussian random field for is specified by a mean function, , and a covariance function, . We denote such a Gaussian random field as
For any , is a Gaussian random variable with mean and variance . For , the covariance between and is . We assume the following structure for the mean and covariance functions: and , where , and are tunable parameters. Note that the covariance function indicates that the variance is stationary over .
We consider the above functions as the prior mean and covariance of the Gaussian random field . Let denote the experiments at and denote the corresponding observations. We use to construct a matrix , where . Let . Note that . Given observations , for any we have the posterior mean and covariance function as
returns a vector with as the th element.
This posterior Gaussian random field is the Kriging model for . In this paper, we use , . We denote and for simplicity.
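The posterior computation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the constant prior mean `beta`, the squared-exponential covariance with parameters `tau2` and `theta`, the small `jitter` term, and all function names are assumptions chosen for the example.

```python
import numpy as np

def sq_exp_cov(A, B, tau2=1.0, theta=10.0):
    """Assumed stationary covariance: tau2 * exp(-theta * ||a - b||^2)."""
    A = np.asarray(A, dtype=float).reshape(len(A), -1)
    B = np.asarray(B, dtype=float).reshape(len(B), -1)
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return tau2 * np.exp(-theta * sq_dist)

def kriging_posterior(X, y, X_new, beta=0.0, tau2=1.0, theta=10.0, jitter=1e-10):
    """Posterior mean and variance at X_new given noise-free observations (X, y)."""
    y = np.asarray(y, dtype=float)
    # Covariance matrix of the observed points (jitter only for numerical stability).
    K = sq_exp_cov(X, X, tau2, theta) + jitter * np.eye(len(y))
    # Covariance between each new point and the observed points, shape (m, n).
    k_new = sq_exp_cov(X_new, X, tau2, theta)
    mean = beta + k_new @ np.linalg.solve(K, y - beta)
    var = tau2 - np.sum(k_new * np.linalg.solve(K, k_new.T).T, axis=1)
    return mean, np.maximum(var, 0.0)

# Hypothetical one-dimensional example.
X = np.linspace(0.0, 1.0, 5)
y = np.sin(2.0 * np.pi * X)
mean_at_X, var_at_X = kriging_posterior(X, y, X)
mean_far, var_far = kriging_posterior(X, y, np.array([5.0]))
```

At the observed points the posterior mean reproduces the observations and the variance collapses to (numerically) zero, while far from the data the prediction falls back to the prior mean and variance.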
II-B Co-Kriging and Multi-fidelity Model
Here we discuss the co-Kriging model studied in [6] and show how to extend it to fit AV testing. The idea of co-Kriging is to use the sum of two Kriging models as the response surface. Because the sum of two independent Gaussian random variables is still Gaussian, the co-Kriging model is still a Gaussian random field.
Now we consider that the performance function is the sum of two factor functions and , i.e. . The factor functions are only available through experiments. We use to denote the data set for the factor function , where contains experiments and contains the corresponding observations . Similarly, denotes the data set for factor function with observations. We use to denote the whole data set (including and ).
We construct Kriging models, as described in Section II-A, for the two factor functions and independently. (Here we assume the two factors are independent, which means that the value of does not contain any information about .) We denote the Kriging model for as and the Kriging model for as . The co-Kriging model for the performance function is given by .
As we mentioned, the co-Kriging model is still a Gaussian random field. For any , is the sum of two Gaussian random variables and . Therefore, has mean and variance (because and are independent), where denote the mean functions of and , respectively, and denote the covariance functions of and , respectively.
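The additivity argument can be written out explicitly. The notation below is assumed for illustration only: \hat f_1 and \hat f_2 are the two independent Kriging models, m_i and k_i are their posterior mean and covariance functions, D_i are their data sets, and D is the whole data set.

```latex
\begin{align}
  \hat f(x) &= \hat f_1(x) + \hat f_2(x), \qquad
  \hat f_i(x) \mid D_i \sim \mathcal{N}\bigl(m_i(x),\, k_i(x, x)\bigr), \quad i = 1, 2, \\
  \mathbb{E}\bigl[\hat f(x) \mid D\bigr] &= m_1(x) + m_2(x), \\
  \operatorname{Cov}\bigl[\hat f(x), \hat f(x') \mid D\bigr] &= k_1(x, x') + k_2(x, x'), \\
  \operatorname{Var}\bigl[\hat f(x) \mid D\bigr] &= k_1(x, x) + k_2(x, x).
\end{align}
```

The cross-covariance terms vanish only because the two models are assumed independent; without that assumption the variance would carry an extra term of twice the covariance between the two factors.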
II-B2 Multi-fidelity Model
To extend the general co-Kriging model above to a multi-fidelity model, we consider the following setting. We have models with different fidelities for the performance function , denoted as , where is the fidelity level. We assume that a larger indicates a higher fidelity. In this case, is the model with the lowest fidelity and is the model with the highest fidelity. Usually, we consider , which means that the original performance function is the highest fidelity model.
For each fidelity , we have a data set , where contains experiments on and contains the corresponding observations . Here we adopt the assumption in , that is . This assumption is not necessary for the construction of the multi-fidelity model; however, it allows the multi-fidelity model to maintain a nice property of the single Kriging model (further discussed in Section II-C). We use to denote all data sets for simplicity.
The procedure for constructing the multi-fidelity model is as follows. We first construct a Kriging model for the lowest fidelity model using data set and denote the model as . We have as a Gaussian random field with mean
Then we construct the response surfaces for the other fidelities layer by layer. Starting from the fidelity , we build a Kriging model for the difference between two adjacent fidelities, i.e. . We create a data set using and , such that . (For any , we have because .) Now we use the created data set to construct a Kriging model, which we denote as . Note that is a Gaussian random field with mean (for simplicity we still use to denote the data set we use)
Then we have the response surface model for , which is given by . For convenience, we define . Now for the model with fidelity , we have
Note that each is a Gaussian random field, and therefore is still a Gaussian random field.
is our multi-fidelity model for the performance function , which is a Gaussian random field with mean function
and variance function
Compared to a Kriging model that only uses observations of the performance function (or the highest fidelity model), the proposed multi-fidelity model integrates information from models with lower fidelities while maintaining an important property of the Kriging model. In Kriging, the prediction on is exact if has already been observed. With the assumption on the data set structure, the proposed multi-fidelity model maintains this property. This means that the prediction accuracy of the proposed multi-fidelity model increases in a similar way to that of the Kriging model as the number of observations of increases.
Since the proposed multi-fidelity model links the performance function to the lower fidelity models, we are able to study how much information an experiment at a lower fidelity can bring. This linkage allows us to design experiments and choose the fidelity level that is economical with regard to the information it brings in.
As a side product, the proposed multi-fidelity model provides a response surface model for each fidelity level. This side product allows us to develop an experiment design scheme that uses lower fidelity information. We discuss this further in Section III.
The response surface models for different fidelity levels have the following property. For any fidelity parameters and at , we always have because and . This property is intuitively reasonable, since we have less information about a higher fidelity model.
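The layer-by-layer construction and the two properties discussed above (exact prediction at observed highest-fidelity points, and variance that does not decrease with the fidelity level) can be checked with a small sketch. Everything here is an assumption made for illustration: the zero-mean squared-exponential Kriging model, the parameter values, the helper names, and the two toy functions `f_low` and `f_high`.

```python
import numpy as np

def kriging(X, y, X_new, tau2=1.0, theta=50.0, jitter=1e-10):
    """Zero-mean Kriging posterior (mean, variance), assumed squared-exponential covariance."""
    def cov(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
        return tau2 * np.exp(-theta * d2)
    K = cov(X, X) + jitter * np.eye(len(X))
    k_new = cov(X_new, X)
    mean = k_new @ np.linalg.solve(K, y)
    var = tau2 - np.sum(k_new * np.linalg.solve(K, k_new.T).T, axis=1)
    return mean, np.maximum(var, 0.0)

def multi_fidelity(datasets, X_new):
    """datasets: [(X_1, y_1), ..., (X_L, y_L)], lowest fidelity first, where every
    row of X_{l+1} also appears in X_l. Returns (mean, variance) for each level."""
    X_prev, y_prev = datasets[0]
    mean, var = kriging(X_prev, y_prev, X_new)
    levels = [(mean, var)]
    for X_l, y_l in datasets[1:]:
        # Difference between adjacent fidelities at the shared design points.
        lower = {tuple(x): v for x, v in zip(X_prev, y_prev)}
        diff = np.array([v - lower[tuple(x)] for x, v in zip(X_l, y_l)])
        d_mean, d_var = kriging(X_l, diff, X_new)
        # Independent layers: means and variances add.
        mean, var = mean + d_mean, var + d_var
        levels.append((mean, var))
        X_prev, y_prev = X_l, y_l
    return levels

# Hypothetical two-fidelity example on [0, 1] with nested designs.
f_low = lambda x: np.sin(2.0 * np.pi * x)
f_high = lambda x: np.sin(2.0 * np.pi * x) + 0.3 * x
X_low = np.linspace(0.0, 1.0, 9).reshape(-1, 1)
X_high = X_low[::2]
data = [(X_low, f_low(X_low[:, 0])), (X_high, f_high(X_high[:, 0]))]
X_grid = np.linspace(0.0, 1.0, 101).reshape(-1, 1)
levels = multi_fidelity(data, X_grid)
at_high = multi_fidelity(data, X_high)[-1]
```

The nested-design assumption is what makes the difference data set well defined: each higher-fidelity point has a matching lower-fidelity observation to subtract.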
III Synthesizing Tests in Accelerated Evaluation
In this section, we discuss applying the proposed model to AV testing. More specifically, we apply the proposed model in the context of test-scenario-based AV evaluation, which has been studied in [3, 4]. We first review the problem setting in test-scenario-based AV evaluation. We then show how the model is applied to synthesize data from different test sources.
III-A Problem Setting in AV Evaluation
Accelerated evaluation [3] is an approach to efficiently evaluate the safety level of an AV. This approach evaluates an AV based on its performance in different traffic scenarios. The traffic scenarios are extracted from naturalistic driving and are considered safety-critical, since a very high percentage of crashes occur in these scenarios [15]. Examples of these scenarios are discussed in [7, 16, 17].
For each of these traffic scenarios, the uncertainty in the driving environment is modeled as stochastic (following some statistical model). The probability of safety-critical events (e.g. crashes) is used as the criterion for determining the safety level. Therefore, the task of this approach is to estimate the probability of safety-critical events in each test scenario.
The problem is mathematically defined as follows. Let denote the variable that represents the uncertainty in the driving environment, which is modeled by a distribution . We use to represent a performance function that the safety-critical events depend on and to denote the threshold for that triggers the safety-critical events (this means that indicates that a safety-critical event occurs at ).
For instance, suppose we want to estimate the probability of a crash in the lane change scenario (refer to Fig. 1). In this test scenario, a frontal human-driven vehicle cuts into the lane of a test AV. The environment uncertainty, , consists of the following variables: the velocity of the frontal car, , the relative speed between the frontal car and the test AV, , and the range between these two cars, . The uncertainty of these variables is modeled by a probability distribution. We define the performance function as the minimum range between the test AV and the frontal vehicle. We estimate , which represents the probability of a crash in this scenario, to evaluate the safety level of an AV under this test scenario.
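As a baseline for this problem setting, the probability of a safety-critical event can be estimated by crude Monte Carlo when the performance function can be evaluated directly. The sketch below uses a deliberately simple toy scenario, not the lane change model of the paper: the two Gaussian environment variables, the linear `min_range` function, the threshold of zero, and all names are hypothetical, chosen so that the estimate can be compared with a closed-form value.

```python
import math
import numpy as np

def crash_probability_mc(sample_env, performance, gamma, n, rng):
    """Crude Monte Carlo estimate of P(performance(x) < gamma), with x drawn by sample_env."""
    x = sample_env(n, rng)
    return float(np.mean(performance(x) < gamma))

# Hypothetical toy scenario: x = (initial range, range lost during the cut-in).
def sample_env(n, rng):
    initial_range = rng.normal(10.0, 2.0, size=n)
    range_lost = rng.normal(6.0, 1.5, size=n)
    return np.column_stack([initial_range, range_lost])

def min_range(x):
    # Toy performance function: remaining range after the cut-in.
    return x[:, 0] - x[:, 1]

rng = np.random.default_rng(0)
p_hat = crash_probability_mc(sample_env, min_range, gamma=0.0, n=400_000, rng=rng)

# For this toy model min_range ~ N(4, 2.5^2), so the exact probability is Phi(-4 / 2.5).
p_exact = 0.5 * (1.0 + math.erf((0.0 - 4.0) / (2.5 * math.sqrt(2.0))))
```

In a real test each evaluation of the performance function is an experiment, which is why a response surface that needs far fewer evaluations is attractive.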
III-B Synthesizing Tests
When we want to evaluate an AV in a certain test scenario, we have several test resources to select from. Here we consider these different test resources as models with different fidelities. We rank the fidelity levels of the test resources in an arbitrary way (the ranking need not be “correct” for the model to work, but it will affect the accuracy). For example, we consider the on-road test as the highest fidelity model, since this is the “true” test in the evaluation. Next we rank the AR test, which has lower fidelity than the on-road test but still maintains the check on the physical part of the test AV. A pure computer simulation of the vehicle algorithm is considered to have lower fidelity than the AR test, since physical parts are not considered in this case. Lastly, we consider historical data or test results for similarly designed AVs as the lowest fidelity model.
After we rank the different test resources by fidelity level, given the data sets collected from these tests for , we are able to use the proposed multi-fidelity model to construct a response surface for . The procedure follows the multi-fidelity model construction in Section II-B and is illustrated in Fig. 2. Following , we use the response surface to estimate the probability of safety-critical events. The estimation is given by
where the inner part denotes the probability of given the value of and the outer expectation is over the distribution .
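Because the response surface is Gaussian at every point, the inner probability has a closed form in terms of the normal CDF, and the outer expectation can be approximated by averaging over samples of the environment variable. The following is a minimal sketch under stated assumptions: the event is taken to be "performance below the threshold `gamma`", `mean_fn` and `std_fn` stand for the posterior mean and standard deviation of the response surface, and the toy surrogate in the example is hypothetical.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def normal_cdf(z):
    """Standard normal CDF evaluated elementwise."""
    return 0.5 * (1.0 + _erf(np.asarray(z, dtype=float) / math.sqrt(2.0)))

def event_probability(mean_fn, std_fn, gamma, x_samples):
    """Average over x of P(surface(x) < gamma | x) for a Gaussian response surface."""
    m = mean_fn(x_samples)
    s = np.maximum(std_fn(x_samples), 1e-12)  # guard against zero variance
    return float(np.mean(normal_cdf((gamma - m) / s)))

# Hypothetical check: x ~ N(0, 1), surface mean m(x) = x, constant std sigma.
rng = np.random.default_rng(1)
x_samples = rng.normal(0.0, 1.0, size=200_000)
sigma = 0.5
p_hat = event_probability(lambda x: x, lambda x: np.full_like(x, sigma),
                          gamma=-1.0, x_samples=x_samples)

# For this toy surrogate the exact value is Phi(gamma / sqrt(1 + sigma^2)).
p_exact = float(normal_cdf(-1.0 / math.sqrt(1.0 + sigma ** 2)))
```

Where the response surface is certain (standard deviation near zero), the inner term degenerates to an indicator of the event, so the estimator reduces to crude Monte Carlo on the predicted mean.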
As we pointed out in Section II-C, besides the probability estimation, the multi-fidelity model can be used to provide a guideline for designing experiments. More specifically, here we want to collect new data and use to construct a better model. We need to decide the fidelity levels ’s and the values of in , so that we can run experiments at those ’s and ’s to collect the responses . In [4, 18], an experiment design scheme for Kriging based on the information gain (IG) is discussed. Here, we define the IG at point for the model with fidelity as
where denotes the probability estimation with the samples in the sample set and denotes the probability estimation with sample set and an additional sample . Here we use the response surface for the model with fidelity to compute the IG, which takes advantage of the “side product” of the multi-fidelity model. Given that the cost of implementing an experiment at for the model with fidelity is , similar to [19], we set our design selection criterion to be
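One common way to combine information gain with experiment cost is to select the fidelity level and design point with the largest gain per unit cost. The sketch below assumes that ratio form, which may differ from the exact criterion intended here; the function name, the array layout, and the numbers are all hypothetical.

```python
import numpy as np

def select_experiment(info_gain, cost):
    """Return (fidelity index, candidate index, score) maximizing info_gain / cost.

    info_gain[l, j]: information gain of candidate point j at fidelity l.
    cost: positive costs, broadcastable to the shape of info_gain.
    """
    info_gain = np.asarray(info_gain, dtype=float)
    cost = np.broadcast_to(np.asarray(cost, dtype=float), info_gain.shape)
    score = info_gain / cost
    l, j = np.unravel_index(np.argmax(score), score.shape)
    return int(l), int(j), float(score[l, j])

# Hypothetical values: three fidelity levels (rows), three candidate points (columns).
info_gain = np.array([[0.010, 0.030, 0.020],
                      [0.040, 0.050, 0.045],
                      [0.060, 0.080, 0.070]])
cost = np.array([[1.0], [5.0], [50.0]])  # one cost per fidelity level
best = select_experiment(info_gain, cost)
```

With these numbers the cheapest fidelity wins even though the most expensive one carries the largest raw gain, which is the trade-off such a criterion is meant to expose.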
IV Numerical Experiments
In this section, we consider two numerical experiments to show the advantage of the proposed model. We first consider a one-dimensional problem to illustrate the proposed method. Then we apply the proposed model to an AV test scenario.
IV-A Illustration Example
To illustrate how the proposed model integrates experiment results from models with different fidelities, we set up a one-dimensional problem as follows. Suppose we are interested in the performance function and we have two models and that are approximations of . For any design variable , the responses of are unknown and need to be observed from an experiment at . An experiment on the models has a lower cost than an experiment on , while has a lower cost but less accuracy (lower fidelity).
In this example we assume that the real performance function of interest is
The model with higher fidelity is represented by
and the model with lower fidelity is given by
We consider the design space as . Fig. 3 shows the responses of these three functions at different . We observe that the two models roughly capture the shape of the performance function, and the higher fidelity model is a better approximation to the performance function. Note that according to the notation used in this paper, we have .
With the above setting, let us consider that we have some experiment results from these models and we use these results to construct a response surface for the performance function. For the performance function , we have experiment results at . For the higher fidelity model , we have experiment results at . For the lower fidelity model , we have experiment results at . (We have more experiment results for lower fidelity models.)
We first construct a Kriging model with the experiment results from the performance function. We use the Kriging model as the baseline for the response surface models. Fig. 4 shows the Kriging model we obtain. We observe that the mean of the response surface (blue solid line) is not close to the real function (green dashed line) in most of the region (e.g. and ) and the 95% confidence interval (red dotted line) does not contain the real function in . Since the Kriging model is built with only 4 data points (the blue circles in Fig. 4), the inaccuracy of the response surface is expected.
Now we consider the experiment results from the models . We first consider the results from the higher fidelity model and use the proposed model to construct a response surface as described in Section II-B. Fig. 5 shows the response surface of the multi-fidelity model. The data points from are represented as orange squares. The resulting response surface is closer to the real function than the Kriging model (in Fig. 4): the mean (blue solid line) has a similar shape and the confidence interval (red dotted line) contains the real function almost everywhere in the region. This improvement is also confirmed by the mean squared error (MSE) of the mean of the response surfaces. The MSE decreases from 0.0572 (the Kriging model in Fig. 4) to 0.0093 (the multi-fidelity model in Fig. 5). Because the tail region does not have any experiments, the shape of the true function is still not well captured there (although the confidence interval still contains the real function in most of it).
We then further take the experiments from the lower fidelity model into consideration. We construct a multi-fidelity model with experiments from all models (from ). The response surface is shown in Fig. 6, where we use yellow asterisks to represent the experiments from . Compared to the response surface in Fig. 5, this response surface gives better predictions in the region , while it has a similar response in the rest of the design space. The improvement in the tail region further decreases the MSE to 0.0087 (compared to 0.0093 in Fig. 5).
The comparison of the three response surfaces shows how the proposed model improves the prediction using lower fidelity models. In regions where experiments on higher fidelity models are not available, the model uses the information provided by the lower fidelity models. On the other hand, involving lower fidelity experiments does not change the prediction in regions with sufficient higher fidelity experiments.
IV-B Implementation to AV Test Scenario
Here we study the test scenario example described in Section III-A. In this test scenario, we want to study the performance of a test AV in the driving environment. The performance function of interest is the minimum range between the test vehicle and the cut-in vehicle. Note that the input of the function consists of three variables .
To implement the proposed model, we use experiment results from two models. We first select design points for the input using a mesh grid design that has from 5 m/s to 35 m/s with a 2 m/s interval, from 0 m/s to 30 m/s with a 2 m/s interval, and from 0.1 to 1 with a 0.1 interval. We denote the design point set as , which contains 2,560 design points. We collect experiment results on these design points from the real performance function and denote the experiment set as . We then randomly split the set into two, and , where contains 1,000 samples. We further extract 500 samples from , and denote the obtained set as . We then perturb the experiment results in by uniform noise drawn from . Now we consider as the data set from the lower fidelity model and as the data set from the higher fidelity model (or real function).
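The data preparation can be sketched as follows. This is one reading of the procedure, not a reproduction of the experiment: it assumes the 1,000-sample subset is the one perturbed and used as the lower fidelity data, the 500 samples drawn from it (unperturbed) form the higher fidelity data, and the remaining points are held out for testing. The performance function `toy_performance`, the noise half-width, and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Mesh grid design: 16 x 16 x 10 = 2,560 design points.
x1 = np.arange(5.0, 35.0 + 1e-9, 2.0)   # 5 to 35 m/s, step 2 m/s
x2 = np.arange(0.0, 30.0 + 1e-9, 2.0)   # 0 to 30 m/s, step 2 m/s
x3 = np.linspace(0.1, 1.0, 10)          # 0.1 to 1, step 0.1
grid = np.stack(np.meshgrid(x1, x2, x3, indexing="ij"), axis=-1).reshape(-1, 3)

def toy_performance(x):
    """Hypothetical stand-in for the minimum-range performance function."""
    return 1.0 / x[:, 2] + 0.05 * x[:, 0] - 0.1 * x[:, 1]

y_all = toy_performance(grid)

# Random split: 1,000 points for model building, the rest held out.
perm = rng.permutation(len(grid))
idx_low, idx_held_out = perm[:1000], perm[1000:]

# Higher fidelity data: 500 unperturbed samples taken from the 1,000-point subset,
# so the higher fidelity design is nested in the lower fidelity design.
idx_high = rng.choice(idx_low, size=500, replace=False)

# Lower fidelity data: the 1,000 responses perturbed by uniform noise.
noise_half_width = 0.5  # hypothetical value
y_low = y_all[idx_low] + rng.uniform(-noise_half_width, noise_half_width, size=1000)

low = (grid[idx_low], y_low)
high = (grid[idx_high], y_all[idx_high])
held_out = (grid[idx_held_out], y_all[idx_held_out])
```

Drawing the higher fidelity points from the lower fidelity subset keeps the designs nested, which matches the data set assumption of the multi-fidelity model in Section II-B.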
Similar to Section IV-A, we first use the higher fidelity data set to construct a Kriging model and then use and to construct a multi-fidelity model. We compare the two models to show the advantage of taking lower fidelity experiments into consideration. In this case, we use as the test data set to compute the MSE of the response surface mean of the two models. The results show that the proposed model reduces the MSE from 3.3948 for the Kriging model to 2.3836.
References
[1] D. Wakabayashi, “Self-driving Uber car kills pedestrian in Arizona, where robots roam.” [Online]. Available: https://www.nytimes.com/2018/03/19/technology/uber-driverless-fatality.html
[2] A. C. Madrigal, “Inside Waymo’s secret world for training self-driving cars,” Aug 2017. [Online]. Available: https://www.theatlantic.com/technology/archive/2017/08/inside-waymos-secret-testing-and-simulation-facilities/537648/
[3] D. Zhao, H. Lam, H. Peng, S. Bao, D. J. LeBlanc, K. Nobukawa, and C. S. Pan, “Accelerated Evaluation of Automated Vehicles Safety in Lane-Change Scenarios Based on Importance Sampling Techniques,” IEEE Transactions on Intelligent Transportation Systems, 2016.
[4] Z. Huang, H. Lam, and D. Zhao, “Towards affordable on-track testing for autonomous vehicle-a kriging-based statistical approach,” Proceedings of the IEEE 20th International Intelligent Transportation Systems Conference, 2017.
[5] M. G. Fernández-Godino, C. Park, N.-H. Kim, and R. T. Haftka, “Review of multi-fidelity models,” arXiv preprint arXiv:1609.07196, 2016.
[6] A. I. Forrester, A. Sóbester, and A. J. Keane, “Multi-fidelity optimization via surrogate modelling,” in Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, vol. 463, no. 2088. The Royal Society, 2007, pp. 3251–3269.
[7] Z. Huang, D. Zhao, H. Lam, and D. J. LeBlanc, “Accelerated Evaluation of Automated Vehicles Using Piecewise Mixture Models,” IEEE Transactions on Intelligent Transportation Systems, 2016.
[8] Z. Huang, Y. Guo, H. Lam, and D. Zhao, “A versatile approach to evaluating and testing automated vehicles based on kernel methods,” American Control Conference, 2018.
[9] D. Krige, Two-dimensional weighted moving average trend surfaces for ore evaluation. South African Institute of Mining and Metallurgy Johannesburg, 1966.
[10] J. Sacks, W. J. Welch, T. J. Mitchell, and H. P. Wynn, “Design and analysis of computer experiments,” Statistical Science, pp. 409–423, 1989.
[11] J. P. Kleijnen, “Kriging metamodeling in simulation: A review,” European Journal of Operational Research, vol. 192, no. 3, pp. 707–716, 2 2009.
[12] C. E. Rasmussen, “Gaussian Processes in Machine Learning.” Springer Berlin Heidelberg, 2004, pp. 63–71.
[13] J. Staum, “Better simulation metamodeling: The why, what, and how of stochastic kriging,” in Proceedings of the 2009 Winter Simulation Conference (WSC). IEEE, 12 2009, pp. 119–133.
[14] B. Ankenman, B. L. Nelson, and J. Staum, “Stochastic kriging for simulation metamodeling,” Operations Research, vol. 58, no. 2, pp. 371–382, 2010.
[15] W. G. Najm, S. Toma, and J. Brewer, “Depiction of priority light-vehicle pre-crash scenarios for safety applications based on vehicle-to-vehicle communications,” Tech. Rep., 2013.
[16] X. Wang, D. Zhao, H. Peng, and D. J. LeBlanc, “Analysis of unprotected intersection left-turn conflicts based on naturalistic driving data,” in 2017 IEEE Intelligent Vehicles Symposium (IV), June 2017, pp. 218–223.
[17] B. Chen, D. Zhao, and H. Peng, “Evaluation of automated vehicles encountering pedestrians at unsignalized crossings,” in 2017 IEEE Intelligent Vehicles Symposium (IV), June 2017, pp. 1679–1685.
[18] Z. Huang, H. Lam, and D. Zhao, “Sequential experimentation to efficiently test automated vehicles,” Proceedings of the Winter Simulation Conference 2017, 2017.
[19] R. Stroh, J. Bect, E. Vazquez, S. Demeyer, and N. Fischer, “Sequential design of experiment on a stochastic multi-fidelity simulator to estimate a probability of exceeding a threshold,” in Journées annuelles du GdR MASCOT NUM (MASCOT NUM 2017), 2017.