I Introduction
Advanced Driver Assistance System (ADAS) or Automated Driving System (ADS) equipped Connected and Automated Vehicles (CAVs) operate in a mixed traffic environment with various traffic participants (e.g., pedestrians, cyclists, and different types of vehicles) and environmental disturbances (e.g., road gradients, surface friction, and weather conditions). In general, to ensure the safe performance of a Subject Vehicle (SV) or a fleet of SVs (e.g., a group of CAVs) in the real-world mixed traffic driving environment (also referred to as the naturalistic driving environment in the literature [feng2020testing] and the “nominal driving environment” in the remainder of this paper), one typically follows a two-step procedure with testing and analysis. First, the testing procedure deploys the SV (or a fleet of SVs) in the environment and acquires the traffic interactions and other observable infrastructure information [altekar2021infrastructure] near each SV with a certain data acquisition system. This creates a set of finite observations sampled from the nominal driving environment. Note that the environment can consist of simulated scenes, real-world on-road operation, or controlled testing and proving grounds. Second, the analysis procedure summarizes the safety performance from the finite sampled observations and seeks to generalize the understanding, intuitively or provably, to the non-sampled unobserved cases. In this paper, we only focus on the analysis step, which involves one or more specific safety metrics to which the previously collected data is presented as it stands.
As a toy example, suppose one observes a certain SV operating safely (without collisions, human driver engagements, or traffic rule violations) while navigating one mile from the Empire State Building to Times Square (both attractions in New York City, United States) at 6 P.M. on a weekday, through a crowd of vehicles, pedestrians, cyclists, and an intersection with traffic lights. A safety measure then seeks to infer the SV’s overall safety performance in the mixed-traffic driving environment from the collected one-mile observation.
The first class of measures is known as leading measures as they “reflect performance, activity, and prevention” [fraade2018measuring], such as infractions (i.e., noncriminal violations of state and local traffic law) [censi2019liability] and disengagements [favaro2018autonomous]. In general, it is expected that the leading measure outcomes from the one-mile trip imply a certain safety property, yet such an implication is mostly intuitive (e.g., the disengagement rate observed within one mile does not necessarily hold for the rest of the trip).
On the other hand, the lagging measures are primarily interested in safety outcomes or harm [fraade2018measuring]. They can be further classified as observed failures, predictive failures, and inferred failures. As a collision is the most well-adopted failure event in the literature, it is considered interchangeable with failure for the remainder of this paper.
First, the observed failures share the same spirit with many aforementioned leading measures. For example, the observed collision rate in the one-mile trip does not necessarily hold as the vehicle operation proceeds. It can also be expanded from the scalar value measure to a more complex group of collision ratings [schwall2020waymo], yet the above-mentioned problem still remains. Second, the predictive failure is often derived by asserting surrogate models and assumptions [bowen2020presentation, weng2021model]. Hence some lagging safety measures are also referred to as surrogate safety measures in the literature [wang2021review]. One well-adopted combination of assumption and surrogate model is the steady-state assumption (all road users maintain the current velocity and heading) with the linear double integrator dynamics, leading to a series of classic safety measures including time-to-collision (TTC) [lee1976theory] and the minimum safe distance (MSD) based variants [wishart2020driving]. Some recently proposed metrics, developed as more complex dynamic and behavioral models are considered, include the Responsibility-Sensitive Safety (RSS) [shalev2017formal] based method, Instantaneous Safety Metric (ISM) [every2017novel], criticality metric [junietz2018criticality], and Model Predictive Instantaneous Safety Metric (MPrISM) [weng2020model], which all belong to a class of model predictive safety measures [weng2021model]. Note that many of the lagging measures generalize the finite observations to the non-sampled cases to some extent, but the generalization relies heavily on asserted surrogate models and assumptions, which are mostly invalid in the real-world mixed traffic driving environment [bowen2020presentation, weng2021model].
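To make the steady-state assumption concrete, the following is a minimal sketch of a TTC computation for a lead-vehicle following scene. The function name, argument names, and units (meters, meters per second) are illustrative assumptions and are not taken from the cited references; both vehicles are assumed to keep their current speed and heading.

```python
import math


def time_to_collision(gap_m: float, sv_speed: float, lead_speed: float) -> float:
    """TTC under the steady-state assumption (both speeds stay constant).

    gap_m      -- bumper-to-bumper distance between the SV and the lead vehicle
    sv_speed   -- current speed of the following subject vehicle
    lead_speed -- current speed of the lead vehicle
    """
    closing_speed = sv_speed - lead_speed
    if closing_speed <= 0.0:
        # The SV is not closing in, so no collision is predicted.
        return math.inf
    return gap_m / closing_speed


print(time_to_collision(20.0, 15.0, 10.0))  # 4.0
```

The infinite value returned when the SV is not closing in shows why such a measure is only defined on part of the data; whether the prediction is meaningful depends entirely on how well the constant-speed assumption holds.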
Finally, in contrast to the predictive failure based lagging measures that generalize the safety assessment to the “imaginary” domain, the statistically inferred failure rate estimate is an unbiased safety assessment generalization from the observed samples to the nominal driving environment. One representative method in this category comes from Fraade et al. [fraade2018measuring], which uses the Monte Carlo sampling approach to provide the finite-sampling safety assurance by inferring the SV’s fatality rate estimate from consecutively operating for a certain number of miles safely. If applied to the aforementioned one-mile trip example with a confidence level of 90%, the SV has an inferred fatality rate of 90 million fatalities per 100 million miles. Although the 90% risk is rigorously provable, the safety measure outcome is essentially invariant to the mixed traffic environment, as one still obtains the same values if the vehicle safely operates on the same route on empty streets at 3 A.M. (i.e., no other traffic objects are present). Note that the importance sampling based technique [ding2011toward] has been shown capable of improving the sampling efficiency of the Monte Carlo sampling methods. However, the accuracy of the estimated failure rate relies heavily on the accuracy of the estimated importance function, which is not a provable condition in general.
Another line of research on formal safety analysis relies on a model-based approach where one first approximates a certain probabilistic model, parametric or nonparametric, from the observed data and then derives the risk rate estimate [aasljung2019probabilistic], information gain [collin2021plane], and other safety-related properties [hejase2020methodology] using the obtained model. This shares a similar problem with the aforementioned importance sampling based methods, as the safety outcome estimate is unbiased only if the approximated model is also unbiased with analytically justifiable variance, which remains an open challenge to date.
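For the statistically inferred fatality rate discussed above, the one-mile figure can be reproduced with a short calculation. The sketch below assumes the standard zero-failure binomial bound, in which every mile is treated as an independent trial; this closed form is an assumption made for illustration and is not quoted from [fraade2018measuring]. The function name and arguments are likewise hypothetical.

```python
def failure_rate_upper_bound(failure_free_miles: float, confidence: float) -> float:
    """Per-mile failure rate r solving (1 - r) ** miles == 1 - confidence.

    Assumed zero-failure binomial bound: if the true rate exceeded r, a
    failure-free run of this length would occur with probability below
    1 - confidence.
    """
    if failure_free_miles <= 0 or not 0.0 < confidence < 1.0:
        raise ValueError("need positive mileage and a confidence in (0, 1)")
    return 1.0 - (1.0 - confidence) ** (1.0 / failure_free_miles)


# One failure-free mile at 90% confidence: roughly 0.9 per mile,
# i.e. about 90 million per 100 million miles.
rate_per_mile = failure_rate_upper_bound(1, 0.90)
print(rate_per_mile * 100e6)
```

The bound depends only on the mileage and the confidence level, which is exactly why the outcome does not change between a crowded 6 P.M. trip and an empty 3 A.M. trip.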
To a certain extent, existing efforts seek to establish a CAV safety measurement that is monotonic w.r.t. the SV’s safety performance (e.g., a lower TTC value indicates a less safe SV behavior than a higher TTC value). This is generally true if other variables are controlled properly. One particularly important variable is the SV’s operable domain. As we have discussed before, the leading measures fail to control the domain variables since the generalization is biased. The predictive failure based lagging measures also fail, for while the generalization is provably true in a certain predictive domain, it does not necessarily align with the nominal driving environment. Finally, for the statistically inferred collision rate [fraade2018measuring], the operable domain is invariant, as the required total mileage to claim a certain fatality rate with a given confidence level does not change as one moves from the lead-vehicle following domain to a more complex operable domain involving mixed-traffic interactions. Moreover, the particular SV driving behavior also partially affects its operable domain construction. As a result, the notion of one vehicle being safer than the other is mostly problematic, as it is essentially a multi-dimensional comparison. This will be demonstrated in detail through a series of examples in Section IV.
In summary, to make a competitive safety measurement for the SV that resolves the various mentioned problems of existing methods, the following two questions need to be jointly and rigorously addressed:

Q1: Where (in terms of the operable domain) will the vehicle be statistically safe within the nominal driving environment?

Q2: Supplied with a certain operable domain, how safe will the vehicle be within the given domain?
To the best of our knowledge, there does not exist a safety metric that rigorously addresses the above two questions simultaneously.
In this paper, we propose a novel safety metric using the α-shape [akkiraju1995alpha] and the almost set invariance property [weng2021towards, weng2021formal]. Given the driving data collected from a certain testing procedure, the proposed method first rearranges the data to formulate the Operational State Space (OSS) of a multi-agent system that admits measurable states and other non-observable uncertainties. One then characterizes an Operational Design Domain (ODD) as a subset of the formulated OSS that is “almost” forward invariant for the multi-agent dynamics. As the characterized domain does not intersect with the set of failure events, the SV is also almost safe in the given domain except for an arbitrarily small subset with a certain confidence level. The main contributions of this paper are summarized as follows.
An operational domain specific safety indicator: The proposed method characterizes an operational domain specific set using the α-shape and other coverage properties, which formally answers question Q1. The effectiveness of the proposed methodology is empirically demonstrated through a group of challenging cases. The study not only includes the classic three-dimensional lead-vehicle following domain, but also considers the challenging vehicle-to-vehicle and vehicle-pedestrian interactions with up to a 17-dimensional state space.
An unbiased safety indicator: The almost robustly forward invariant set is a provably unbiased safety indicator that generalizes the observation from sampled driving data to the unobserved domain. In particular, given a certain confidence level, the probability coefficient answers question Q2 by provably quantifying the performance of the SV statistically within the constructed set. The process does not involve any asserted behavioral assumptions, distribution estimates, or model fitting.
Empirical evaluation: The empirical performance of the proposed method is demonstrated in a series of cases covering a variety of fidelity levels (real-world and simulators), driving environments (highway, urban, and intersections), road users (car, truck, and pedestrian), and SV driving behaviors (human driver and self-driving algorithms).
I-A Constructions and Notation
Notation: The set of real and positive real numbers are denoted by and , respectively. denotes the set of all positive integers and . The norm is denoted by . is the cardinality of the set , e.g., for a finite set , denotes the total number of points in . Let be the boundary of the set . Some commonly adopted acronyms are also adopted including i.i.d. (independent and identically distributed), w.r.t. (with respect to), and w.l.o.g. (without loss of generality).
In the remainder of the paper, Section II presents the preliminaries along with the formulation of the finite-sampling operable domain quantification problem. Section III introduces details of the proposed safety metric. The empirical performance of the proposed metric is demonstrated in Section IV. Section V summarizes the paper and discusses future work of interest.
II Preliminaries and Problem Formulation
II-A Mixed-Traffic Environment Formulation
Consider the mixed traffic environment as a timevariant heterogeneous multiagent system of agents at time where the th () agent admits the motion dynamics as
(1) 
with its own state as well as disturbances and uncertainties. Note that the agent is not limited to dynamic road users (e.g., vehicles, pedestrians, cyclists), but can also include other environmental features such as the traffic light color, stop sign position, weather condition, and road surface friction. Let the index 0 denote the test SV. For a fleet of SVs, one can assign the index 0 to each SV iteratively for further analysis, as the safety of each individual SV ensures the overall safety of the fleet. In general, the above system can be very complex, as the real-world driving environment has a very large number of agents, which also varies with respect to time.
The desired Operational Design Domain (ODD) is thus introduced to specify the set within which the SV is expected to operate safely. Formal specifications of the ODD are further derived from an OSS with explicitly defined observable states and implicitly induced disturbances and uncertainties. Some examples of OSS specifications are presented in Fig. 1. This paper is primarily focused on three OSS specifications that are explained as follows.
II-A1 The lead-vehicle following domain
This OSS characterizes the lead-vehicle following safety performance. It incorporates all instances from the on-road driving data with a preceding vehicle present in the same lane as the SV. It is applicable for many ADAS features such as Automatic Emergency Braking (AEB) and Traffic-Jam Assist (TJA). The lead-vehicle following domain is also a commonly studied instance with other domain specifications incorporating time duration [arief2021deep] and assumed hybrid control modes [fan2017d]. In this paper, we consider a more general configuration than the aforementioned references, as the state specification takes the speed of both vehicles and the bumper-to-bumper distance headway (DHW) between the two vehicles as the states of interest. The specification is applicable for both straight road segments (Fig. 1(c)) and curved roads (Fig. 1(d)), i.e., the road curvature is considered to be part of the disturbances and uncertainties, as are other factors such as road gradients, weather condition, and road surface friction.
II-A2 The multi-vehicle interaction domain
This OSS defines the SV’s interaction with nearby vehicles. All the position states are represented with respect to the SV’s local coordinates. The near-SV region is divided into 6 subregions: front-left (fl), front-center (fc), front-right (fr), rear-left (rl), rear-center (rc), and rear-right (rr). The left, the center, and the right regions are typically determined by the lane width. Within each region, the nearest vehicle is determined through the center-to-center norm distance against the SV. Two features of each nearest vehicle are selected: the bumper-to-bumper longitudinal distance clearance against the SV and the vehicle speed. When presented with an alongside vehicle (i.e., the vehicles partially overlap longitudinally) on either side of the SV, the bumper-to-bumper distances are set to zero as shown in Fig. 1(b). Combined with the SV’s speed, we have a 13-dimensional state space. If a particular subregion is empty or if any of the states falls outside the domain of interest defined by the given bounds, a fixed low-risk state is assigned (e.g., when the front-center region is empty, fixed low-risk values are assigned to its distance and speed states). To have a valid state, at least one of the six subregions must remain non-empty with a vehicle satisfying the state bounds. The lateral distances are treated as uncertainties, as each subregion is limited by the lane width, which already provides certain sideways localization information. Some other examples of disturbances and uncertainties include the presence of other dynamic road users, the road curvature, and different road infrastructures. A similar multi-vehicle configuration is also adopted by other studies for scenario extraction purposes [hauer2020clustering] and driver behavioral modeling [yan2021distributionally].
II-A3 The vehicle-pedestrian interaction domain
This OSS is primarily concerned with the SV interacting with pedestrians. Only pedestrians in front of the SV are involved in the specification, for responsibility-oriented reasons [shalev2017formal]. The front-left corner and the front-right corner of the SV are the reference points. Then, the nearest pedestrian to each reference point in terms of 2-norm distance is considered as the pedestrian of interest. For each pedestrian of interest, its longitudinal offset and lateral offset from the corresponding reference point are selected as part of the states. Combined with the SV’s velocity, we have the 5-dimensional state space for the vehicle-pedestrian interaction domain.
Note that the above OSSs and possibly other variants can be further combined to formulate various mixed traffic operational environments. For example, combining the multi-vehicle interaction domain with the vehicle-pedestrian interaction domain results in a 17-dimensional state space, as the SV speed is shared by the two domains.
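The dimension counts of the OSS specifications can be checked with a small sketch that assembles the state vectors. Everything named here is an illustrative assumption: the ordering of the state components, the helper names, and in particular the `LOW_RISK` default pair, whose numerical values are hypothetical placeholders for the fixed low-risk state.

```python
REGIONS = ("fl", "fc", "fr", "rl", "rc", "rr")  # six near-SV subregions


def multi_vehicle_state(sv_speed, nearest, low_risk):
    """SV speed plus (distance, speed) of the nearest vehicle per subregion.

    nearest  -- dict mapping a region name to a (distance, speed) pair;
                regions without a valid vehicle are simply absent
    low_risk -- (distance, speed) pair assigned to absent regions
    """
    if not any(region in nearest for region in REGIONS):
        raise ValueError("at least one subregion must contain a vehicle")
    state = [sv_speed]
    for region in REGIONS:
        distance, speed = nearest.get(region, low_risk)
        state.extend((distance, speed))
    return state


def pedestrian_state(sv_speed, left_offsets, right_offsets):
    """SV speed plus (longitudinal, lateral) offsets of two pedestrians."""
    return [sv_speed, *left_offsets, *right_offsets]


def combined_state(vehicle_part, pedestrian_part):
    """Merge both domains; the SV speed appears only once."""
    return vehicle_part + pedestrian_part[1:]


LOW_RISK = (100.0, 30.0)  # hypothetical low-risk distance (m) and speed (m/s)
mv = multi_vehicle_state(25.0, {"fc": (32.0, 24.0), "rl": (12.5, 27.0)}, LOW_RISK)
ped = pedestrian_state(25.0, (15.0, -2.0), (15.4, -3.8))
full = combined_state(mv, ped)
print(len(mv), len(ped), len(full))  # 13 5 17
```

The combined dimension is 13 + 5 - 1 because both specifications carry the same SV speed.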
Remark 1.
The driving data studied in this paper can be collected from both on-road tests and scenario-based tests, as long as the data collection follows the nominal distribution of the mixed-traffic driving environment within which the SV is being tested.
W.l.o.g., let there be some states from the collected driving data consistent with the given . We then have the primary driving data which comprises finite observations that allow us to implement the safety analysis of the SV’s performance in . Moreover, some of the states are consecutively collected in time w.r.t. the same SV, which is further referred to as a trajectory . In this paper, we often extract state transition pairs from all trajectories in as , where are two consecutively collected states in time w.r.t. the same SV. Note that each state also inherently admits a certain motion dynamics as
(2) 
In this paper, our focus is to present a safety performance evaluation metric that identifies the real operable domain in a data-driven manner (from the collected driving data) and identifies its safety-related properties. This is formally presented as the finite-sampling operable domain quantification problem, as we shall introduce in the following section.
II-B Finite-Sampling Safety Assurance with Set Invariance
Given a certain set, if one continuously observes sampled state transitions staying inside it, then the set is potentially forward invariant. To formally quantify such a statistical potential, we introduce the almost forward invariant set adapted from [weng2021towards] as follows.
Definition 1.
[Almost Robustly Forward Invariant Set] Let , the set is almost robustly forward invariant for (2) if
(3) 
To further relate the above definition to the purpose of safety analysis, consider the set of unsafe states, such as collisions. Then we have the following definition for the almost safe set.
Definition 2.
The problem of interest for this paper is then formally presented as follows.
Problem 1.
[The FiniteSampling Operable Domain Quantification Problem] Given and a group of observed states , the finitesampling operable domain quantification problem seeks an algorithm that identifies a certain set and such that with confidence level of at least , the SV is almost safe in by Definition 2.
The above problem formulation is fundamentally different from many of the existing CAV safety metrics mentioned in Section I. The desired set is the specific operable domain within which the SV is expected to operate safely. The probability coefficient quantifies the statistical potential of the SV’s safety performance in that set. The next section discusses details of the proposed algorithm that solves the aforementioned problem.
III Finite-Sampling Operable Domain Quantification
The proposed solution to Problem 1 follows a two-step procedure including (i) set construction and (ii) set validation. First, the set construction step seeks to construct a certain set from the observed states. Second, as one replays the data, one can then validate the almost forward invariance property of the constructed set through consecutively observing transitions among states in that set. The derived probability coefficient also relies on the given confidence level defined in Problem 1. For the remainder of this section, we address the two steps in Section III-A and Section III-B, respectively. The complete algorithm is summarized in Section III-C.
III-A Set Construction with α-Shape and Coverage Measures
For the safety evaluation purpose, the constructed set is expected to cover all potentially safe points. The set of these points is obtained through Algorithm 1. A series of methods is then proposed to formally construct a set that characterizes the shape and the coverage information of the potentially safe points.
Note that Reachable returns all vertices on the graph that connect, directly or indirectly, to the given point. In practice, this is achieved through a standard depth-first search (DFS) routine. Moreover, add and remove are both notation functions, where add adds a given edge to the graph and remove removes a given set of vertices from the graph.
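The graph operations named above can be sketched as follows. This is only one plausible reading of the routine and not a transcription of Algorithm 1: it assumes that states are hashable (e.g., tuples), that every observed transition becomes a directed edge, and that a state is discarded whenever some unsafe state can be reached from it. The function names and the `is_unsafe` predicate are hypothetical.

```python
from collections import defaultdict


def reachable(predecessors, target):
    """All vertices with a directed path to `target`, including `target`.

    Iterative depth-first search over reversed edges.
    """
    seen, stack = {target}, [target]
    while stack:
        vertex = stack.pop()
        for parent in predecessors[vertex]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def potentially_safe_states(transitions, is_unsafe):
    """Build the transition graph and drop every state that leads to failure."""
    predecessors = defaultdict(set)
    vertices = set()
    for state, next_state in transitions:
        predecessors[next_state].add(state)  # "add": one edge per transition
        vertices.update((state, next_state))
    doomed = set()
    for vertex in vertices:
        if is_unsafe(vertex) and vertex not in doomed:
            doomed |= reachable(predecessors, vertex)
    return vertices - doomed  # "remove": discard all doomed vertices


demo = [("a", "b"), ("b", "c"), ("c", "x"), ("d", "e"), ("e", "d")]
print(sorted(potentially_safe_states(demo, lambda s: s == "x")))  # ['d', 'e']
```

In the toy graph, the states a, b, and c are removed because each of them eventually transitions into the unsafe state x, while the cycle between d and e survives.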
We are now ready to construct the potentially safe set from the points obtained through Algorithm 1. In this paper, we adopt the α-shape [akkiraju1995alpha] to characterize the shape of the desired set. The following definition is standard [alphashape2011].
Definition 3.
Consider a finite set of points . Let an α-ball be an open ball with radius α. Let be a simplex for some such that . A simplex is α-exposed if there exists an empty α-ball with . An α-shape, , of the set satisfies and
(4) 
i.e., the boundary of the α-shape consists of all simplices of for which are α-exposed.
It follows that the α-shape reduces to the point set itself as α approaches zero and to the ordinary convex hull of the point set as α approaches infinity. The α-shape of a finite point set is uniquely determined by the point set and α. For any given α, the time complexity of the corresponding α-shape determination algorithm is given in [akkiraju1995alpha] as a function of the number of points in the set. In practice, one may also require a certain preferred shape, such as a single polytope that wraps the point set with the smallest cardinality. This implies a certain cost function of the α-shape with the optimal cost determined by the preferred shape. The optimization is typically performed through a logarithmic search of α-shapes by modifying the lower and upper bounds of the tested α until the gap between the two bounds becomes sufficiently small [kengithub]. However, with this method the computational complexity also increases.
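The bound-shrinking search over α can be illustrated with a plain bisection. The sketch below makes two assumptions that are not stated in the text: α is the ball radius as in Definition 3 (so a larger α gives a looser shape), and the property "a single polytope wraps all points" is monotone in α. The predicate `wraps_all_points` is a hypothetical stand-in for building an α-shape and checking the preferred property; other toolboxes may parameterize α differently.

```python
def smallest_acceptable_alpha(wraps_all_points, lower, upper, tol=0.1):
    """Bisection for the smallest alpha accepted by `wraps_all_points`.

    Assumes the predicate is False for small alpha and True for large alpha,
    and that it holds at the initial upper bound.
    """
    if not wraps_all_points(upper):
        raise ValueError("the upper bound must already satisfy the predicate")
    while upper - lower > tol:
        middle = 0.5 * (lower + upper)
        if wraps_all_points(middle):
            upper = middle  # still acceptable: tighten from above
        else:
            lower = middle  # too tight: raise the lower bound
    return upper


# Toy predicate: pretend the shape becomes a single polytope once alpha >= 3.3.
alpha = smallest_acceptable_alpha(lambda a: a >= 3.3, lower=0.0, upper=10.0)
print(alpha)
```

Each iteration halves the bracket, so the number of α-shape constructions grows only logarithmically with the ratio between the initial bracket width and the tolerance, although every construction still pays the full cost of the α-shape algorithm.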
Remark 2.
In the previous literature on scenario-sampling almost safe set validation [weng2021towards, weng2021formal], the covering set is adopted to characterize the set construction. However, as indicated by Problem 1, the data set is presented as it is in this study, and one cannot control the scenario sampling to modify the testing procedure or to add more observations for analysis. For sparse data sets, the covering set tends to have a significant over-approximation. Moreover, the covering set of a finite set is not unique for nonzero values of its parameter. As a result, the α-shape is a more flexible solution to handle various levels of data sparsity, with a uniquely determined solution for a given finite set.
For this paper, the finite set being considered is derived from Algorithm 1, and the α-shape is constructed from that set. Finally, to characterize the coverage performance of the constructed set, we also adopt the following two measures, density and occupancy, as
(5) 
Note that there also exist other coverage indicators such as the index of dispersion [selby1965index] and the star discrepancy [dang2008sensitive]. However, the index of dispersion is not directly applicable in this paper, as the data are not all positive. The star discrepancy is not selected for computational complexity concerns. More representative coverage metrics are of future interest.
III-B Finite-Sampling Almost Robustly Forward Invariant Set Validation
We are now ready to characterize how safe the SV is in . Suppose the validation is executed online with a single SV. The data acquisition of and its corresponding are thus collected following a particular time sequence w.r.t. the same SV. At a certain step, if one starts consecutively observing transitions that start and stay inside until the end of the test, one then has statistical evidence to claim the robustly forward invariance property of by Definition 1. This is presented as a validation routine in Algorithm 2.
However, the above-described online procedure is no longer applicable for a fleet of SVs deployed in the nominal driving environment test simultaneously. Moreover, a safety metric is primarily used to analyze the safety performance of a system in a post-processing manner. That is, one replays the data set following a certain order of all its elements. For statistical inference, as the initializations of all transition pairs are i.i.d. w.r.t. the underlying distribution, the data set can thus be replayed in any order. In particular, the replay is formally specified as follows.
Definition 4.
Consider the domainspecific finite set and the corresponding set of all state transitions as presented in Section II. The replay of , , is a permutation of , i.e., a certain rearrangement of all elements in .
It is immediate that the total number of possible replays equals the factorial of the number of state transitions. As long as the probability for each replay order to occur remains the same (i.e., the reciprocal of that total), the set of initializations of all transition pairs remains i.i.d. w.r.t. the same underlying distribution. We can then formally justify the safety performance of the SV in the constructed set through the following theorem.
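A replay in the sense of Definition 4 can be sketched as a uniformly random permutation of the transition pairs. The helper `trailing_safe_run` below reflects one reading of the validation routine, namely counting the transitions that start and stay inside the set consecutively until the end of the replay; this reading, the function names, and the membership predicate `in_set` are assumptions for illustration.

```python
import math
import random


def replay(transitions, rng):
    """Return one replay: a random reordering of all transition pairs."""
    order = list(transitions)
    rng.shuffle(order)  # every ordering is (approximately) equally likely
    return order


def trailing_safe_run(replayed, in_set):
    """Number of final consecutive transitions that start and stay in the set."""
    count = 0
    for state, next_state in reversed(replayed):
        if in_set(state) and in_set(next_state):
            count += 1
        else:
            break
    return count


pairs = [((0,), (5,)), ((1,), (2,)), ((2,), (3,)), ((3,), (1,))]
one_replay = replay(pairs, random.Random(0))
print(math.factorial(len(pairs)))  # 24 possible replays of four transitions
print(trailing_safe_run(pairs, lambda s: s[0] < 4))  # 3
```

In the original order of `pairs`, only the first transition leaves the toy set, so the last three transitions form the trailing run.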
Theorem 1.
[Almost Robustly Forward Invariance Validation] Consider , , the domainspecific finite set and the corresponding set of all state transitions . Let be the set of potentially safe states extracted from through Algorithm 1. Let be the shape of as specified by Definition 3. For a certain replay of denoted by the index as , let . Then, we have that is almost robustly forward invariant with confidence level , and
(6) 
Moreover, is expected to be almost robustly forward invariant with confidence level and
(7) 
As , is also an almost safe set.
Proof.
For any fixed choice of the confidence level and a particular data replay, the proof of the almost robustly forward invariance property is a direct outcome of Theorem 2 in [weng2021towards]. Furthermore, consider the replay-dependent quantity as a random variable, where the occurrence probability for each replay is the same. The expected value is thus obtained in the form of (7). Finally, the almost safe property is a direct outcome of Definition 2. ∎
III-C Finite-Sampling Operable Domain Quantification Algorithm
So far, we have presented the set construction and set validation steps. The complete algorithm that tackles Problem 1 is summarized in Algorithm 3 and is also conceptually illustrated in Fig. 2.
Note that the derivation of the expected probability coefficient is slightly different from Theorem 1, as replays sharing the same value are grouped together to improve the computational performance. The density and occupancy features can also be derived through (5). Given the finite set and the selected α, the α-shape is determined by the standard α-shape algorithm [akkiraju1995alpha, kengithub]. In practice, we also use the logarithmic search scheme discussed in Section III-A to determine the appropriate α. Implementation details will also be discussed in Section IV.
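The grouping of replays that share the same value can be sketched with a counter, so that the per-replay coefficient is evaluated once per distinct value instead of once per replay. The bound that maps a replay to its coefficient is the one stated in Theorem 1; the closed form `zero_failure_coefficient` below is only an assumed stand-in of the zero-failure type, and all function names are hypothetical.

```python
from collections import Counter


def expected_coefficient(run_lengths, coefficient_of):
    """Average of coefficient_of(n) over replays, grouped by run length n."""
    groups = Counter(run_lengths)  # run length -> number of replays
    total = sum(groups.values())
    weighted = sum(count * coefficient_of(n) for n, count in groups.items())
    return weighted / total


def zero_failure_coefficient(n, beta=0.001):
    """Assumed stand-in: smallest eps with (1 - eps) ** n <= beta."""
    return 1.0 if n == 0 else 1.0 - beta ** (1.0 / n)


# Run lengths observed over six sampled replays (toy numbers).
lengths = [40, 40, 40, 120, 120, 0]
print(expected_coefficient(lengths, zero_failure_coefficient))
```

With many replays and few distinct values, the grouping reduces the number of coefficient evaluations from the number of replays to the number of distinct values.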
We conclude this section by emphasizing that the graph and the obtained α-shape carry more than coverage and forward invariance information. The graph induces state transitions that could be used for other safety-related applications such as fault tree analysis with backtracking process algorithms [hejase2020methodology, capito2021bpa] and information gain justification [collin2021plane]. The states can also be associated with other safety features available from the raw data such as human driver engagement (e.g., a human may tend to engage within a certain subset of the obtained covering set) and ADAS/ADS signals (e.g., the forward collision warning may only be triggered in a certain subregion). Existing driving data sets collected from real-world tests and simulators are not comprehensive enough to provide the aforementioned features. Hence, considering those features is regarded as future work.
IV Case Study
To demonstrate the performance of the proposed safety metric, a series of cases are studied in this section. Detailed configurations are summarized in Table I and explained as follows.
Case  HighD data  Waymo open data set  SUMO  Carla  NCAPAEB  






SV Driver  human 

lane change heuristics 


driver  IDM_0  IDM_1  IDM_0  IDM_1  
SV Type  Car  Truck  Car  Car  Car  Car  




Car 

Car 
HighD data set
The HighD data set [krajewski2018highd] is a data set of naturalistic vehicle trajectories recorded on German highways. The data set comes with a mixture of car and truck drivers operating on straight-road highway segments. It is a well-known fact that naturalistic driving behavior exhibits statistical consensus in general, but also with discrepancies that depend on the vehicle type. This inspires our study in this section, which analyzes the human driver safety performance w.r.t. different vehicle types and different ODDs.
Waymo open data
The Waymo open data [sun2020scalability] used in this study is the motion data set, which is primarily used for training and validating traffic motion prediction algorithms. In this study, we redirect the data set to the safety analysis purpose by taking advantage of the motion trajectories recorded for Waymo’s self-driving car (SDC) and the surrounding mixed-traffic road users, especially vehicles and pedestrians.
SUMO
The Simulation of Urban MObility (SUMO) [SUMO2018] is an open source, microscopic, and continuous multi-modal traffic simulator. In this study, we compare two parametric self-driving algorithms with the main difference being the Intelligent Driver Model (IDM) hyperparameters, referred to as IDM_0 and IDM_1. The simulated traffic is created with a variety of vehicles of different dynamics and self-driving configurations operating in a mapped environment with a mixture of highway and urban roads. A fleet of 20 SVs with each parametric policy is then deployed along with the simulated traffic.
Carla
The simulated mixed-traffic environment in Carla [dosovitskiy2017carla] involves a fleet of vehicles driven by the default autopilot algorithm along with randomly deployed other vehicles and pedestrians.
NCAP-AEB
The SV algorithms adopted to create this data set are the same as those in the SUMO case. Two parametric IDM algorithms are deployed in a simulated straight-road segment with the lead principal other vehicle (POV) executing the testing policy specified in the NCAP Autonomous Emergency Braking (AEB) car-to-car test program [van2017euro]. The program involves 48 testing scenarios with each scenario executed once.
Remark 3.
To differentiate between IDM_0 and IDM_1, IDM_0 is parameterized with a stronger braking capability but is less willing to take extreme maneuvers for collision avoidance (due to a smaller minimum safe distance and a smaller safe time headway). The IDMs used in NCAP-AEB and SUMO mostly share similar specifications. However, the SUMO simulator also includes other hyperparameters that may affect the performance, such as the perfectness and the lateral lane change heuristics.
Remark 4.
As a purely data-driven approach, the safety performance evaluations obtained throughout this section are only based on the given data, assuming the collected data points are i.i.d. w.r.t. the distribution in the nominal driving environment. This is generally true in simulator-based tests such as in Carla and SUMO, but is not necessarily valid for real-world driving data sets, as the data processing details are largely unknown. As a result, the claimed safety performance from the HighD data set and the Waymo open motion data set does not necessarily represent the corresponding SVs’ actual safety performance.
Before proceeding to the domainoriented safety evaluation outcomes, we first emphasize some featured observations:

The safe operable domain of a certain SV is a joint outcome of the SV’s own driving behavior, the other dynamic road users’ behavior, and the test environment.

Within the same case study (i.e., the same testing behavior and environment), it is in general inaccurate to claim that one SV is safer than another unless the outcome concurs among all features, i.e., a small metric value, a large density, and a large occupancy.

Comparing the proposed safety metric with the statistical fatality rate inference [fraade2018measuring], given the same confidence level and the same data set, the value of the proposed metric is significantly smaller in magnitude than the fatality rate value. That is, the inferred fatality rate metric tends to overestimate the risk, especially when the collected finite states are clustered in a specific subdomain of the nominal driving environment. Meanwhile, the operational domain specific nature of the proposed metric helps establish a more precise safety performance assessment.
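The second observation above amounts to a dominance check: one SV is declared safer only when every feature agrees. The following is a minimal sketch of that rule, not the authors' implementation. The names `Features` and `safer_of` are hypothetical, and the example values assume that the last three numeric columns of Table II are, in order, the metric value, the density, and the occupancy, which the extracted table headers do not confirm.

```python
from typing import NamedTuple, Optional


class Features(NamedTuple):
    metric_value: float  # assumed: smaller indicates safer behavior
    density: float       # assumed: larger indicates safer behavior
    occupancy: float     # assumed: larger indicates safer behavior


def safer_of(a: Features, b: Features) -> Optional[str]:
    """Return "a" or "b" only if all three features concur, otherwise None."""
    a_dominates = (a.metric_value < b.metric_value
                   and a.density > b.density
                   and a.occupancy > b.occupancy)
    b_dominates = (b.metric_value < a.metric_value
                   and b.density > a.density
                   and b.occupancy > a.occupancy)
    if a_dominates:
        return "a"
    if b_dominates:
        return "b"
    return None  # features disagree: no safety ranking is claimed


# Values copied from the SUMO and NCAP-AEB rows of Table II
# (column interpretation is an assumption, see the lead-in).
sumo_idm0 = Features(0.7717, 1.752, 0.2619)
sumo_idm1 = Features(0.5121, 2.009, 0.3265)
ncap_idm0 = Features(13.0199, 0.832, 0.1960)
ncap_idm1 = Features(4.9077, 2.464, 0.0568)

print(safer_of(sumo_idm0, sumo_idm1))
print(safer_of(ncap_idm0, ncap_idm1))
```

Under these assumptions the SUMO pair yields a clear ranking, while the NCAP-AEB pair does not, because the occupancy disagrees with the other two features.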

Case | SV | Safe Distance (km) | 1-R (C=0.999) | TTC (s), mean ± std | TTC Valid Rate | | |
NCAP-AEB | IDM_0 | N/A | N/A | 1.368 ± 1.051 | 0.964 | 13.0199 | 0.832 | 0.1960
NCAP-AEB | IDM_1 | N/A | N/A | 1.229 ± 2.223 | 0.244 | 4.9077 | 2.464 | 0.0568
SUMO | IDM_0 | 5725.99 | 0.0019 | 8.844 ± 0.685 | 0.378 | 0.7717 | 1.752 | 0.2619
SUMO | IDM_1 | N/A | N/A | 8.581 ± 1.288 | 0.447 | 0.5121 | 2.009 | 0.3265
HighD | Car | 3276.48 | 0.0034 | 8.871 ± 0.666 | 0.507 | 0.5732 | 7.675 | 0.5807
HighD | Truck | 551.81 | 0.0199 | 8.951 ± 0.413 | 0.442 | 2.9755 | 1.968 | 0.4418
Finally, note that the selection of for the set construction follows the procedure described in Section III-A, with the initial lower and upper bounds of set to and , respectively. The search terminates once the best shape that wraps in a single polytope is found (the termination threshold is 0.1). That is, to a certain extent, the proposed algorithm finds not only the almost safe operable domain but also the optimal almost safe operable domain. For the high-dimensional ODD analysis with a significantly large data set, such as the multi-vehicle interaction domain with the HighD data set, we also implement a hierarchical k-means clustering routine that divides the data points into clusters until every cluster is smaller than a preset threshold. The final shape is then determined by combining all shapes derived from the obtained clusters. Throughout this section, we also have . As a result, all obtained values are derived with a confidence level of at least 0.999.
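The clustering step described above can be sketched as a recursive bisection. This is a minimal illustration under stated assumptions, not the routine used for the reported results: it uses plain Lloyd iterations with two centers per split, the names `two_means` and `hierarchical_split` are hypothetical, and the per-cluster shape construction and the final merging of shapes are omitted.

```python
import numpy as np


def two_means(points, rng, n_iter=20):
    """Lloyd iterations with k=2; returns a boolean mask selecting the second cluster."""
    centers = points[rng.choice(len(points), size=2, replace=False)]
    labels = np.zeros(len(points), dtype=bool)
    for _ in range(n_iter):
        d0 = np.linalg.norm(points - centers[0], axis=1)
        d1 = np.linalg.norm(points - centers[1], axis=1)
        labels = d1 < d0
        if labels.all() or not labels.any():
            break  # degenerate assignment, handled by the caller
        centers = np.stack([points[~labels].mean(axis=0),
                            points[labels].mean(axis=0)])
    return labels


def hierarchical_split(points, threshold, seed=0):
    """Bisect clusters until each one has fewer than `threshold` points."""
    if threshold < 2:
        raise ValueError("threshold must be at least 2")
    rng = np.random.default_rng(seed)
    pending, done = [np.asarray(points, dtype=float)], []
    while pending:
        cluster = pending.pop()
        if len(cluster) < threshold:
            done.append(cluster)
            continue
        mask = two_means(cluster, rng)
        if mask.all() or not mask.any():
            # Fallback for duplicated points: split the cluster in half by index.
            mask = np.arange(len(cluster)) >= len(cluster) // 2
        pending.extend([cluster[~mask], cluster[mask]])
    return done


# Synthetic stand-in for a large set of operational states.
demo_points = np.random.default_rng(1).normal(size=(1000, 3))
demo_clusters = hierarchical_split(demo_points, threshold=200)
print(len(demo_clusters), max(len(c) for c in demo_clusters))
```

Every split produces two non-empty parts, so the loop terminates, and the number of points is preserved across the resulting clusters.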
IV-A The Lead-Vehicle Following Domain
We start with the lead-vehicle following domain and analyze three cases: the HighD data set, a SUMO-based simulation, and a customized simulation executing the NCAP-AEB car-to-car testing procedure. For the HighD data set, we define with , and and consider all vehicles in all lanes. The extracted trajectories are further classified into two categories determined by the SV's type (car or truck). For the SUMO simulation, we consider with , and . For the NCAP-AEB case, the testing procedure defines the as , and . The experiment results are summarized in Table II and illustrated in Fig. 3.
In Table II, the safe travel distance and the inferred fatality rate are not available (N/A) for some of the cases, as the corresponding data include collisions. The TTC is presented as the average value and the standard deviation over all admissible states, and all TTC values are clipped at 9 seconds. A TTC is valid if it is positive. The TTC valid rate is the ratio between the number of time steps with a valid TTC and the total number of time steps in the data. Within each case and each column, the bold font emphasizes the value that indicates the higher-risk driving behavior (e.g., small average TTC and small shape occupancy). We further emphasize some observations as follows.
First, the comparison between SUMO and NCAP-AEB presents an interesting case of the same set of SVs tested with different testing policies induced by the traffic vehicle behavior. From the statistical summary in Table II, the TTC-based evaluation considers IDM_1 more dangerous in both the NCAP-AEB and the SUMO cases, yet the valid rates are different. On the other hand, the proposed operational domain specific and unbiased safety evaluation considers IDM_1 the generally safer behavior because it exhibits a higher probability of remaining inside the operable domain dictated by the shape (a small metric value) with a higher density. However, both the shapes illustrated in Fig. 2(a) and Fig. 2(b) and the occupancy values in Table II show that the safe operable domains obtained from the two cases (SUMO and NCAP-AEB) are different. Recalling Remark 3, IDM_0 is less willing to execute collision avoidance maneuvers. This deficiency is more pronounced in the NCAP-AEB case, which has a more aggressive lead-vehicle driving behavior than the SUMO case. IDM_1 thus ends up with a relatively smaller safe operable domain than IDM_0 in the NCAP-AEB case, whereas in the SUMO case IDM_1's safe operable domain is larger. In summary, IDM_1's willingness to brake leads to a safer behavior than IDM_0 in the normal driving environment, yet it also confines IDM_1 to a smaller safe operable domain in the NCAP-AEB case, which is more biased towards the falsification purpose with the other traffic behaving relatively aggressively.
Second, for the HighD case, the proposed metric mostly agrees with the mileage-based fatality rate measure and identifies the naturalistic behavior of the class of truck drivers as more dangerous than that of the class of car drivers. This contradicts the TTC-based metric, as the class of car drivers exhibits a smaller average TTC. The discrepancy is consistent with the well-known deficiency of TTC also reported in the literature [weng2020model, weng2021model].
Finally, throughout all the cases, given the same confidence level, the values from the proposed metric all exhibit a significantly smaller magnitude than the fatality rate value (mostly ten thousand times smaller). Fundamentally, this occurs because the inferred fatality rate from [fraade2018measuring] does not have an explicitly defined operable domain. As a result, one would require a larger data set to obtain a similar level of probability to that of our proposed domain-specific metric.
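The TTC statistics reported in Table II (clipping at 9 seconds, the positivity-based validity rule, and the valid rate) can be sketched as follows. This is an illustration under assumptions rather than the code behind the table: the TTC is taken as the bumper-to-bumper gap divided by the closing speed, the mean and standard deviation are computed over valid time steps only, and the function name `ttc_summary` and its arguments are hypothetical.

```python
import numpy as np

TTC_CLIP_S = 9.0  # clipping value stated in the text


def ttc_summary(gap_m, sv_speed_mps, lead_speed_mps):
    """Return mean, std, and valid rate of the TTC for a lead-following trace."""
    gap = np.asarray(gap_m, dtype=float)
    closing = (np.asarray(sv_speed_mps, dtype=float)
               - np.asarray(lead_speed_mps, dtype=float))
    ttc = np.full(gap.shape, np.inf)
    approaching = closing > 0          # the SV is closing in on the lead vehicle
    ttc[approaching] = gap[approaching] / closing[approaching]
    valid = np.isfinite(ttc) & (ttc > 0)
    clipped = np.minimum(ttc[valid], TTC_CLIP_S)
    return {
        "mean": float(clipped.mean()) if valid.any() else float("nan"),
        "std": float(clipped.std()) if valid.any() else float("nan"),
        # valid time steps divided by all time steps in the data
        "valid_rate": float(valid.mean()),
    }


# Four time steps: two approaching (one of them beyond the clip), two not.
demo_ttc = ttc_summary(gap_m=[20.0, 10.0, 30.0, 100.0],
                       sv_speed_mps=[10.0, 10.0, 10.0, 10.0],
                       lead_speed_mps=[5.0, 12.0, 10.0, 9.0])
print(demo_ttc)
```

In this toy trace the raw TTC values of the two approaching steps are 4 s and 100 s, so after clipping the summary is computed from 4 s and 9 s, and half of the time steps are valid.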
IV-B Multi-Vehicle Interaction and Vehicle-to-Pedestrian Interaction
SV | Safe Distance (km) | 1-R (C=0.999) | (C=0.999) | |
Car | 536.895 | 0.0205 | 0.0004 | 0.1507 | 0.3505
Truck | 168.042 | 0.0640 | 0.0012 | 0.3461 | 0.0441
This section starts with a case study of the HighD case with the multi-vehicle interaction domain. As the driving environment in the HighD case consists of only straight-road segments, the domain specification shown in Fig. 1 directly applies with . The domain extraction for the left and right regions is confined to the adjacent lanes near the SV's lane and also excludes some SV lanes with light traffic on the side (mostly with lane ID 5). The results are summarized in Table III and Fig. 4.
For the statistically inferred fatality rate, the truck is considered more dangerous because of its short safe travel distance. Under the proposed method, the truck is also considered more dangerous, with a large metric value and a large occupancy value, yet the point density is also large. Moreover, comparing the center column subplots between Fig. 3(a) and Fig. 3(b), at least within the inspected SV velocity range, the car and the truck share similar lead-following distances (indicated by the bottom of the front-center subplots in green) and rear-following distances (indicated by the bottom of the rear-center subplots in purple). That is, the other traffic vehicles do not stay further away from a truck than they do from a car, nor do trucks maintain a longer following distance from the lead vehicle than cars. This contradicts the common intuition about naturalistic driving behavior in the real world. Finally, comparing the first column with the third column in all three subplots in Fig. 4, one observes that vehicles on the left typically travel at a faster speed than the SV. This aligns with the nature of the data set, as the HighD data set is primarily collected from German highways. This observation may not hold in other driving environments, as we shall soon demonstrate.
For the Waymo case and the Carla case, where pedestrian information is available, the vehicle-pedestrian interaction domain is also involved. Although both cases involve a variety of driving environments including highways, urban roads, intersections, and roundabouts, the domain definition considers them as unknown disturbances and uncertainties. The specifications shown in Fig. 1 still apply. For both cases, we have . The side subregion falls between the lateral offsets of and from the SV's geometric center. The results are summarized in Table IV. Fig. 5 illustrates the multi-vehicle interaction domain subspace analysis, where the vehicle-pedestrian states (the last 4 dimensions in the 17-dimensional OSS) are ignored. The vehicle-to-pedestrian interaction is analyzed separately in Fig. 6.
SV | Safe Distance (km) | 1-R (C=0.999) | (C=0.999) | |
Waymo | 40.778 | 0.2386 | 8.8567 | 1.4591 | 0.0096
Carla | 399.195 | 0.0275 | 0.8060 | 18.3606 | 0.0118
Within the SV velocity range of (m/s) (see the center column of Fig. 4(a) and Fig. 4(b)), the Waymo-SDC is more conservative, as it maintains a longer following distance. Moreover, the observation also generalizes to the vehicle-pedestrian interaction domain, where the Carla-Autopilot exhibits a shorter vehicle-pedestrian distance over a large velocity range (see Fig. 6). However, note that these comparisons are not necessarily fair, as the driving environments are essentially different.
In comparison with the HighD case, the two analyzed cases have poorer coverage performance, and vehicles mostly operate in a low speed range given that the driving environments are different. The observation from the HighD case that vehicles on the left run faster is no longer valid, as illustrated in Fig. 5. The advantage of a domain-specific safety analysis can also be seen through the Waymo case in Table IV. Limited by the data availability, the total safe travel distance for the Waymo-SDC is short, leading to a large fatality rate (0.2386). In contrast, the value of the proposed metric is much smaller at the same confidence level.
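The dependence of the inferred fatality rate on the safe travel distance can be made concrete with a short calculation. The section does not restate the formula, so the following is a sketch under an assumption: the zero-failure reliability bound R = (1 - C)^(1/n), with n the failure-free distance in miles and C the confidence level. With the tabulated distances converted from kilometers to miles, this assumed form appears to agree with the 1-R entries of Tables II to IV at the printed precision. The function name `inferred_failure_rate` is hypothetical.

```python
import math

KM_PER_MILE = 1.609344


def inferred_failure_rate(safe_distance_km, confidence=0.999):
    """1 - R for a failure-free distance, with R = (1 - C)^(1/n) and n in miles (assumed)."""
    miles = safe_distance_km / KM_PER_MILE
    reliability = (1.0 - confidence) ** (1.0 / miles)
    return 1.0 - reliability


# Safe travel distances (km) copied from Tables II, III, and IV.
table_distances_km = {
    "SUMO IDM_0 (Table II)": 5725.99,
    "HighD Truck (Table II)": 551.81,
    "HighD Car (Table III)": 536.895,
    "Waymo (Table IV)": 40.778,
    "Carla (Table IV)": 399.195,
}
for name, km in table_distances_km.items():
    print(f"{name}: {inferred_failure_rate(km):.4f}")
```

Under this assumed form, a short failure-free distance alone drives the inferred rate up, which is why the roughly 41 km of Waymo data yields a much larger value than the roughly 400 km of Carla data.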
V Conclusion and Discussions
This paper has presented a novel safety metric that is operational domain specific and provably unbiased for the performance evaluation of ADS/ADAS-equipped CAVs, involving the shape and the almost robustly forward set invariance property. The performance of the proposed method is demonstrated over several commonly encountered and challenging ODDs with a variety of data sets collected at different fidelity levels. It is shown, both provably and empirically, to be more accurate than many leading measures as well as observed and predictive lagging safety measures. In comparison with the inferred fatality rate, the domain-specific nature also yields a more precise safety assessment.
As discussed in Section III, it is of future interest to expand the almost safe set with richer information related to the dynamic modeling, engagement information, and other safety-related features. It is also of practical value to explore more efficient algorithms for deriving the (optimal) shape.