
Autonomous Vehicle Benchmarking using Unbiased Metrics

by David Paz, et al.

With the recent development of autonomous vehicle technology, there have been active efforts to deploy this technology at different scales, including urban and highway driving. While many of the prototypes showcased have been shown to operate under specific conditions, little effort has been made to better understand their shortcomings and their generalizability to new areas. Distance, uptime, and the number of manual disengagements performed during autonomous driving provide a high-level idea of the performance of an autonomous system, but without proper data normalization, testing location information, and the number of vehicles involved in testing, disengagement reports alone do not fully capture system performance and robustness. Thus, in this study a complete set of metrics is proposed for benchmarking autonomous vehicle systems in a variety of scenarios, and these metrics can be extended for comparison with human drivers. These metrics have been used to benchmark UC San Diego's autonomous vehicle platforms during early deployments for micro-transit and autonomous mail delivery applications.



I Introduction

Autonomous vehicle technology has been under active development for at least 30 years [1] [2] [3] [4]. Since the technology was first conceived [5], a wide range of applications has been explored, from micro-transit to highway driving, and more recently the technology has started to become commercialized. Across this variety of use cases, one important topic is safety. This has received the attention of state officials, and in many cases regulations and policies have been imposed.

In some states, the Department of Motor Vehicles (DMV) requires each entity testing on public roads to submit a summary of disengagement reports, which gives a better understanding of the number of annual interventions each self-driving car entity generates. In California, the DMV requires autonomous vehicle companies with a valid testing permit to submit annual reports with a summary of system disengagements. At the time of writing, 66 tech entities hold a valid autonomous vehicle testing permit and only one holds a driverless testing permit.

Even though many of these reports include enough information to estimate the number of disengagements performed in an entire year, most of the publicly available disengagement reports are not time and distance normalized: did the vehicle experience five disengagements over the course of 10 miles or 10,000 miles? And did it experience five disengagements over the course of 10 minutes or 2,000 hours?

Given the lack of spatiotemporal information, in many cases, these unnormalized reports make it impossible to quantify the performance and robustness of the autonomous systems and most importantly quantify their overall safety. As a result, data normalization is required to characterize autonomous vehicle system performance in order to be compared with human driver performance and analyze safety statistics as a whole.

This study aims to shed light on autonomous system technology performance and safety by introducing a set of metrics and tools geared towards benchmarking Level 3 to Level 5 autonomous vehicle systems.

With the methods introduced in this study, our team plans on open sourcing an online tool for autonomous vehicle benchmarking to encourage autonomous vehicle entities to report their data in order to objectively quantify system safety and long term autonomy capabilities.

II Related Work

The area of autonomous vehicle benchmarking has remained relatively unexplored. Prior related work on benchmarking sheds light on performance measures for intelligent systems in off-road and on-road unmanned military applications [9]. While the performance measures proposed may serve certain unmanned military applications, autonomous vehicle applications on public roads often require safety drivers to ensure that the vehicles do not behave erratically and endanger road users when failure cases arise.

With road user safety and failure cases in mind, [10] focuses on estimating the number of miles a self-driving vehicle would have to be driven autonomously in order to demonstrate its reliability relative to human drivers and prove its safety. This study specifically shows that self-driving vehicles will take tens to hundreds of years to demonstrate considerable reliability over human drivers with respect to fatalities and injuries. This naturally leads to the question: how can the progress of autonomous vehicles in the meantime be measured objectively?

While certain self-driving car entities have identified the flaws with current disengagement data reported by the DMV [11] [12] [13], to the best of our knowledge, our team is the first to make objective comparisons of autonomous systems by studying their long term autonomy implications using real autonomous vehicle data collected from diverse and realistic urban scenarios.

III Metrics

In this section, the metrics and tools used to benchmark an autonomous vehicle during a four-month study at UC San Diego are defined with the goal of fully characterizing the performance of the systems over time.

III-A Direct System Robustness Characterization

For direct system robustness characterization, the metrics of choice are Mean Distance Between Interventions (MDBI) and Mean Time Between Interventions (MTBI). These metrics provide a normalized means of benchmarking system robustness over time by including temporal and spatial information. By definition, these statistics can be computed as shown in Equations 1 and 2.
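Read directly from the metric names, Equations 1 and 2 presumably take the following form. The symbols $D$ (total distance traveled), $T$ (total uptime), and $N$ (number of interventions) are assumptions introduced here for illustration and are not taken from the original notation.

```latex
\begin{align}
\mathrm{MDBI} &= \frac{D}{N} \tag{1} \\
\mathrm{MTBI} &= \frac{T}{N} \tag{2}
\end{align}
```

For example, a vehicle that covers 10 km with 25 interventions would have an MDBI of 400 m per intervention under this reading.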


While the definitions for MDBI and MTBI are direct, the measurements for distance, uptime and the number of interventions require the data to be separated into two different categories: the first corresponds to the time elapsed and distance traveled in autonomous mode and the second to the time elapsed and distance traveled during manual driving. By separating these into two different sets of data, the effective system robustness can be measured with regard to its dependence on a safety driver if one or more manual interventions are performed. Nevertheless, in order to separate the manual and autonomous data, vehicle disengagement information must be recorded as a function of time, as close to real time as possible; this can be visualized in Figure 1.

Fig. 1: Enable/disable signal as a function of time.

In the figure, manual and autonomous driving segments are represented by orange and blue colors, respectively, where the separation is given by an intervention or a system re-enable signal. Given that a manual intervention could be performed for an arbitrary length of time, it is important to accurately measure the disengagement signals in real time by associating them with a system timestamp. While these measurements can be performed by manual annotation, this introduces human error. Therefore, in the measurements performed in this study, each autonomous vehicle was retrofitted with a logging device that records the enable and disable signals over time by using system timestamps. This device operates in an encapsulated environment and records serialized data for vehicle pose, speed, and enable signals, as well as their corresponding timestamps. Given this data, the time elapsed between a disengagement and a re-enable signal can be measured by the difference in timestamps. On the other hand, two methods can be employed for measuring the distance traveled between any two given timestamps $t_i$ and $t_j$, where $t_i < t_j$, as shown in Equations 3 and 4, where the vehicle pose at time $t$ is given by $p(t)$ and speed is given by $v(t)$. For the measurements performed in this study, Equation 3 was used for estimating distance, given that vehicle pose estimates are provided with a high degree of precision by the LiDAR-based Normal-Distributions Transform localization algorithm [7]. The devices used for these measurements are also introduced in our previous work on the lessons learned from deploying autonomous vehicles [6], and a high-level description is provided in the next section.
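The two distance estimates can be sketched as follows, assuming that the pose-based method sums straight-line steps between consecutive poses and that the speed-based method integrates speed over time. The function names, the 2D pose tuples, and the trapezoidal rule are illustrative assumptions, not the authors' implementation.

```python
import math


def distance_from_poses(poses):
    """Sum of straight-line steps between consecutive (x, y) poses, in meters."""
    return sum(math.dist(p, q) for p, q in zip(poses, poses[1:]))


def distance_from_speed(timestamps, speeds):
    """Trapezoidal integration of speed (m/s) over time (s), in meters."""
    total = 0.0
    for i in range(1, len(timestamps)):
        dt = timestamps[i] - timestamps[i - 1]
        total += 0.5 * (speeds[i] + speeds[i - 1]) * dt
    return total


# Hypothetical samples: constant 2 m/s along the x axis for 3 seconds.
timestamps = [0.0, 1.0, 2.0, 3.0]
poses = [(0.0, 0.0), (2.0, 0.0), (4.0, 0.0), (6.0, 0.0)]
speeds = [2.0, 2.0, 2.0, 2.0]

pose_distance = distance_from_poses(poses)
speed_distance = distance_from_speed(timestamps, speeds)
```

On this idealized input both estimates agree. With real logs they would likely differ, since the pose-based sum is sensitive to localization jitter and the speed-based integral to sensor bias and sampling rate.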


By measuring the distance covered by the ego-vehicle along with its associated uptime between a disengagement and a re-enable signal (manual mode) or between a re-enable signal and a disengagement (autonomous mode), MDBI and MTBI can be extended to cover both manual driving and autonomous driving, as shown in Equations 5-8.

$\mathrm{MDBI}_a$, $\mathrm{MTBI}_a$, $\mathrm{MDBI}_m$, and $\mathrm{MTBI}_m$ (where the subscripts $a$ and $m$ denote autonomous and manual driving) measure not only the overall robustness of an autonomous vehicle but also how dependent the system is on a safety driver whenever disengagements are performed: $\mathrm{MDBI}_a$ and $\mathrm{MTBI}_a$ measure the average distance and time an autonomous car is capable of driving without any interventions, while $\mathrm{MDBI}_m$ and $\mathrm{MTBI}_m$ measure the average manual input required from a safety driver in terms of distance and time elapsed. During actual measurements, special care must be taken to handle divide-by-zero errors in the case where zero interventions are performed.
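A minimal sketch of how the four per-mode metrics could be computed from a timestamped enable/disable log is given below. The sample layout `(timestamp_s, enabled, odometer_m)`, the function name, and the choice to ignore manual driving before the first engagement are assumptions made for illustration.

```python
def intervention_metrics(samples):
    """Compute per-mode MDBI/MTBI from time-ordered samples.

    Each sample is (timestamp_s, enabled, odometer_m). The interval between
    two consecutive samples is attributed to the mode of the earlier sample.
    Returns None when no intervention occurred, avoiding a division by zero.
    """
    totals = {"auto_m": 0.0, "auto_s": 0.0, "manual_m": 0.0, "manual_s": 0.0}
    interventions = 0
    after_intervention = False  # skip manual driving before the first engagement
    for (t0, on0, d0), (t1, on1, d1) in zip(samples, samples[1:]):
        if on0:
            totals["auto_m"] += d1 - d0
            totals["auto_s"] += t1 - t0
        elif after_intervention:
            totals["manual_m"] += d1 - d0
            totals["manual_s"] += t1 - t0
        if on0 and not on1:  # enabled -> disabled toggle is one intervention
            interventions += 1
            after_intervention = True
    if interventions == 0:
        return None
    return {
        "MDBI_a": totals["auto_m"] / interventions,
        "MTBI_a": totals["auto_s"] / interventions,
        "MDBI_m": totals["manual_m"] / interventions,
        "MTBI_m": totals["manual_s"] / interventions,
    }


# Hypothetical log with two interventions.
log = [
    (0, False, 0.0),     # manual driving before the system is engaged
    (5, True, 10.0),     # system enabled
    (105, False, 410.0), # first intervention
    (115, True, 430.0),  # re-enabled
    (215, False, 830.0), # second intervention
    (227, True, 854.0),  # re-enabled
]
metrics = intervention_metrics(log)
```

Returning `None` for an intervention-free log is one way to handle the zero-intervention case; reporting the total distance and time as a lower bound would be another reasonable choice.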

III-B Intervention Maps

Although the metrics introduced allow comprehensive statistical analysis across multiple vehicles, the environments and roads an autonomous vehicle drives on can strongly influence the metrics, i.e., did the vehicle drive on a testing track, on the highway, or in high-traffic scenarios? Therefore, the quality of the data being benchmarked matters. To incorporate the diverse environments an autonomous vehicle must navigate into our benchmarking tools, we introduce the concept of intervention maps.

Intervention maps are specific to testing routes or geographical areas during benchmarking and are encoded in an occupancy grid format that contains normalized disengagement counts over time. This information can be extracted by associating disengagement information with spatial data as given in Algorithm 1. With this intervention occupancy map, the normalized values can be mapped to a color gradient. Furthermore, by declaring a time range for a particular location, these maps can help visualize disengagement patterns and also provide a sense of the quality of the data based on the location.

Data: Enable/disable signal E, Vehicle Pose P
Result: Normalized intervention occupancy grid M
# Disengagement and pose association
DBWPose = []
M = [][]
for each disengagement e in E and its closest-in-time pose p in P do
       DBWPose.append((e, p, closest_timestamp))
end for
# Populate occupancy grid
for each (e, p, closest_timestamp) in DBWPose do
       if p falls inside grid cell M[i][j] then
              M[i][j] = M[i][j] + 1
       end if
end for
# Normalize
M = normalize(M)
Algorithm 1: Intervention count and normalization using an occupancy grid.
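The association, counting, and normalization steps of Algorithm 1 can be sketched in runnable form as below. The grid layout, the cell size parameter, the nearest-timestamp lookup, and normalization by the maximum cell count are all assumptions for illustration; the original normalization scheme is not specified in the text.

```python
from bisect import bisect_left


def closest_pose(pose_times, poses, t):
    """Return the pose whose timestamp is nearest to t (pose_times is sorted)."""
    i = bisect_left(pose_times, t)
    candidates = [j for j in (i - 1, i) if 0 <= j < len(poses)]
    best = min(candidates, key=lambda j: abs(pose_times[j] - t))
    return poses[best]


def intervention_grid(disengage_times, pose_times, poses, cell_size, rows, cols):
    """Count disengagements per grid cell, then scale counts to [0, 1]."""
    grid = [[0.0] * cols for _ in range(rows)]
    for t in disengage_times:
        x, y = closest_pose(pose_times, poses, t)
        r, c = int(y // cell_size), int(x // cell_size)
        if 0 <= r < rows and 0 <= c < cols:  # ignore poses outside the map
            grid[r][c] += 1
    peak = max(max(row) for row in grid)
    if peak > 0:  # assumed normalization: divide by the busiest cell
        grid = [[v / peak for v in row] for row in grid]
    return grid


# Hypothetical data: four poses one second apart, three disengagements.
pose_times = [0.0, 1.0, 2.0, 3.0]
poses = [(0.5, 0.5), (1.5, 0.5), (2.5, 0.5), (2.5, 1.5)]
disengage_times = [0.9, 1.2, 2.9]

grid = intervention_grid(disengage_times, pose_times, poses,
                         cell_size=1.0, rows=2, cols=3)
```

The resulting values in [0, 1] can be passed directly to a color gradient for visualization.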

III-B1 Enhancing Intervention Map Representation

By discretizing the map representation for disengagements over time, areas with difficult scenarios or edge cases can be observed depending on the number of trips performed along a given trajectory. While a number of patterns are identified in the Results section based on observations from intervention maps, additional road network information can be incorporated to provide more context for the information being visualized.

This leads to the proposition of a method that can be used for quantifying the quality of the data being benchmarked. In this case, every trip is separated into individual road segments depending on a set of predefined conditions: (1) Unstructured Road, (2) Regular Road, (3) High-Traffic Road, (4) Freeway, and (5) Development/Private. An unstructured road corresponds to road segments without explicit lane definitions that include dynamic interactions with other road users, such as alleys and pedestrian walkways. Regular and High-Traffic roads, on the other hand, correspond to well-defined roads with speed limits and fully defined right-of-way rules, with the only difference being traffic density. Freeway road segments correspond to roads with continuous lane definitions and no intersections. Development or private roads correspond to testing-and-evaluation road segments that are well controlled for system development, whereas (1)-(4) correspond to realistic and uncontrolled environments. Lastly, each road segment is associated with a speed limit.

A sample occupancy grid map with arbitrary road definitions and types can be seen in Figure 2. With the distance of each road segment and the class types, each trip or planned mission can incorporate road information in terms of a percentage of the total distance traveled. For instance, the route shown can be described as a combination of 180m of unstructured distances, 2,860m of regular roads and 4,300m of freeway segments. In other words, this particular trip corresponds to 2.4% unstructured roads, 39.0% regular roads, and 58.6% freeway road segments.
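The conversion from per-class segment lengths to shares of the total trip distance is simple arithmetic; a small sketch using the sample route lengths quoted above follows. The function and dictionary key names are hypothetical.

```python
def road_type_shares(segment_lengths_m):
    """Return each road class as a percentage of the total trip distance."""
    total = sum(segment_lengths_m.values())
    return {road: 100.0 * length / total
            for road, length in segment_lengths_m.items()}


# Segment lengths (meters) of the sample route described in the text.
route = {"unstructured": 180.0, "regular": 2860.0, "freeway": 4300.0}
shares = road_type_shares(route)
```

The computed shares agree with the quoted percentages to within rounding and sum to 100% by construction.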

Fig. 2: Sample intervention map with different route types.

III-C Autonomous vs Manual Driving Benchmarking

An extension to $\mathrm{MDBI}_a$, $\mathrm{MTBI}_a$, $\mathrm{MDBI}_m$, and $\mathrm{MTBI}_m$ (the autonomous and manual variants of MDBI and MTBI) involves human driver to autonomous system comparison. Table I corresponds to an additional set of metrics introduced in our previous work [6] that can further explain the differences between human drivers and autonomous vehicles in terms of energy consumption, maintenance cost, and control. For example, depending on the steering, acceleration and braking control inputs, more energy may be required to drive along the same routes if a system overcompensates for small errors. As a result, this can impact energy consumption as well as brake and tire wear. These cumulative effects can affect the overall cost of ownership of a vehicle, as well as the environmental impact. For benchmarking purposes, the measured steering, acceleration and braking status reports can be compared in the frequency domain for autonomous and manual driving.
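As a rough illustration of a frequency-domain comparison, the sketch below computes magnitude spectra for two synthetic steering traces with a naive discrete Fourier transform. The signals, the amplitudes, and the idea of locating the bin with the largest excess magnitude are hypothetical choices, not the procedure used in the study.

```python
import cmath
import math


def magnitude_spectrum(signal):
    """Naive O(n^2) DFT; returns |X[k]| / n for k = 0 .. n // 2."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(signal))) / n
        for k in range(n // 2 + 1)
    ]


n = 64
# Hypothetical steering traces: the "autonomous" trace adds a small, faster
# oscillation on top of the same slow maneuver, mimicking overcompensation.
manual = [math.sin(2 * math.pi * 2 * i / n) for i in range(n)]
autonomous = [m + 0.3 * math.sin(2 * math.pi * 10 * i / n)
              for i, m in enumerate(manual)]

manual_spec = magnitude_spectrum(manual)
auto_spec = magnitude_spectrum(autonomous)

# Frequency bin where the autonomous trace has the most excess magnitude.
extra_energy_bin = max(range(len(auto_spec)),
                       key=lambda k: auto_spec[k] - manual_spec[k])
```

Excess magnitude at higher frequency bins would suggest more frequent small corrections, which is the kind of behavior the text links to additional energy use and component wear. For long recordings an FFT routine would be the practical choice over this quadratic-time version.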

While human driver data is not included for direct comparison in the experiments section, these methods have been used for benchmarking Level 4 autonomous trucks as part of a joint TuSimple/UC San Diego effort [14].

| Trigger | Metric | Type |
|---|---|---|
| Energy | Miles per Gallon (MPG) or Charge Consumed | Continuous |
| Maintenance Cost | Brakes and Tire Wear | Continuous |
| Up-time | Time Elapsed Per Trip | Event Driven |
| | Speed, Acceleration, Steering Angle Fourier Transform | |

TABLE I: Metrics for benchmarking autonomous vs manual driving.

IV Data Collection

As part of a collaborative effort between UC San Diego’s Autonomous Vehicle Laboratory (AVL), Mailing Center, Fleet Services, and Police Department, a GEM e6 electric vehicle (Figure 3) retrofitted with a complete drive-by-wire system and full sensor suites was used for conducting field tests at the UC San Diego campus. The design strategies and implementations used in the course of this study are described in [6].

Fig. 3: UC San Diego’s autonomous mail delivery vehicle carrying packages and mail.

IV-A Vehicle Signals Recorded

For the data collection process, our team worked closely with the mailing center to deploy the vehicles for autonomous mail delivery applications over Summer and Fall 2019 while continuously monitoring the systems and collecting data. The vehicle operated under highly dynamic and stochastic environments such as areas with high pedestrian, vehicle and construction activity. To record the various signals required for benchmarking, two tools were used as the basis for data logging: the ROSBAG format [8], as well as a Raspberry Pi logging device that received serialized data and stored it in SQLite databases. Table II corresponds to the different signals recorded as functions of epoch/Unix timestamps. It should be noted that for every autonomous mission, manual notes were taken to log the type of interventions performed and the weather conditions. These notes are most useful for understanding bottlenecks and improving system performance.

| Signal | Representation |
|---|---|
| Vehicle Pose (local map frame) | Position (meters), orientation (quaternion) |
| GPS | Latitude, Longitude, Altitude (ft) |
| IMU | (m/s) (s) |
| Vehicle Speed | (m/s) |
| Vehicle Target Speed | (m/s) |
| Enable/Disable Signal | Disabled / Enabled |
| Acceleration | (Unitless) |
| Brake Control | (Unitless) |

TABLE II: Vehicle Signals Recorded.
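To make the logging setup more concrete, the following sketch stores a few timestamped signals in an SQLite table and recovers the disengagement timestamps from it. The table name, column names, the 0/1 encoding of the enable signal, and the sample rows are all hypothetical; the schema actually used on the logging device is not described in the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database for illustration
conn.execute(
    """
    CREATE TABLE vehicle_log (
        stamp REAL PRIMARY KEY,  -- epoch/Unix timestamp in seconds
        x REAL, y REAL, z REAL,  -- pose position in the local map frame (m)
        speed REAL,              -- vehicle speed (m/s)
        enabled INTEGER          -- assumed encoding: 1 = enabled, 0 = disabled
    )
    """
)

# Hypothetical log rows.
rows = [
    (1567000000.0, 0.0, 0.0, 0.0, 1.8, 1),
    (1567000001.0, 1.8, 0.0, 0.0, 1.9, 1),
    (1567000002.0, 3.7, 0.0, 0.0, 0.5, 0),
]
conn.executemany("INSERT INTO vehicle_log VALUES (?, ?, ?, ?, ?, ?)", rows)

# A disengagement is a sample where the enable signal drops from 1 to 0.
log = conn.execute(
    "SELECT stamp, enabled FROM vehicle_log ORDER BY stamp"
).fetchall()
disengagements = [cur[0] for prev, cur in zip(log, log[1:])
                  if prev[1] == 1 and cur[1] == 0]
conn.close()
```

Keeping every signal keyed by the same timestamp column makes it straightforward to join disengagement events with pose and speed afterwards.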

IV-B Missions

The autonomous mail delivery missions performed in this study consist of two routes within the UC San Diego campus: Warren College and Sixth College, where a trip or mission is defined as a round trip from the mailing center to the drop point and back. Round trip distances to Warren College and Sixth College are 1,903m and 1,588m, respectively. In total, there were 24 trips to Warren College and 29 trips to Sixth College.

For the intervention map representation of the areas covered, the different segments have been classified as either unstructured or regular roads. A map generated using the vehicle pose with the corresponding road types is represented in Figure 4, where a trip to Warren College consists of 412m of unstructured road segments and 1,492m of regular road segments. On the other hand, a trip to Sixth College consists of 916m of unstructured road segments and 672m of regular road segments. In other words, 21.6% of Warren College trips correspond to unstructured road navigation and 57.7% of Sixth College trips correspond to unstructured road navigation.

For every trip performed, a trained safety driver was responsible for supervising the vehicle continuously. At the same time, a second team member monitored the system and recorded manual notes about trip information and intervention details.

Fig. 4: Warren and Sixth colleges routes with road type information and speed limits.

V Results

V-A MTBI and MDBI Results

Between the summer and fall 2019 mail delivery missions, the data from a single autonomous vehicle corresponds to more than 89.9km in autonomous mode. This also corresponds to 6.9 hours of data while the autonomous system was engaged without safety driver intervention.

By separating manual driving segments from autonomous driving segments using the enable/disable signal toggle changes, MTBI and MDBI measurements were estimated. These are reported in Table III for the summer and fall quarters, respectively. The collective statistics from both quarters are shown in the third row.

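| Quarter | MDBI, autonomous (m) | MTBI, autonomous (s) | MDBI, manual (m) | MTBI, manual (s) |
|---|---|---|---|---|
| Summer 2019 | 414.201 | 113.82 | 24.0 | 12.77 |
| Fall 2019 | 283.08 | 84.44 | 19.25 | 11.54 |
| Overall | 380.42 | 106.25 | 22.77 | 12.46 |

TABLE III: MDBI (meters/intervention) and MTBI (seconds/intervention) intervention summary for summer and fall quarters.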

From the $\mathrm{MDBI}_a$ and $\mathrm{MTBI}_a$ metrics (autonomous driving) in Table III, one can infer that, on average, the vehicle drove autonomously for 380m or 106 seconds before an intervention was made. In terms of the dependence on the safety driver that the $\mathrm{MDBI}_m$ and $\mathrm{MTBI}_m$ metrics (manual driving) model, on average, the safety driver intervened for 22.77m or 12.46 seconds. Furthermore, the statistics vary significantly between summer and fall quarters. This difference can be explained by campus traffic and ongoing activities early in the fall quarter, when the mail delivery routes experience higher traffic and foot activity from students moving in or starting classes. Separating these results based on time and testing location can help explain trends and traffic patterns. At the same time, it is important to estimate collective averages to make note of the impact of the software release versions on the overall robustness.

V-B Intervention Map

By applying the intervention map tools introduced, the automatically generated occupancy grid map with raw (unnormalized) intervention count data can be seen in Figure 5. For better visualization, a super-resolution image has been vectorized manually as shown in Figure 6. This figure corresponds to the aggregate data from summer and fall quarters and includes construction zones to better understand the campus dynamics.

Fig. 5: Automatically generated intervention maps for summer (left) and fall (right) 2019 quarters.
Fig. 6: Overall intervention map

In general, the areas with more interventions occur around intersections, but also along unstructured environments and construction sites. Without this information, it is not straightforward to identify shortcomings while processing large collections of data. More specifically, the Warren College mailing center path corresponds to a fork between a wide pedestrian walkway and the main road that is used during mail delivery. While the autonomous vehicle is permitted to drive along those areas while enforcing a 2m/s speed limit, the stochastic interactions with pedestrians are challenging. (Footnote 4: By law, the campus speed limit is set to 25mph, but to ensure safety along pedestrian-shared paths, the autonomous vehicle must adjust to different roads; therefore, some roads require speed adjustments.) This same pedestrian walkway is protected by metallic bollards, which often require a manual intervention since the spacing of the bollards leaves approximately 11cm of clearance on each side of the vehicle.

As previously noted, the data collected from both quarters is a combination of regular and unstructured roads. Out of the 1,903m round trip to Warren College, 411.96m correspond to unstructured road segments and 1,492m correspond to regular roads. On the other hand, out of the 1,588m round trip to Sixth College, 916.2m correspond to unstructured roads and 671.98m to regular roads. In terms of the road categories considered here, Figure 7 visualizes the variation and complexity of the road types, with the vertical axis corresponding to the distance for each road category. This illustrates the importance of the quality of the data being benchmarked: while the autonomous vehicle covered similar overall distances to each college, the variation between regular and unstructured roads is significant.

Fig. 7: Distances travelled for Warren and Sixth colleges routes.

VI Conclusion and Future Work

With the autonomous vehicle data collected from mail delivery missions at UC San Diego during an initial deployment phase, the overall vehicle performance has been quantified in terms of its capability to operate without assistance ($\mathrm{MDBI}_a$ and $\mathrm{MTBI}_a$, the autonomous-mode metrics), its dependence on human input ($\mathrm{MDBI}_m$ and $\mathrm{MTBI}_m$, the manual-mode metrics), and, by utilizing the concept of intervention maps, the types of road conditions that determine the variation and quality of the data. While, in a mean sense, the autonomous mail delivery vehicle required a safety driver intervention every 380m with an average human intervention lasting 23m, the techniques introduced in this study have provided a means of analyzing patterns from the mail delivery missions. These patterns are being actively used to address system shortcomings, such as improvements in pedestrian and vehicle intent recognition, localization and dynamic planning. To encourage other autonomous vehicle entities to benchmark their systems with the methods proposed, our team plans on open-sourcing the data collected from the mail delivery missions along with an online tool to objectively compute the overall system robustness as a function of the quality of the miles traversed and the georeferenced locations of the data collected. We expect that the dissemination of these methods and tools will raise awareness of the overall performance of state-of-the-art autonomous vehicle technology in order to better understand the shortcomings of today’s technology and collectively design better performing systems.


We appreciate the support from Timothy Wheeler and Scott Driscoll from the UC San Diego Mailing Center, as well as Shiqi Tang and Andrew Liang from the AVL for assisting on multiple parts of this project and maintaining the vehicles. We are also grateful for the support we have received from campus operations, facilities and police station. Without their support, this project would not be possible.


  • [1] D. Pomerleau. Ralph: rapidly adapting lateral position handler. In Proceedings of the Intelligent Vehicles ’95. Symposium, pages 506–511, Sep. 1995.
  • [2] Todd Jochem, Dean Pomerleau, B. Sarath Chandra Kumar, and J. Scott Armstrong. Pans: a portable navigation platform. Proceedings of the Intelligent Vehicles ’95. Symposium, pages 107–112, 1995.
  • [3] Robert D Leighty. Darpa alv (autonomous land vehicle) summary. Technical report, ARMY ENGINEER TOPOGRAPHIC LABS FORT BELVOIR VA, 1986.
  • [4] Sebastian Thrun, Mike Montemerlo, Hendrik Dahlkamp, David Stavens, Andrei Aron, James Diebel, Philip Fong, John Gale, Morgan Halpenny, Gabriel Hoffmann, Kenny Lau, Celia Oakley, Mark Palatucci, Vaughan Pratt, Pascal Stang, Sven Strohband, Cedric Dupont, Lars-Erik Jendrossek, Christian Koelen, Charles Markey, Carlo Rummel, Joe van Niekerk, Eric Jensen, Philippe Alessandrini, Gary Bradski, Bob Davies, Scott Ettinger, Adrian Kaehler, Ara Nefian, and Pamela Mahoney. Stanley: The robot that won the darpa grand challenge. Journal of Field Robotics, 23(9):661–692, 2006.
  • [5] Ernst D. Dickmanns. Dynamic Vision for Perception and Control of Motion. Springer, London, 2007.
  • [6] David Paz, Po-Jung Lai, Sumukha Harish, Hengyuan Zhang, Nathan Chan, Chun Hu, Sumit Binnani, and Henrik Christensen. Lessons learned from deploying autonomous vehicles at UC San Diego. In Field and Service Robotics, Tokyo, JP, August 2019.
  • [7] Martin Magnusson. The Three-Dimensional Normal-Distributions Transform — an Efficient Representation for Registration, Surface Analysis, and Loop Detection. PhD thesis, 12 2009.
  • [8] Morgan Quigley, Ken Conley, Brian P. Gerkey, Josh Faust, Tully Foote, Jeremy Leibs, Rob Wheeler, and Andrew Y. Ng. Ros: an open-source robot operating system. In ICRA Workshop on Open Source Software, 2009.
  • [9] James Albus and Senior Fellow. Metrics and Performance Measures for Intelligent Unmanned Ground Vehicles. 2003.
  • [10] Nidhi Kalra, Susan M. Paddock. Driving to safety: How many miles of driving would it take to demonstrate autonomous vehicle reliability? Transportation Research Part A: Policy and Practice, Volume 94, 2016, Pages 182-193
  • [11] Kyle Wiggers. Aurora Urges Autonomous Vehicle Industry to Adopt Better Safety Metrics. [Online]. Available: [Accessed: 29- Feb- 2020]
  • [12] Ryan Beene. Self-Driving Car Industry Needs Better Metrics, DOT Official Says. [Online]. Available: [Accessed: 29- Feb- 2020]
  • [13] Kyle Vogt. The Disengagement Myth. [Online]. Available: [Accessed: 29- Feb- 2020]
  • [14] FreightWaves. Study Finds TuSimple Trucks At Least 10% More Fuel Efficient Than Traditional Trucks. [Online]. Available: [Accessed: 29- Feb- 2020]