Software Reliability Growth Models Predict Autonomous Vehicle Disengagement Events

12/21/2018
by Robert Merkel
Monash University

The acceptance of autonomous vehicles is dependent on the rigorous assessment of their safety. Furthermore, the commercial viability of AV programs depends on the ability to estimate the time and resources required to achieve desired safety levels. Naive approaches to estimating the reliability and safety of autonomous vehicles under development would require infeasible amounts of testing of a static vehicle configuration. To permit both the estimation of current safety and the prediction of the reliability of future systems, I propose the use of software reliability growth models (SRGMs), a standard tool for modelling the reliability of evolving software systems. Publicly available data from Californian public-road testing of two autonomous vehicle systems is modelled using two of the best-known SRGMs. The ability of the models to accurately estimate current reliability, and to predict future reliability from current testing data, is evaluated. One of the models, the Musa-Okumoto model, appears to be a good estimator and a reasonable predictor.


1 Introduction

The safety and reliability of autonomous vehicles (AVs), which are currently under development by a wide variety of companies, is of significant public importance. Regulators in many jurisdictions are currently defining regulatory frameworks for the approval of autonomous vehicles to operate on public roads [1]. Public acceptance of AV technology will depend on deployed AVs being safer for both their passengers and other road users than current human-driven vehicles [2]. Therefore regulatory and operator approval of AVs will require rigorous evidence that AV accident and particularly fatality rates are lower than acceptable thresholds.

Kalra and Paddock [3] have shown the infeasibility of a naive approach to demonstrating the safety of AV systems. In short, if safety were demonstrated by testing a fixed AV system configuration in conditions reflecting typical usage, a fleet of such vehicles would have to be driven 275 million miles (approximately 441 million kilometres) without a fatality for the probability that the fatality rate for that AV system was lower than for conventional vehicles to exceed 95%. Even ignoring the exorbitant cost of such a process, it is highly unlikely that the AV system software and hardware could truly be kept static for long enough to complete such a testing program. Therefore, an alternative approach is required.
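To illustrate the arithmetic behind this figure, the following is a minimal sketch, assuming (as in Kalra and Paddock's analysis) a benchmark human-driven fatality rate of roughly 1.09 fatalities per 100 million miles and that fatalities follow a Poisson process:

```python
import math

# Benchmark human-driven fatality rate: Kalra and Paddock use a figure of
# roughly 1.09 fatalities per 100 million miles (an assumption of this sketch).
benchmark_rate = 1.09 / 100_000_000  # fatalities per mile
confidence = 0.95

# Under a Poisson model, observing zero fatalities over n miles demonstrates a
# rate below the benchmark with the stated confidence once
# n >= -ln(1 - confidence) / benchmark_rate.
required_miles = -math.log(1 - confidence) / benchmark_rate
print(f"Fatality-free miles required: {required_miles:,.0f}")  # roughly 275 million
```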

While external stakeholders are likely to be most interested in the reliability of a system as it currently exists, AV manufacturers also have a considerable stake in predicting the reliability of their future AV systems before they are completed. AV development programs by major manufacturers have cost over one billion US dollars [4] and taken over a decade. The profitability of such investments depends on the time and resources required to build a commercially viable product. Therefore, the ability to estimate future safety improvements would allow AV manufacturers to evaluate whether further investments are financially justifiable.

Evaluating and predicting the reliability of an evolving software system is a well studied problem. Software reliability growth models (SRGMs) have been developed for the purpose [5]. SRGMs allow the statistically rigorous estimation of the current and future reliability of software systems as programming faults are rectified through testing and use. Given that only a small minority of vehicle accidents are caused by hardware failures [6], it is plausible that failure rates of AVs can be modelled using techniques developed for software systems.

Previous analyses of accident data from on-road AV test programs have shown the counter-intuitive result that accident rates have not declined [7] over years of testing and development. Aside from the very limited sample size, this may be because a substantial proportion of accidents are attributable, to varying degrees, to the actions of the drivers of other vehicles involved. Attributing responsibility in collisions between vehicles is self-evidently complex and may render any analysis less valid than might be hoped.

However, the publicly available reports of on-road testing by AV manufacturers to the California Department of Motor Vehicles [8] provide a very useful proxy metric for estimating the progress of an under-test AV system. These reports, as well as listing accidents involving AVs under test, list each occasion on which a disengagement event occurred. A disengagement event is defined as occurring when a human backup driver either:

  • takes over driving after a warning from the AV’s systems that it was not able to proceed safely in accordance with local driving laws.

  • takes over driving on their own initiative where they judged that the AV was not proceeding safely in accordance with local driving laws.

To achieve full autonomous operation (Levels 4 and 5 according to the widely-adopted SAE taxonomy [9]) the rate of these incidents will have to be reduced to negligible levels – even if the human backup drivers were ultimately overzealous in some interventions, customers are unlikely to accept vehicles that put them in driving situations they would themselves consider too risky. It is also at least plausible that the rate of such events is reasonably well correlated with the rate at which accidents which the AV system could have prevented would occur in the absence of the backup driver.

Therefore, the disengagement rate is at least a plausible initial metric for assessing the reliability of AV systems.

In this paper, we examine:

  • whether two well-known SRGMs accurately fit reported disengagement rate data for the two most extensively tested AV systems;

  • whether the two SRGMs can accurately predict future disengagement rates, by fitting models to a subset of earlier data and comparing the model predictions with the later data; and

  • which of the two SRGMs is more useful for these purposes.

2 Background

2.1 Software Reliability Modeling

Software reliability is a measure of the frequency of failures – instances where a software system fails to perform as specified. More formally, the IEEE Software Reliability recommended practice [10] defines software reliability as:

  1. The probability that software will not cause the failure of a system for a specified time under specified conditions.

  2. The ability of a program to perform a required function under stated conditions for a stated period of time.

For our purposes, the first definition is the relevant one, though it is typical to describe reliability by measuring the presence of failures rather than their absence.

A software reliability model (SRM) is therefore a mathematical expression describing the reliability of a system, or more formally “A mathematical expression that specifies the general form of the software failure process as a function of factors such as fault introduction, fault removal, and the operational environment” [10]. In this context, faults are defined as the underlying defects in the software system that are the cause of failures. (The relationship between software faults and failures is a complex one, discussed at length in the software engineering literature, but the subtleties are not pertinent here.)

Software Reliability Growth Models (SRGMs) are a subset of SRMs, based on the observation of the distinct reliability characteristics of software-based systems compared to ones not including software. In hardware reliability modelling, the primary source of failure is physical deterioration, whereas in software the primary cause of failures is design faults, which, once fixed, do not recur [11, p. 7] (to a first approximation; “regression errors” due to poor source code change management are not uncommon, and bug fixes are often less than perfect). Therefore, software systems often demonstrate a characteristic pattern of reliability, where failures are common in early testing/use, but as the underlying errors are fixed the rate of failure drops and reliability improves.

A variety of SRGMs have been proposed, all making slightly different assumptions about the nature of software faults, the efficacy of bug fixing, and testing/usage patterns. Most such models fall into one of two groups, S-shaped and concave [5], based on the characteristic shape of the model when plotted. Figure 1 shows an illustrative example of each model group. The X-axis of the figure represents “time” $t$ (which may or may not be simple calendar time, as we will discuss further) and the Y-axis represents the cumulative number of failures detected from the commencement of testing until time $t$. As can be seen, in the concave model, the rate of failure detection is highest at the beginning of data collection, and decreases as time goes on. In S-shaped models, failure detection rates initially increase as the effectiveness of testing improves, before decreasing as more of the defects in the system are found. In both cases, the rate of failure detection asymptotically approaches zero as $t$ approaches infinity.

Fig. 1: Concave and S-shaped SRGMs. Representative examples of concave and s-shaped SRGMs, illustrating how S-shaped SRGMs initially have a low rate of failure detection which increases then decreases.

Empirical studies of software failure data [5, 12] have shown that different model classes are better fits for different projects. Ullah, Morisio, and Vetro [12] studied the performance of a variety of such models using the histories of real software development projects. They found that the Musa-Okumoto model (a concave model) and the Gompertz model (an S-shaped model) fitted empirical data more accurately than other models. Therefore, for this preliminary study, those two models were selected.

2.1.1 Musa-Okumoto model

The Musa-Okumoto model [13] is a popular concave SRGM. It is based on the following assumptions:

  • that failures occur as a nonhomogeneous Poisson process

  • that the number of failures at the start of the process is 0.

  • that the expected failure intensity - the rate at which failures are expected to occur - declines exponentially with the number of failures detected ($\mu$). More formally, $\lambda(\mu) = \lambda_0 e^{-\theta\mu}$, where $\lambda_0$ is the failure intensity at the start of the process, and $\theta$ is a parameter that describes the rate of failure intensity decline.

In the Musa-Okumoto model, the mean value function - the expected number of failures detected at time $t$ - is designated $\mu(t)$, and is described by the following function:

$\mu(t) = \frac{1}{\theta}\ln(\lambda_0 \theta t + 1)$    (1)
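A minimal Python sketch of this mean value function (the function and parameter names are chosen here for illustration) is:

```python
import numpy as np

def musa_okumoto_mean(t, lambda0, theta):
    """Expected cumulative failures at time (or distance) t under the
    Musa-Okumoto model: mu(t) = (1/theta) * ln(lambda0 * theta * t + 1)."""
    return np.log(lambda0 * theta * t + 1.0) / theta
```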

2.1.2 Gompertz model

The Gompertz model [14] is an example of an S-shaped SRGM. Gompertz growth models have been used to predict growth in a variety of domains, including microbiology as well as software engineering (need citation). The mean value function of the Gompertz model is as follows:

$\mu(t) = a\, b^{\,c^{t}}$    (2)

where $a$ is the total number of failures to be “eventually” detected, and $b$ and $c$ are parameters to be estimated.
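A minimal sketch of this function, assuming the common $a\,b^{c^t}$ parameterisation with $0 < b < 1$ and $0 < c < 1$ (the function name is illustrative):

```python
import numpy as np

def gompertz_mean(t, a, b, c):
    """Expected cumulative failures at time t under the Gompertz model
    mu(t) = a * b ** (c ** t), with 0 < b < 1 and 0 < c < 1."""
    return a * np.power(b, np.power(c, t))
```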

3 Methods

3.1 Data selection

This study uses the publicly available vehicle event reports from AV manufacturers available from the California Department of Motor Vehicles [8].

These reports, which must be provided to the DMV on a yearly basis as a condition of the autonomous vehicle testing permits offered by that state, provide the following information about each AV manufacturer’s program:

  1. A listing of all the vehicles used in testing.

  2. The date, time, and a brief description of the reasons for each disengagement.

  3. The total vehicle distance driven in autonomous mode per calendar month.

  4. The total vehicle distance driven in autonomous mode per calendar month for each individual vehicle in the test program.

Of the 20 AV manufacturers who submitted testing reports for the 2017 calendar year, only two were selected for analysis: Waymo [15], formerly Google’s autonomous vehicle division, and Cruise Automation [16], the autonomous vehicle program of General Motors. These two companies were selected because they have conducted far more on-road testing in California than the other 18 companies in the program: Waymo has completed over 1.4 million miles of testing since commencing in 2014, and Cruise has completed approximately 141,000 miles of testing since commencing in 2015. The next largest California-based testing program has completed less than 10,000 miles of on-road testing.

This selection is not intended as a judgement on the overall maturity and readiness of AV programs, as the other AV manufacturers may be conducting large-scale testing on private roads or in other jurisdictions; rather, there is simply insufficient data in the California data set to evaluate disengagement event trends for the other manufacturers.

3.2 Data preprocessing

In this study, all disengagement events listed in the company reports were included in the analysis. It is likely that in some cases, disengagements occurred as a precaution by the safety driver, and no adverse event would have resulted had the AV continued in autonomous mode. However, the information provided in the data set was insufficient to judge whether any particular disengagement event should be excluded on this basis.

When applying SRGMs such as Musa-Okumoto, “time” can be defined in a number of ways, including “clock time”, CPU time spent executing the software, or the number of tests run. In the context of AVs, the distance driven is the most natural measure of the testing conducted, particularly given that crash rates are typically quoted in terms of distance travelled.

Unfortunately, however, the exact cumulative distance driven at the time of each disengagement event is not reported in the public data set - only the number of miles driven by the driverless car fleet each calendar month. The exact calendar time of each disengagement event is reported, but there is no information about the total fleet hours driven up to the time of each disengagement event. Inspection of the full disengagement event logs and the accumulated testing suggested that, at least in the early stages of development, on-road testing was irregular and conducted for only a small fraction of each calendar month. This meant that, if calendar time was used as the time variable in modelling, one of the assumptions of the Musa-Okumoto model - that disengagements were a random memoryless Poisson process - did not hold in the data available.

Therefore, for each AV manufacturer, for each calendar month in the data set, we calculated the following (a minimal sketch of this step follows the list):

  • The cumulative miles driven by the manufacturer’s vehicle fleet up to and including that calendar month; and

  • The cumulative number of disengagement events.
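A minimal sketch of this preprocessing step, assuming the monthly reports have already been transcribed into a table with one row per calendar month (the column names and values below are illustrative only):

```python
import pandas as pd

# Hypothetical monthly summary transcribed from a manufacturer's DMV report;
# the column names and values are illustrative only.
monthly = pd.DataFrame({
    "month": ["2016-01", "2016-02", "2016-03"],
    "miles_driven": [12000.0, 15000.0, 18000.0],
    "disengagements": [14, 11, 9],
})

# Cumulative miles and cumulative disengagement events, the series used for modelling.
monthly["cumulative_miles"] = monthly["miles_driven"].cumsum()
monthly["cumulative_events"] = monthly["disengagements"].cumsum()
print(monthly)
```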

3.3 Model parameter estimation

Both the Musa-Okumoto and Gompertz models are parameterized; to apply them to a given software system, values for these parameters must be estimated. Because failure data arises from a random process, no choice of parameters will result in a perfect fit, so a criterion specifying what constitutes the “best” fit is required. The two definitions of best fit commonly used for this type of estimation in the literature are maximum likelihood and nonlinear least squares.

Maximum likelihood estimation seeks to choose parameters that maximise the likelihood function of the model. For a model with parameters $\theta$ and observed data, the likelihood function is defined as “the probability of the observed data considered as a function of $\theta$” [17]. Therefore, a maximum likelihood estimate chooses the parameter values under which the observed data was most likely to have occurred.

Nonlinear least squares estimation seeks to find parameters that minimise the sum of the squared differences between the observed data and the values predicted by the model.

In both cases, iterative numerical methods are used to find parameters that minimise or maximise the relevant function.

In linear models with normally distributed errors, the best estimate obtained by least squares is also the maximum likelihood estimate, but the models here are not linear and there is no guarantee the errors are normally distributed. The existing SRGM literature generally uses maximum likelihood (ML) modelling where possible to obtain parameter estimates. Unfortunately, while the likelihood function of the Musa-Okumoto model is known, the closed forms in the literature require that the data be in the form of intervals between individual events.

As the data was not available in this form, least squares estimation was used to estimate model parameters for both the Gompertz and Musa-Okumoto models. The scipy.optimize.curve_fit function in version 0.19 of SciPy [18], using Python 3.6.7rc1 on an Ubuntu 18.10 virtual machine instance, was used for all curve fitting.
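A minimal sketch of how such a fit could be set up with scipy.optimize.curve_fit; the cumulative-mile and cumulative-event arrays and the starting values in p0 are illustrative, not the actual data or settings used in this paper:

```python
import numpy as np
from scipy.optimize import curve_fit

def musa_okumoto_mean(t, lambda0, theta):
    # Musa-Okumoto mean value function, as in equation (1).
    return np.log(lambda0 * theta * t + 1.0) / theta

# Illustrative cumulative miles and cumulative disengagement events per month.
cum_miles = np.array([12e3, 27e3, 45e3, 70e3, 100e3])
cum_events = np.array([14.0, 25.0, 34.0, 41.0, 46.0])

# Nonlinear least squares fit; p0 supplies rough starting values for the optimiser.
params, covariance = curve_fit(musa_okumoto_mean, cum_miles, cum_events,
                               p0=[1e-3, 1e-2], maxfev=10000)
lambda0_hat, theta_hat = params
print(lambda0_hat, theta_hat)
```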

curve_fit also estimates a covariance matrix, which provides variance estimates for the model parameters. The delta method [19] was used to calculate 95% confidence intervals for the model.
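A minimal sketch of a delta-method confidence band for a fitted mean value function, using a numerical gradient with respect to the parameters; the finite-difference step size and the 1.96 normal quantile are assumptions of this illustration, not necessarily the exact procedure used here:

```python
import numpy as np

def delta_method_ci(model, t, params, covariance, z=1.96, rel_step=1e-6):
    """Approximate 95% confidence band for model(t, *params) via the delta
    method: var(mu(t)) ~= g(t)^T Sigma g(t), with g the parameter gradient."""
    params = np.asarray(params, dtype=float)
    fitted = model(t, *params)
    gradients = []
    for i in range(len(params)):
        step = rel_step * max(abs(params[i]), 1e-12)
        bumped = params.copy()
        bumped[i] += step
        gradients.append((model(t, *bumped) - fitted) / step)  # finite difference
    g = np.vstack(gradients)                       # shape (n_params, n_points)
    variance = np.einsum("it,ij,jt->t", g, covariance, g)
    half_width = z * np.sqrt(variance)
    return fitted - half_width, fitted + half_width
```

Applied to the parameter vector and covariance matrix returned by curve_fit, this kind of calculation could produce bands like those shown in Figures 4 and 5.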

3.4 Model accuracy assessment metrics

Previous studies comparing the goodness of fit for various software reliability growth models [12] have used the coefficient of determination ($R^2$) to compare the quality of the model fitting achieved. While this metric works very well for comparing linear models, it can give misleading results when comparing nonlinear models [20]. The most sophisticated comparison metrics for nonlinear models require the ability to calculate the likelihood functions for those models, which was not possible given the data available.

Therefore, for this study, the standard error of estimate [21] was used to compare the goodness of fit of the models. The standard error of estimate is the standard deviation of the observations about the model predictions. The smaller the standard error of estimate, the better the model fits the data.
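A minimal sketch of this metric; conventions differ on whether the residual sum of squares is divided by the number of observations or by the observations less the number of fitted parameters, so the degrees-of-freedom handling below is an assumption:

```python
import numpy as np

def standard_error_of_estimate(observed, predicted, n_params=0):
    """Square root of the residual sum of squares divided by the number of
    observations (less the number of fitted parameters, if supplied)."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    residuals = observed - predicted
    return np.sqrt(np.sum(residuals ** 2) / (len(observed) - n_params))
```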

3.5 Experimental procedure

To compare the usefulness of the two models for AV data, we adapted the procedure of Ullah et al. [12]. As discussed, we use the standard error of estimate to compare the goodness of model fit, rather than coefficient of determination.

3.5.1 Experiment 1: goodness of fit

To assess which model best fitted the disengagement event data from each manufacturer, a nonlinear least squares fit was computed using the entire data set from each AV manufacturer, for both the Musa-Okumoto and the Gompertz models. The standard error of estimate was computed for each model.

3.5.2 Experiment 2: accuracy of predictions

To assess which model most accurately predicts disengagement event rates, a nonlinear least squares fit was computed for the first two-thirds of the data points from each AV manufacturer, for each model. Both models were then plotted against the actual data for the entire data set.

The standard error of estimate, and 95% model confidence intervals, were computed for each model.
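A minimal, self-contained sketch of this procedure, with illustrative data (the confidence interval step, omitted here, would follow the delta-method sketch above):

```python
import numpy as np
from scipy.optimize import curve_fit

def musa_okumoto_mean(t, lambda0, theta):
    # Musa-Okumoto mean value function, as in equation (1).
    return np.log(lambda0 * theta * t + 1.0) / theta

# Illustrative cumulative miles and cumulative disengagement events per month.
cum_miles = np.array([12e3, 27e3, 45e3, 70e3, 100e3, 135e3, 175e3, 220e3, 270e3])
cum_events = np.array([14.0, 25.0, 34.0, 41.0, 46.0, 50.0, 53.0, 56.0, 58.0])

# Fit using only the first two-thirds of the monthly data points.
n_fit = (2 * len(cum_miles)) // 3
params, _ = curve_fit(musa_okumoto_mean, cum_miles[:n_fit], cum_events[:n_fit],
                      p0=[1e-3, 1e-2], maxfev=10000)

# Evaluate the fitted model over the whole series and compute the standard
# error of estimate against all observed points.
predicted = musa_okumoto_mean(cum_miles, *params)
see = np.sqrt(np.sum((cum_events - predicted) ** 2) / (len(cum_events) - len(params)))
print(see)
```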

4 Results

4.1 Experiment 1: goodness of fit

Figure 2 shows the observed cumulative number of disengagement events and kilometres driven, and the least squares model fit for the Musa-Okumoto and Gompertz models for the Waymo data set. Visually, the data set contains no indication of an S-shaped curve; the Musa-Okumoto model appears to be a reasonably good fit for the data, the Gompertz model less so. Error bars are not shown on this plot; confidence intervals for the fitted Musa-Okumoto models, computed as described above, are presented separately in Figures 4 and 5.

Fig. 2: Best-fit models for disengagement events for Waymo data set.

Figure 3 shows the observed disengagement events and best fit models for the Cruise data set. In this case, the Musa-Okumoto model is a far better visual fit than the Gompertz model. This may be due, in part, to the overrepresentation in the model of points where the accumulated distance is small; the least squares estimator weights errors at all provided data points equally, even if they are not equidistributed on the x-axis. Regardless, the visual similarity of the best-fit Musa-Okumoto model to the observed data is striking.

Fig. 3: Best-fit models for disengagement events for Cruise data set.

Table I shows the standard error of estimate for each model and data set. As would be expected from the graphs, the standard error of estimate for the Musa-Okumoto model is lower than for the Gompertz model for both data sets.

                   Data set
Model          Waymo    Cruise
Musa-Okumoto   2.501    2.468
Gompertz       5.699    7.553
TABLE I: Standard error of estimate for SRGMs

Figures 4 and 5 show 95% confidence intervals for the Musa-Okumoto model for the Waymo and Cruise datasets respectively.

Fig. 4: Musa-Okumoto model fit, with confidence intervals for disengagement events for Waymo data set.
Fig. 5: Musa-Okumoto model fit, with confidence intervals for disengagement events for GM Cruise data set.

4.2 Experiment 2: prediction

Attempts to fit the Gompertz model to the first two-thirds of the data points proved fruitless: SciPy’s curve fitting algorithm was unable to converge to a plausible solution for either data set. Therefore, we present results for the Musa-Okumoto model only.

Figure 6 shows a plot of actual disengagement events for Waymo, compared to a Musa-Okumoto model calculated from the first two-thirds of the monthly data points. Figure 7 shows a similar plot for the Cruise data set. Points left of the vertical line were in the first two-thirds of the monthly data points and were used in the model parameter estimation. Note that the proportion of the testing kilometres included in each model differs substantially, as it appears Waymo reduced the number of testing kilometres driven in California in the latter part of the data series, whereas Cruise increased its testing. Table II shows the standard error of estimate for the two models.

Fig. 6: Predictive model for Waymo data set. Cumulative disengagement events for the Waymo data set compared to a best-fit model computed from the first two-thirds of the monthly data points.
Fig. 7: Predictive model for Cruise data set. Cumulative disengagement events for the Cruise data set compared to a best-fit model computed from the first two-thirds of the monthly data points.
Dataset Standard error of estimate
Waymo 2.873
Cruise 2.771
TABLE II: Standard error of estimate for SRGMs computed from first two-thirds of monthly data points

In both cases, it appears that the actual data is broadly consistent with the model predictions. The model predicts fewer disengagement events than were present in the real data in the Waymo data set, but more events than actually occurred in the Cruise data set. It is notable that in both cases, but particularly for the Cruise dataset, the confidence intervals are relatively wide compared to the fits with the full datasets.

5 Threats to validity

There are a number of threats to the validity of this study.

The raw data presented in this paper was collected on behalf of subsidiaries of large, publicly listed companies, which face legal sanction if inaccurate or incomplete data is provided. The calculations performed in this paper were conducted using the SciPy library, a well-tested, popular library for performing statistical modelling. The (trivial) Python code used for the calculations, and the raw data, are available from (source).

However, there are a number of potential confounding factors that the data set provides insufficient information to evaluate. The models assumed that testing and debugging practices did not change, and that the number of kilometres accumulated in California testing correlated linearly with the total testing and debugging effort to date. This is known not to be the full picture for Waymo. Waymo conducts testing in a private testing ground [22], performs simulation experiments on variations of real-world events using fuzz testing [23], and was in the process of shifting much of its on-road testing to Arizona towards the end of the reporting period. Unlike California, Arizona does not require companies to disclose disengagement events or accidents. Fixes for faults found and located through these efforts are presumably incorporated into the cars tested in California, but there is little or no public information on the extent and timelines of this private testing. While less is known about the testing program of GM Cruise, it is highly likely that they also conduct simulation testing and testing on private roads.

It is also unknown whether the purpose and scope of on-road testing changed significantly in the periods covered by the data sets. It seems likely that the range of driving conditions in which AV manufacturers would test their vehicles would be relatively narrow initially, and broaden considerably over time. This broadened scope would presumably include higher-risk driving conditions such as unfavourable weather (AV sensors are known to be affected by snow [24]), increased vehicular and pedestrian traffic, and poorer road conditions. If problems are detected in particular conditions, it is plausible that for a subsequent period testing will focus on confirming the effectiveness of the fixes for those problems, which may affect the disengagement rate for that period.

The robustness of the estimates is somewhat constrained by the inability to use the standard technique for parameter estimation in SRGMs - maximum likelihood (ML) estimation. ML estimation does not assume that deviations from the model are normally distributed, and the use of non-standard (in this domain) estimation techniques increases the risk of mistakes.

Obviously, with a data set of only two autonomous vehicle projects, it is unknown whether other projects will follow a similar reliability growth curve. Nor do the existing data sets show a complete development life cycle, as neither vehicle had reached a point where the manufacturer had sought approval for full autonomous operation without a safety driver. (According to media reports, Waymo plans to launch a fully autonomous taxi service in Phoenix, Arizona, some time in late 2018 [25].) Therefore, it is not clear from the available data whether these SRGMs can provide accurate predictions all the way to the very high reliability levels required for safe commercial operation.

6 Discussion

These results clearly indicate that reported disengagement events in two large AV test programs can be fitted reasonably accurately by standard software reliability growth models. While not as accurate, the ability of the models to predict failure rates is also surprisingly good given all the uncertainties and confounding factors mentioned above. The results also demonstrate that confidence intervals for the models can be calculated, allowing upper bounds on disengagement rates to be calculated, with nominated levels of confidence, for an evolving system. It is even possible to estimate how much further testing is required before the upper bound on the disengagement rate falls below a desired threshold.
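As a point-estimate illustration of that last calculation (setting aside the confidence interval), the Musa-Okumoto failure intensity is $\lambda(t) = \lambda_0/(\lambda_0\theta t + 1)$, the derivative of the mean value function in equation (1), so the cumulative distance at which the estimated intensity reaches a target rate can be solved for directly. The fitted parameter values and target rate below are hypothetical:

```python
# Musa-Okumoto failure intensity: lambda(t) = lambda0 / (lambda0 * theta * t + 1),
# the derivative of the mean value function in equation (1). Setting
# lambda(d) = target and solving for d gives the cumulative distance at which
# the estimated disengagement rate reaches the target.
lambda0_hat, theta_hat = 1.2e-3, 1.0e-2   # hypothetical fitted parameter values
target_rate = 1.0e-4                      # hypothetical target, events per mile

required_distance = (lambda0_hat / target_rate - 1.0) / (lambda0_hat * theta_hat)
print(f"Estimated cumulative miles needed: {required_distance:,.0f}")
```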

6.1 Viability of reliability growth modeling in AVs

The application of the Musa-Okumoto SRGM to the AV disengagement data suggests that, contrary to the assumptions of Kalra and Paddock [3], it may be possible to calculate mathematically sound estimates of the current and future failure rates of AVs based on past behaviour of earlier versions of the system.

Furthermore, rather than trying to model comparatively rare events - fatalities or accidents - using disengagement events is a more feasible way to assess the progress of an AV project. It should be noted that using self-reported disengagement events as a mandated threshold criterion for regulatory approval would be risky, as AV manufacturers might be tempted to modify their software and instruct their drivers not to disengage in marginal situations.

6.2 Models

For these two programs, the Musa-Okumoto model was a far better fit to the complete data sets, and was the only model that (at least with SciPy’s curve fitting algorithm) resulted in a plausible fit when provided with an incomplete data set. The key assumption of an S-shaped model is that testers’ initial efforts will be less effective at revealing faults than later efforts, as they gain experience in testing the system. The nature of road-based testing, and the nature of the faults reported, are not consistent with this assumption, and the data appears to bear this out.

While the simple Musa-Okumoto model was a good fit for the data available, it is unlikely to be the last word in modeling autonomous vehicle failures. Software reliability growth models model testing as a simple numeric quantity, but it is an open question as to whether using real-world testing kilometres travelled is appropriate here given the multifaceted nature of AV testing programs. According to Madrigal [26], many of the improvements in Waymo’s AV program come through their use of simulation, rather than directly from on-road testing. However, the “interesting” scenarios they model are often ones identified in disengagement scenarios. As such, it is possible that simulation, powerful though it is, is ultimately an “amplifier” for real-world testing. That is, it allows the exploration of a huge number of variations of failure scenarios, and gives higher confidence that those failure scenarios will not occur again when similar conditions are encountered in the real world. If so, it may well be that real-world testing kilometres travelled are a reasonable proxy for testing effort, but it remains an open question that can only be clarified with more data.

6.3 Data reporting procedure limitations

As noted in threats to validity, the need to use nonlinear least squares estimation rather than maximum likelihood modeling has increased the risk of invalid assumptions in the models. However, modifications to the data reporting procedure would make it straightforward to compute maximum likelihood models for disengagement event rates. For instance, instead of reporting the number of failures per calendar month and the total distance driven, companies could report the cumulative distance driven by the test fleet at the time of each disengagement event.
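If per-event cumulative distances were reported, a maximum likelihood fit could be set up using the standard non-homogeneous Poisson process log-likelihood (the sum of log intensities at the event distances minus the mean value function evaluated at the total distance driven). A minimal sketch follows; the event distances and starting values are hypothetical:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical cumulative miles at each disengagement event, plus total miles driven.
event_miles = np.array([800.0, 2100.0, 4500.0, 9000.0, 20000.0, 45000.0])
total_miles = 60000.0

def negative_log_likelihood(log_params):
    # Optimise over log-parameters so that lambda0 and theta stay positive.
    lambda0, theta = np.exp(log_params)
    intensity = lambda0 / (lambda0 * theta * event_miles + 1.0)
    expected_events = np.log(lambda0 * theta * total_miles + 1.0) / theta
    # NHPP log-likelihood: sum of log intensities at events minus mu(total_miles).
    return -(np.sum(np.log(intensity)) - expected_events)

result = minimize(negative_log_likelihood, x0=np.log([1e-3, 1e-1]), method="Nelder-Mead")
lambda0_hat, theta_hat = np.exp(result.x)
print(lambda0_hat, theta_hat)
```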

6.4 Related works

In this context, we should also consider the work of Huang et al. [27], who observe that testing driverless cars by reflecting real-world driving conditions is not a particularly efficient way to improve or measure reliability. Their approach is to consider particular driving tasks (in their example, freeway lane changes), build a statistical model of the key parameters governing the variance in lane changes, skew testing (both physical and simulation) heavily towards those areas of the parameter space where failures are likely to occur, and then use the statistical model in reverse to estimate real-world failure rates. SRGMs could, in theory, be used as part of such an approach, as a way to track the decrease in failure rates in specific tasks. It may also be considered desirable to conduct parallel on-road testing replicating normal driving to validate any Huang-style statistical model - and using an SRGM on this data would still be useful as a way to provide ongoing estimates of vehicle failure rates, to compare with predictions.

Favaro et al. [7] examined accident data for the Waymo program through to 2017, and found a simple linear relationship between kilometres travelled and cumulative accidents. They therefore concluded that accident rates had not improved over the course of the testing program. While accidents are an important metric, it does not follow that there have been no improvements in the function of AV systems. Their analysis does not take into account that manual intervention by the human safety drivers is likely to have prevented a significant number of accidents, and those disengagements, as shown in this work, have become much rarer over time. Furthermore, it does not take into account the contribution to accidents of human drivers that an AV could not reasonably be expected to avoid. Favaro et al. also examined the circumstances of disengagement events in some detail.

The present paper, along with Favaro and others who have examined the California DMV data, shows the value of open data sets for enabling research into topics of public importance. However, it also demonstrates the need for expert advice on exactly what data is collected and reported. If the advice of statisticians had been sought, it may well be that cumulative distances driven at the time of disengagement events would have been included in the data set, making maximum likelihood estimates straightforward to calculate.

7 Conclusion

The key contributions of this paper are as follows:

  1. Software reliability growth models (SRGMs) closely fit trends in disengagement events in the two most extensive public road AV test programs in the California DMV public data set, and can therefore be used to rigorously estimate the current disengagement rate.

  2. Software reliability growth models are reasonable predictors of trends in disengagement events in these two programs. This could allow an AV manufacturer or other interested party to predict the expected future disengagement rate, given a specified amount of continued testing and development.

  3. Confidence intervals for model parameters and the model values can be calculated, enabling the estimation of the present or future probability that disengagement rates are lower than a desired threshold.

  4. A representative concave SRGM, Musa-Okumoto, is a better fit to the data than a representative S-shaped SRGM, the Gompertz model.

  5. The mandated format of the California DMV data set does not permit the straightforward use of maximum likelihood model estimation, which is a more robust way of estimating the parameters of an SRGM.

These contributions are of value to developers of AV systems, who may be able to use SRGMs not only to evaluate the present state of an AV system, but also to estimate the time and resources required to achieve safety benchmarks before those resources are invested. They also have implications for regulators, as they offer the potential for rigorous estimates of AV system safety without requiring infeasible amounts of testing.

There are many potential extensions to this preliminary work. As more AV programs mature, an obvious followup is to examine whether the disengagement events for those programs are also effectively modelled using SRGMs. With more data, it should also be possible to consider more rigorously whether the Musa-Okumoto model is really the most applicable to AV road testing data, or if one or more of the many other concave SRGMs is a better fit. It would also be beneficial to investigate integrating the use of SRGMs into the weighted modelling/testing approach of Huang et al. [27].

As well as ML estimation, there are other candidate techniques for parameter estimation and calculation of confidence intervals that may be applicable to this data. Bootstrapping [28] can be used to calculate variances for certain types of time-dependent data. Bootstrapping, and other resampling techniques, may permit more accurate confidence intervals to be calculated on the relatively small data sets available here.
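As one illustration of what such a resampling scheme could look like, the sketch below performs a parametric bootstrap: monthly event counts are simulated as Poisson increments of the fitted mean value function, the model is refitted to each simulated series, and percentile intervals can then be taken over the refitted parameters. This is one reasonable design under stated assumptions, not the procedure used in this paper:

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)

def musa_okumoto_mean(t, lambda0, theta):
    return np.log(lambda0 * theta * t + 1.0) / theta

def bootstrap_parameters(t, params, n_boot=1000):
    """Parametric bootstrap: simulate monthly event counts as Poisson increments
    of the fitted mean value function, refit, and collect the parameter vectors."""
    mean = musa_okumoto_mean(t, *params)
    increments = np.diff(np.concatenate(([0.0], mean)))
    samples = []
    for _ in range(n_boot):
        simulated = np.cumsum(rng.poisson(increments)).astype(float)
        try:
            boot_params, _ = curve_fit(musa_okumoto_mean, t, simulated,
                                       p0=params, maxfev=10000)
            samples.append(boot_params)
        except RuntimeError:
            continue  # skip replicates where the refit fails to converge
    return np.array(samples)

# Percentile confidence intervals could then be taken, for example:
# np.percentile(bootstrap_parameters(cum_miles, params), [2.5, 97.5], axis=0)
```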

The relationship between disengagement events and accidents, the metric of ultimate concern, is also a topic worthy of much further work. While the question of assigning responsibility in accidents is knotty, a working understanding must be found if regulators are to fairly assess the safety of AV systems without unfairly penalizing them for actions of others that human drivers would not be held accountable for. This multifaceted problem will require an interdisciplinary solution, with the social sciences and humanities having at least as much to contribute as science and engineering.

While this paper has concentrated on land-based autonomous vehicles, other types of robotic vehicles face similar concerns. There is currently considerable commercial interest in developing uncrewed, fully autonomous drones for package delivery, and much larger autonomous aircraft for passenger transport [29]. In both cases, there is a need for reliability estimation and prediction, and it is plausible that SRGMs will be useful for this purpose. Therefore, performing a similar analysis to this paper with a drone failure dataset is an opportunity for future work with near-term real-world implications.

In the race to commercialize what is expected to be exceedingly valuable technology, questionable safety practices have been revealed [30] in some AV development programs. Simply trusting assertions that an AV is sufficiently safe is unlikely to, and should not, satisfy the public as to the readiness of an AV system for production use. Whatever the ultimate statistical basis of demonstrating the reliability and safety of autonomous vehicles, it is clear that an explicit, rigorous and statistically sound approach to safety assessment by regulators is required. This paper is, hopefully, a small contribution towards that.

Acknowledgments

The author would like to thank John Grundy and Marcel Boehme for their helpful comments on an earlier draft of this paper.

References