Deep Neural Networks (DNNs) have seen a surge in popularity over the past decade, and their use has become widespread in many fields including safety-critical systems such as medical diagnosis and, in particular, autonomous cars. The latter have driven millions of miles without human intervention [Waymo, CADis], but offer few safety guarantees. This has led to erroneous edge-case behaviours and unforeseen consequences [TeslaCrash]. Thus, there is an urgent need for methods that are capable of accurately detecting, analysing and diagnosing such erroneous behaviours.
A Bayesian Neural Network (BNN) is a neural network with a prior distribution on its weights. BNNs have the ability to capture the uncertainty within the learning model, while retaining the main advantages intrinsic to deep neural networks [mackay1992practical]. As a consequence, they are particularly appealing for safety-critical applications, such as autonomous driving, where uncertainty estimates can be propagated through the decision pipeline to enable safe decision making [mcallister2017concrete]. Consider, for example, a self-driving car that, while driving, finds an obstacle in the middle of the road. Then, the controller may be uncertain on the steering angle to apply and, in order to avoid the obstacle, may choose angles which turn the car either right or left, with equal probability. Nevertheless, if we consider the optimal decision according to this steering angle distribution and a squared loss, then the controller will simply select the mean value of the distribution [bishop2006pattern] and aim straight at the obstacle. As we will show later (Definition 2), having precise quantitative measures of the BNN uncertainty facilitates the detection of such ambiguous situations.
In this paper we develop a novel framework for evaluating the safety of autonomous driving using end-to-end BNN controllers, that is, controllers in which the end-to-end process, from sensors to actuation, involves a single BNN without modularisation. Our framework can be configured with any simulator and assumes that trajectories can be sampled efficiently and are endowed with a probability measure. We demonstrate how to obtain a priori statistical guarantees on the safety of the application of the BNN in a given scenario. In particular, we consider both probabilistic safety, which is the probability that the controller will keep the car safe for a given time horizon, and real-time decision confidence, which is the probability that the BNN is certain of a given decision. By using concentration inequalities, such as Chernoff bounds [chernoff1952measure], we show that both measures can be estimated with arbitrarily stringent a priori guarantees.
We evaluate our methods on experiments performed on the CARLA driving simulator [Dosovitskiy17], where we consider a deep end-to-end controller given by a modified version of NVIDIA’s PilotNet (formerly known as DAVE-2) neural network architecture [bojarski2016end], which we train with three different BNN inference methods: Monte Carlo dropout [gal2016dropout], mean-field variational inference [blundell2015weight], and Hamiltonian Monte Carlo [neal2011hmc]. We consider different training scenarios, including obstacle avoidance and driving on a roundabout, demonstrating how to quantify the uncertainty of the controller’s decisions and utilise uncertainty thresholds in order to guarantee the safety of the self-driving car with high probability. In summary, this paper makes the following main contributions:
We present a framework for evaluating safety of autonomous driving with end-to-end BNN controllers, which is based on a simulator and allows one to obtain and quantify the quality of uncertainty estimates for the controller’s decisions.
We design a statistical framework for evaluating safety of BNN controllers with high probability with a priori statistical guarantees.
We show that this statistical framework can be used to evaluate model robustness to changes in weather, location, and observation noise.
We empirically demonstrate that our real-time statistical estimates can be used to avoid a high percentage of collisions.
I Related Works
Deep end-to-end controllers are rising in popularity as the state-of-the-art method for autonomous driving. Examples of such controllers include CNNs [chen2017end, bojarski2017explaining] and fully convolutional networks with long short-term memory (FCN-LSTM) [xu2017end]. Prior to end-to-end controllers, a rich literature developed on detecting anomalies from sensor output [isermann1984faultdetection]; however, these methods address cases in which sensor outputs deviate from normal ranges and do not detect when the model itself is unsafe. For this, quantification of model and data uncertainty, extracted from BNNs, can be used [kendall2017uncertainties].
To date, the advantages of BNNs have been observed in small test cases. In [lee2018ensemble], an ensemble of BNNs over different modalities (stereo imaging and GPS) is used to drive a 1:5 scale car around an oval track. Further, the authors of [kahn2017uncertainty] use bootstrapping and dropout to generate uncertainty estimates which allow an RC car or quad-rotor drone to predict and avoid collisions.
Beyond these simplified domains important work is being done in scaling end-to-end BNN models to real-world test cases. In [amini2019variational], the authors use a BNN to incorporate GPS and image data to make predictions about long term navigation and localization. [huang2019uncertainty] looks at using uncertainty from a BNN to produce both a distribution of possible future trajectories of a car at an intersection, and a confidence estimate for varying time horizons, with the final goal of augmenting the result of this with a physics-based predictor using confidence estimates. Additionally, in [feng2018towards] BNNs are used on real-world LiDAR data in order to more safely localize objects.
While these works do well to scale BNNs to more practical cases, they are not concerned with analysis of the safety of deployment for BNNs. For this, very few works have been completed. [quilbeuf2018statistical] looks at using statistical model checking (SMC) to evaluate the probability of two different subsystems of an autonomous vehicle controller (therefore not an end-to-end controller) meeting specific key performance indicators (KPIs). Although the results of this paper demonstrated a high probability of meeting the KPIs, the simulator used lacked realistic detail.
We further the investigation into safe deployment of BNNs as end-to-end controllers by scaling exact and approximate inference techniques to realistic simulators. This allows for the contextualization of pointwise uncertainty estimates and enables their use in real-time decision making. Understanding that uncertainty increases for certain inputs (as in [amini2019variational, feng2018towards, huang2019uncertainty]) is important insofar as it encourages the use of uncertainty during deployment; however, evaluating the uncertainty in a pointwise (per image) fashion does not allow us to reason about emergent properties of the incorporation of uncertainty and their safety [cardelli2019robustness]. In order to create safe plans for autonomous vehicles that incorporate uncertainty, we must evaluate the fundamental impact of decisions which are made on the basis of uncertainty (e.g. slowing down when uncertain, or returning control to the user).
II-A Bayesian Neural Networks and Inference
For a test input $x$, a BNN with $C$ output units and an unspecified number (and kind) of hidden layers is denoted as $f^{\mathbf{w}}(x)$, where $\mathbf{w}$ is the random vector of weights. Given $w$, a weight sampled from the distribution of $\mathbf{w}$, we denote with $f^{w}(x)$ the corresponding deterministic neural network with weights fixed to $w$, while $f^{\mathbf{w}}(x)$ denotes the random output induced by the distribution of $\mathbf{w}$. In the case of classification, we consider classification with a softmax likelihood model. Let $\mathcal{D}$ be the training set. Then, we assume a prior distribution over the weights, i.e. $\mathbf{w} \sim p(w)$ (usually depending on hyperparameters, omitted here for simplicity), so that learning for the BNN amounts to computing the posterior distribution over the weights, $p(w \mid \mathcal{D})$, via the application of Bayes' rule. Unfortunately, because of the non-linearity introduced by the neural network architecture, the computation of the posterior cannot be done analytically [mackay1992practical]. Hence, various approximation methods have been studied to perform inference with BNNs in practice. Among these methods, we consider Hamiltonian Monte Carlo (HMC) [neal2011hmc], Mean Field Variational Inference (VI) [blundell2015weight, graves2011practical], and Monte Carlo Dropout (MCD) [gal2016dropout].
Hamiltonian Monte Carlo (HMC) proceeds by defining a Markov chain whose invariant distribution is the posterior $p(w \mid \mathcal{D})$, and relies on Hamiltonian dynamics to speed up the exploration of the space. Differently from the two other methods discussed below, HMC does not make any assumptions on the form of the posterior distribution, and is asymptotically correct. The result of HMC is a set of samples $w_i$ that approximates $p(w \mid \mathcal{D})$.
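To make the sampling loop concrete, the following is a minimal sketch of HMC on a toy target (a two-dimensional standard normal standing in for the weight posterior). The function names, step size and sample count are illustrative assumptions and are not taken from the paper's implementation; only the idea of running a fixed number of leapfrog steps before an accept/reject decision follows the text.

```python
import numpy as np

def hmc_sample(log_prob_grad, w0, n_samples=1000, step=0.2, n_leapfrog=10, seed=0):
    """Toy HMC sampler. log_prob_grad(w) returns (log density, its gradient)."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float)
    logp, grad = log_prob_grad(w)
    samples, accepted = [], 0
    for _ in range(n_samples):
        p0 = rng.standard_normal(w.shape)            # fresh momentum
        w_new, p, logp_new, grad_new = w.copy(), p0.copy(), logp, grad
        # Leapfrog integration of the Hamiltonian dynamics.
        p = p + 0.5 * step * grad_new
        for i in range(n_leapfrog):
            w_new = w_new + step * p
            logp_new, grad_new = log_prob_grad(w_new)
            if i < n_leapfrog - 1:
                p = p + step * grad_new
        p = p + 0.5 * step * grad_new
        # Metropolis test on the total energy H = -log p(w) + |p|^2 / 2.
        h_old = -logp + 0.5 * (p0 @ p0)
        h_new = -logp_new + 0.5 * (p @ p)
        if np.log(rng.uniform()) < h_old - h_new:
            w, logp, grad = w_new, logp_new, grad_new
            accepted += 1
        samples.append(w.copy())
    return np.array(samples), accepted / n_samples

def std_normal(w):
    """Toy target (assumption): log density and gradient of N(0, I)."""
    return -0.5 * (w @ w), -w

samples, acc_rate = hmc_sample(std_normal, w0=np.zeros(2), n_samples=2000)
```

For a BNN, `log_prob_grad` would instead return the log joint of prior and likelihood and its gradient with respect to the weights, which is considerably more expensive per step than this toy target.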
Mean Field Variational Inference (VI) proceeds by finding a Gaussian approximating distribution $q(w) \approx p(w \mid \mathcal{D})$ in a trade-off between approximation accuracy and scalability. The core idea is that $q(w)$ depends on some hyper-parameters that are then iteratively optimized by minimizing a divergence measure between $q(w)$ and the posterior $p(w \mid \mathcal{D})$. Samples can then be efficiently extracted from $q(w)$.
Monte Carlo Dropout (MCD) is an approximate variational inference method based on dropout [gal2016dropout]. The approximating distribution $q(w)$ takes the form of the product between Bernoulli random variables and the corresponding weights. Hence, sampling from $q(w)$ reduces to sampling Bernoulli variables, and is thus very efficient.
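As an illustration of why sampling is cheap under MCD, the sketch below draws weight samples for a single layer by multiplying a fixed weight matrix with a Bernoulli mask. The layer shape, the keep probability and the choice to mask whole input units (rows) rather than individual weights are assumptions made for the example.

```python
import numpy as np

def sample_dropout_weights(W, keep_prob, rng):
    """Draw one weight sample: each input unit (row of W) is kept with prob keep_prob."""
    mask = rng.binomial(1, keep_prob, size=(W.shape[0], 1))
    return W * mask

rng = np.random.default_rng(0)
W = np.ones((4, 3))  # hypothetical layer with 4 inputs and 3 outputs
weight_samples = [sample_dropout_weights(W, keep_prob=0.8, rng=rng) for _ in range(5)]
```

Each element of `weight_samples` defines one deterministic network; repeating the forward pass over such samples yields the output distribution used for uncertainty estimates.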
III Uncertainty Quantification for Autonomous Driving
In this section we first give a description of our framework for evaluating BNN controllers and then introduce different measures for safety characterization in self-driving cars. In particular, in Definition 1 we define probabilistic safety, which is the probability that a BNN controller will keep the car safe, while in Definition 2 we define real-time decision confidence as the probability that the BNN controller is certain of its decision at the current time.
III-A Conceptual Description of our Framework
We model the autonomous driving scenario considered in this paper as a discrete-time controlled stochastic process $\mathcal{M}$ [gihman2012controlled]. $\mathcal{M}$ is a probabilistic model that describes the status of the entire system and takes values in a state space $X$, which includes information on the position, velocity and acceleration of the car, as well as that of all the other vehicles, pedestrians and obstacles on the map. Intuitively, in this paper, $\mathcal{M}$ just represents a white-box system that we assume we can simulate arbitrarily many times.
The control space of the process, which represents the set of variables a controller can modify to drive the behaviour of $\mathcal{M}$, is denoted by $U$ and is typically given by steering angle, braking and acceleration values of the ego car. We assume the controller can only observe a noisy image of the state space coming from the available sensors. Hence, $\mathcal{M}$ is only partially observable. We denote by $O$ the observation space, which is the set of all possible observations. Intuitively, given the current state $x_t$ of $\mathcal{M}$ at time $t$, the controller receives an observation $o_t \in O$ of $x_t$, and synthesizes an action $u_t \in U$ based on this observation. Then, $\mathcal{M}$ transitions to a new state $x_{t+1}$ at time $t+1$. Given $x_t$ and $u_t$, the evolution of $\mathcal{M}$ is probabilistic, as traffic, weather conditions, and other variables are uncertain.
A (memoryless and deterministic) control strategy for $\mathcal{M}$, $\pi : O \to U$, associates to a given observation an action. In this work, as explained in detail in the next section, we train a BNN controller to synthesize $\pi$. We denote a path of $\mathcal{M}$ by $\omega$; that is, $\omega$ is a sequence of states and actions in an execution of the system. Given a strategy $\pi$, we assume there exists a well defined probability measure $P$ over the paths of $\mathcal{M}$ such that, for $S \subseteq X$, $P(\omega(t) \in S \mid \pi)$ is the probability that $\mathcal{M}$ is in $S$ at time $t$ given $\pi$. For instance, this measure is well defined for POMDPs [chatterjee2016decidable]. However, the uncertainty quantification techniques derived in this paper will work also for more general, possibly non-Markov, processes.
III-B Safety Measures for Autonomous Driving
The first problem we consider in Definition 1 is that of computing the probability that a given strategy synthesized by the BNN keeps the car safe. This probability can be used for planning and to certify that a given controller is safe with high probability given the available information. Computing this value can be done in any simulator. Prior to the deployment of an autonomous vehicle it is common for large companies to evaluate the safety of specific test cases [reynolds2018uber]. As a consequence, we believe that a quantifiable notion on the safety of a given controller is pivotal in order to certify a controller, especially if this incorporates learning elements.
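Definition 1 (Probabilistic Safety) Let $S$ be a safe set (a subset of the state space), $\omega$ denote a path of the process $\mathcal{M}$, $[0, T]$ be a time horizon, and $\pi$ be a given policy. Compute
$$P_{\mathrm{safe}}(S, T, \pi) = P\big(\omega(t) \in S \;\; \forall t \in [0, T] \mid \pi\big).$$
Then, for $0 \le \delta \le 1$, we say that $\mathcal{M}$ is safe in $[0, T]$ iff
$$P_{\mathrm{safe}}(S, T, \pi) > \delta.$$
Definition 1 is satisfied if the probability that a path of $\mathcal{M}$ is safe during the interval $[0, T]$ is greater than a threshold. We should also stress that similar probabilistic measures of safety are widely used to certify cyber-physical system models [abate2008probabilistic, Bortolussi:2019:CLM:3347091.3331452].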
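The safety probability can be approximated by simulating paths and counting the fraction that never leaves the safe set. The sketch below does this with a hypothetical stand-in simulator (a one-dimensional random walk for lateral offset) and an interval as the safe set; both are assumptions for illustration, and a real study would draw paths from a driving simulator instead.

```python
import numpy as np

def path_is_safe(path, lower, upper):
    """True iff every state along the path lies in the safe set [lower, upper]."""
    path = np.asarray(path, dtype=float)
    return bool(np.all((path >= lower) & (path <= upper)))

def simulate_path(rng, horizon):
    """Hypothetical simulator: lateral offset of the car as a random walk."""
    return np.cumsum(rng.normal(0.0, 0.1, size=horizon))

def estimate_safety(n_paths, horizon, lower=-1.0, upper=1.0, seed=0):
    """Monte Carlo estimate of the probability that a path stays safe."""
    rng = np.random.default_rng(seed)
    safe = [path_is_safe(simulate_path(rng, horizon), lower, upper)
            for _ in range(n_paths)]
    return float(np.mean(safe))

p_hat = estimate_safety(n_paths=500, horizon=100)
```

Comparing `p_hat` against a chosen threshold then gives the safe / not safe verdict for the scenario under test.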
As explained in greater detail in the next section, in order to synthesize a control strategy $\pi$, we train a BNN and we obtain that, for an image $o$, $\pi(o)$ is determined by the BNN predictions. However, notice that $\pi$ is still deterministic. Hence, it does not take into account the uncertainty in the model predictions, which is intrinsic in the BNN and could be used to quantify the confidence of the model in its decisions. To tackle this issue, for an observation $o$, in the following definition, we consider a notion of trust of $\pi(o)$ based on the probability mass of the BNN around $\pi(o)$. The following problem is stated for regression tasks, but can be trivially extended to classification problems.
Definition 2 (Real-time decision confidence) Given $\epsilon > 0$, let $o_t$ be the observation received at time $t$, $w$ a weight sampled from $\mathbf{w}$, and $\delta \in [0, 1]$ a threshold. Compute
$$P_{\mathrm{conf}}(\epsilon, o_t) = P\big(\| f^{w}(o_t) - \pi(o_t) \| \le \epsilon\big).$$
Then, we say that the decision at time $t$ is confident iff
$$P_{\mathrm{conf}}(\epsilon, o_t) > \delta.$$
Note that the probability measure in the above definition comes from the distribution of the weights in the BNN. In fact, by definition of probability, we can equivalently write $P_{\mathrm{conf}}(\epsilon, o_t) = \mathbb{E}_{w \sim \mathbf{w}}\big[\mathbb{1}[\| f^{w}(o_t) - \pi(o_t) \| \le \epsilon]\big]$, where $\mathbb{1}[E]$ is the indicator function for event $E$. Hence, real-time decision confidence, as defined in Definition 2, seeks to compute the probability mass in a ball around $\pi(o_t)$ and classify a decision as certain if the resulting probability is greater than a threshold. Definition 2 can be violated either when there is high uncertainty (i.e., variance is large) or when the control distribution is multimodal and the most likely mode of the BNN output distribution is far from $\pi(o_t)$. In the experimental results section we show that this measure can be employed together with commonly used measures of uncertainty, such as mutual information [shannon2001mathematical], to quantify in real time the degree to which the model is confident in its predictions, and can offer a notion of trust in those predictions.
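A small numerical sketch may help here. It estimates decision confidence as the fraction of sampled network outputs that fall within a tolerance of the chosen decision, and applies it to a deliberately bimodal steering distribution like the obstacle example from the introduction. The sample values and tolerance are invented for illustration.

```python
import numpy as np

def decision_confidence(sampled_outputs, decision, eps):
    """Fraction of sampled network outputs within eps of the decision."""
    sampled_outputs = np.asarray(sampled_outputs, dtype=float)
    return float(np.mean(np.abs(sampled_outputs - decision) <= eps))

# Hypothetical bimodal output: half the sampled networks steer left, half right.
outputs = np.array([-0.5] * 50 + [0.5] * 50)

mean_decision = float(outputs.mean())  # squared-loss optimum: straight ahead
conf_at_mean = decision_confidence(outputs, mean_decision, eps=0.1)
conf_at_mode = decision_confidence(outputs, 0.5, eps=0.1)
```

In this toy case no probability mass lies near the mean decision, so its confidence is zero, while a decision placed on one of the modes captures half of the mass; either value would fall below a threshold such as 0.6.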
III-C A Statistical Framework for Safety Evaluation
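For the computation of probabilistic safety, $P_{\mathrm{safe}}$, and real-time decision confidence, $P_{\mathrm{conf}}$, we consider a statistical framework, inspired by the techniques developed for statistical analysis of probabilistic models [cardelli2019statistical, legay2010statistical]. In particular, we observe that the satisfaction of both the safety event of Definition 1 and the confidence event of Definition 2 can be seen as Bernoulli random variables, which we can observe by sampling from $\mathbf{w}$, the weights of the BNN, in the case of real-time decision confidence, and by sampling paths $\omega$ in the case of probabilistic safety. After we collect $n$ samples of each random variable, we can build the following empirical estimators
$$\hat{P}_{\mathrm{safe}} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\big[\omega_i(t) \in S \;\; \forall t \in [0, T]\big], \qquad \hat{P}_{\mathrm{conf}} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\big[\| f^{w_i}(o_t) - \pi(o_t) \| \le \epsilon\big],$$
where $w_1, \dots, w_n$ are weights sampled from $\mathbf{w}$ and $\omega_1, \dots, \omega_n$ are paths sampled from the path measure induced by $\pi$. Then, for an arbitrary absolute error bound $\theta > 0$ and confidence $0 < \gamma \le 1$, we obtain that if
$$n > \frac{1}{2\theta^2} \ln\frac{2}{\gamma} \qquad (1)$$
then for $\hat{p} \in \{\hat{P}_{\mathrm{safe}}, \hat{P}_{\mathrm{conf}}\}$ with corresponding true value $p$, it holds that
$$P\big(|\hat{p} - p| > \theta\big) \le \gamma.$$
The above bound is based on Chernoff bounds [chernoff1952measure]. Other sequential schemes, potentially requiring fewer samples, could also be employed [cardelli2019statistical]. However, the bound in Eqn (1) has the advantage of allowing one to determine the required sample size for a given precision before performing the experiments. Hence, it can be trivially parallelized.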
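The sample size can be computed ahead of any experiment. The sketch below assumes the bound takes the standard Chernoff-Hoeffding form, n > ln(2/γ) / (2θ²) for absolute error θ and failure probability γ; the specific θ and γ values are illustrative and not those used in the paper's experiments.

```python
import math

def chernoff_sample_size(theta, gamma):
    """Smallest integer n with n > ln(2 / gamma) / (2 * theta**2)."""
    return math.floor(math.log(2.0 / gamma) / (2.0 * theta ** 2)) + 1

def error_margin(n, gamma):
    """Absolute error guaranteed with probability at least 1 - gamma after n samples."""
    return math.sqrt(math.log(2.0 / gamma) / (2.0 * n))

n_needed = chernoff_sample_size(theta=0.05, gamma=0.05)  # 738 samples
margin = error_margin(n_needed, gamma=0.05)              # just under 0.05
```

Because `n_needed` is fixed in advance, the simulations or weight samples can be distributed across workers and simply averaged afterwards.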
Bayesian End-to-End Controllers for Self Driving
In the experiments considered in this paper we consider a setting where the observation space is given by images from a single camera input, placed on the front centre of the car facing forwards. The control space is the steering angle. Nevertheless, we should stress that the techniques developed in this paper are general and not limited to this scenario.
III-D Data Acquisition and Processing
The experiments in this paper use the CARLA simulator, a state-of-the-art, open-source simulator for autonomous driving research [Dosovitskiy17]. However, we stress that any simulator can be used within this framework, assuming it can simulate car trajectories and generate images that can be used by the controller. All training data, which consists of (image, steering angle) pairs, was acquired within the CARLA simulator, either through manual driving or use of the built-in autopilot. During experiments, we also make use of the car’s trajectory data, which is provided in the form of a list of GPS coordinates from the simulator. Images are converted to grayscale and scaled to a size of 64 × 48 pixels, and steering angles (recorded between -1 and 1) are binned into intervals of tenths. The data recorded consists of three scenarios: a right turn on a roundabout and a straight segment of road with and without an obstacle (stationary vehicle). It is possible to vary the weather within the simulator; however, the weather condition in all of the training data is “clear noon”.
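The preprocessing described above can be sketched in a few lines. This is not the authors' pipeline: the luminance weights, the nearest-neighbour resize, the reading of 64 × 48 as width × height, and the choice of 20 equal-width bins over [-1, 1] are all assumptions made for the example.

```python
import numpy as np

def to_grayscale(rgb):
    """RGB image of shape (H, W, 3) -> grayscale (H, W) via standard luminance weights."""
    return rgb[..., :3] @ np.array([0.299, 0.587, 0.114])

def resize_nearest(img, out_h=48, out_w=64):
    """Nearest-neighbour resize; a real pipeline would likely use an image library."""
    rows = np.arange(out_h) * img.shape[0] // out_h
    cols = np.arange(out_w) * img.shape[1] // out_w
    return img[rows][:, cols]

def bin_steering(angle, n_bins=20):
    """Map steering angles in [-1, 1] to class indices 0..n_bins-1 (width 0.1 for 20 bins)."""
    idx = np.floor((np.asarray(angle, dtype=float) + 1.0) / 2.0 * n_bins).astype(int)
    return np.clip(idx, 0, n_bins - 1)

frame = np.zeros((480, 640, 3))          # hypothetical raw camera frame
small = resize_nearest(to_grayscale(frame))
labels = bin_steering([-1.0, 0.05, 1.0])
```

Treating the binned angle as a class label is what turns steering prediction into the classification task used by the controller.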
We use a modified PilotNet [bojarski2017explaining] architecture for the experiments in this paper. Traditionally, steering angle prediction has been treated as a regression problem. However, it has been shown that posing regression tasks as classification tasks often improves on direct regression training [rothe2015dex]. We fix the convolutional layers and first fully connected layer, and use the final layers for uncertainty extraction (similarly to [ovadia2019can]). For MCD, we use concrete dropout [gal2017concrete] on the final three layers (and leave the fourth fully connected). For VI and HMC, we use four fully connected layers, where the input to the first layer is the set of features extracted from the final fixed network layer.
In our experiments, for an observation $o$, the BNN decision $\pi(o)$ is given by the most likely class. However, we stress that other choices for $\pi$ are possible according to the particular loss function (see e.g. [bishop2006pattern]), and the methods presented in this paper are independent of the criteria for assigning $\pi$.
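One plausible reading of "most likely class" is the argmax of the predictive distribution averaged over weight samples. The sketch below computes that decision together with the mutual information between prediction and weights (predictive entropy minus expected entropy), a standard uncertainty measure for BNN classifiers. The averaging rule, the use of natural logarithms and the example probabilities are assumptions, not details confirmed by the paper.

```python
import numpy as np

def predictive_summary(probs):
    """probs: array (n_weight_samples, n_classes) of softmax outputs.

    Returns (decision, mutual_information), with entropies in nats.
    """
    probs = np.asarray(probs, dtype=float)
    mean_p = probs.mean(axis=0)                    # averaged predictive distribution
    decision = int(np.argmax(mean_p))              # most likely class
    tiny = 1e-12                                   # avoids log(0)
    predictive_entropy = -np.sum(mean_p * np.log(mean_p + tiny))
    expected_entropy = -np.mean(np.sum(probs * np.log(probs + tiny), axis=1))
    return decision, float(predictive_entropy - expected_entropy)

# Sampled networks that agree: low mutual information.
d_agree, mi_agree = predictive_summary([[0.9, 0.1], [0.9, 0.1]])
# Sampled networks that confidently disagree: high mutual information.
d_split, mi_split = predictive_summary([[1.0, 0.0], [0.0, 1.0]])
```

The second case shows why mutual information complements the decision itself: the averaged prediction is a coin flip, and the disagreement between weight samples is what the measure picks up.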
III-E Network Training
This section describes how the networks for each inference technique were trained. Full details of hyper-parameters can be found in the code associated with this work.
MCD The cross-entropy loss function is used, along with the ADAM optimizer with a learning rate of and the dropout probabilities tuned with concrete dropout, which converged to (, , ). The batch size is and it was trained for epochs.
Features are first extracted from the final fixed layer of the network using the weights from the MCD network for these initial layers. Then, we impose prior distributions on the weights of the final four, fully-connected layers. These are normal distributions with meanand variable variance. Inference was then performed using the Edward python library [tran2016edward], and the posterior is also in the form of a normal distribution.
HMC The prior distributions for the HMC networks are as above; however, the posterior here is an empirical distribution based on sampling with the HMC algorithm. We use 10 steps of numerical integration prior to evaluating the acceptance criterion for each sample.
In this section, we describe an extendable experimental set up for computing the measures in Definition 1 and Definition 2. We first show that use of the measure in Definition 2 in conjunction with classical measures of uncertainty can greatly increase the safety of an autonomous vehicle when it is in unfamiliar scenarios. We then consider probabilistic safety as defined in Definition 1 and we show that this measure can be effectively used in order to identify problematic scenarios in which further data acquisition should occur.
IV-A Real-time Collision Avoidance
In Figure 1, we can see an example of a collision avoidance test set up. We place a vehicle 40 meters away from an obstacle in fixed weather conditions along a single roadway. We then train a BNN controller on data collected from safe human driving in this scenario. Below, we describe a general framework for performing collision avoidance which generalizes to any scenario one would like to test. Further, the system that we use can be implemented for any BNN that is trained to drive autonomously, and can detect situations in which the car is uncertain in order to improve safety.
The uncertainty-aware decision system is designed in two stages. In the first stage, we simulate more runs of the vehicle driving without any collision avoidance system present. We rely only on the learned behavior of the vehicle (plots of these runs can be seen in Figure 1). At this stage, we are able to qualitatively understand the behavior of each network posterior in terms of the uncertainty it produces as it approaches the obstacle. The behavior of uncertainty can roughly be seen in the bottom left-hand corner of Figure 2. We note that it is possible, though less desirable, to perform this qualitative evaluation using a held-out test data set. Because the input we observe at time $t$ depends on all of the decisions made up to that time, generating safety or uncertainty estimates based on another controller’s decisions may be inaccurate, since the controller under consideration may have a low probability of ever observing those states. In the second stage, we use the captured information about uncertainty in order to generate actionable warning thresholds. For example, if we see that there is typically a large spike in uncertainty as the car approaches the obstacle, we can use a threshold in order to stop the car when we experience a similar peak in the future.
We use a three-tiered warning system based on real-time decision confidence, as defined in Definition 2, which we write here as $P_{\mathrm{conf}}$. That is, given an image $o_t$ at time $t$, we bin network decisions into four categories based on the value of $P_{\mathrm{conf}}$. Often no warning will be thrown, i.e., $P_{\mathrm{conf}} > \delta_1$ for a given $\delta_1$. However, in the case that we are less than $\delta_1$-certain ($P_{\mathrm{conf}} \le \delta_1$), a standard warning (warning 1) is thrown. A severe warning (warning 2) is thrown when the network is less than $\delta_2$-certain (this assumes $\delta_2 < \delta_1$). Finally, we consider a warning (warning 0) which is thrown when neither a severe nor a standard warning is thrown ($P_{\mathrm{conf}} > \delta_1$), but the predictive distribution exhibits high mutual information, above yet another threshold, in our case 0.45. For our experiments, the constants $\delta_1$ and $\delta_2$ are set to 0.7 and 0.6 respectively. The actions that occur at each of these warnings are also configurable. However, we have set up our system such that mutual information warnings slow down the vehicle, standard warnings slow down the vehicle and alert the operator of potential hazard, and severe warnings cause the car to safely brake and alert the operator that they need to assume control of the vehicle.
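The tiering logic can be written as a short function. The thresholds 0.7, 0.6 and 0.45 are the values quoted in the text; the function name, the string labels, and the treatment of values exactly equal to a threshold are assumptions of this sketch.

```python
def warning_level(confidence, mutual_info, delta_1=0.7, delta_2=0.6, mi_threshold=0.45):
    """Map decision confidence and mutual information to a warning tier.

    Returns "warning 2" (severe), "warning 1" (standard), "warning 0"
    (mutual-information warning), or None when no warning is raised.
    """
    if confidence <= delta_2:
        return "warning 2"   # brake safely and hand control to the operator
    if confidence <= delta_1:
        return "warning 1"   # slow down and alert the operator
    if mutual_info > mi_threshold:
        return "warning 0"   # slow down
    return None

levels = [
    warning_level(0.90, 0.10),  # confident, low mutual information
    warning_level(0.90, 0.50),  # confident, high mutual information
    warning_level(0.65, 0.00),  # between the two confidence thresholds
    warning_level(0.50, 0.90),  # below the severe threshold
]
```

Checking the severe threshold first guarantees that the most conservative action wins whenever several conditions hold at once.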
Setting these thresholds requires a delicate trade-off between autonomy and safety. If the thresholds are set too low, then the system will operate more autonomously (that is, without asking for user intervention); however, it may be less safe. Setting the thresholds too high may be safer, but causes the car to operate less autonomously as the user is constantly prompted for input. In Figure 2, we show that these sorts of collision avoidance systems can perform well in practice. We show that we can detect and reduce the rate of collision (the complement of probabilistic safety), improving the safety in unknown conditions from 0.00 () to 0.90 (), see Figure 1. Moreover, we test that implementation of this strategy does not affect the autonomy of the car in known situations. For this we simulate the situation in which the car was trained and we find that the car still operates with safety probability 1.00, with error margin of 0.05 according to Equation 1, and full autonomy (i.e. it never stops to ask the user to assume control of the car).
IV-B Probabilistic Safety Estimates
In order to measure the safety of a BNN controller in a particular setting, one must simulate scenarios (e.g. turns, collision avoidance, intersections) in various conditions in order to satisfy the bound in Equation 1. Though we do simulations in order to test the safety of a turn, running the correct number of simulations with diverse environmental conditions works on any scenario one would like to test. For example, the notion of probabilistic safety is also used to calculate the safety in Figure 1.
Figure 3 shows the test setup for probabilistic safety estimates. We place a vehicle approximately 10 meters from the entrance of a roundabout in fixed weather conditions. We then collect training data using the built in autopilot. The autopilot is set to drive the car through the roundabout, taking the first exit. We then use our safety boundaries to determine the probability that a specific controller will drive safely, that is, stay within our safety boundaries. We are then left with safety probabilities for each section of road tested, for each controller.
While we expect the controller to be able to safely navigate from its trained starting point to the end point in the weather it has seen, we seek to test the robustness of posterior distributions to changes in scenery and weather conditions in order to also include simulations of potential worst-case deployment performance. In row (a) of Figure 3 we see that, while the variance can be useful in collision avoidance, the wide variance of HMC causes a larger proportion of trajectories to fall outside of the safety boundary. The estimated probability of safety for HMC, across all weathers, was 0.766 (). Row (b) of Figure 3 reports the consistency of VI across different weather conditions with a cumulative safety probability estimate of 0.91 () in this particular test case. The main reason for lack of safety in VI was veering into the center lane of the roundabout. Finally, in row (c) we see the performance of MC Dropout. In the training environment, it was the only method to achieve a perfect safety score; however, we see the network fails to generalize well to other weather conditions. While MC Dropout performs slightly better than HMC in the more dim light of the afternoon, it fails catastrophically in the rain. MC Dropout’s overall probabilistic safety, prior to the consideration of rain, was 0.87. When we factor in rainy environments, the overall probabilistic safety of MC Dropout falls to 0.58 (). It is likely that if we were to retrain MC Dropout in all weather conditions and re-run the safety analysis we would see a perfect safety score, as we do currently with clear weather. In this way, we can use our offline safety probability as a guide for active learning in order to increase data coverage and scenario representation in training data.
We presented a framework for evaluating the safety of end-to-end BNN controllers for self-driving cars, which allows one to obtain uncertainty estimates for the controller’s decisions with a priori statistical guarantees. On experiments performed on the CARLA driving simulator we showed that our statistical framework can be used to evaluate model robustness to changes in weather, location, and observation noise. Further, we illustrate how our results can be successfully employed to detect and avoid a high percentage of collisions.