NNs are currently among the most powerful tools for solving difficult decision-making problems. Notably, NNs can perform rapid and complex computations through relatively simple nonlinear calculations and massively parallel structures. Because of this, NNs have been applied to a variety of difficult classification and regression problems, from object detection to robotics control. One task that benefits from the performance of NNs is end-to-end imitation learning for autonomous driving, where the difficulty lies in mapping sensor inputs to driving commands. Previous work, such as [1, 2, 3], shows the successful use of end-to-end imitation learning in autonomous driving and manipulation with visual inputs. However, much of this previous work does not investigate the robustness of the learned model to sensor failure.
Although NNs are capable of completing difficult tasks in a variety of applications, they are not without drawbacks. One drawback is that it can be nearly impossible to determine what the output of a NN will be for a given input without evaluating the NN itself, due to its nonlinear computational structure and large number of parameters. Another drawback is that NNs are heavily reliant on data; they generally cannot make use of prior knowledge such as dynamics models. When confronted with a new input, the output of the network can vary drastically, even if the input is similar to training data. Even small perturbations to the input can alter the output of deep NNs [4, 5], and deep NNs are easily fooled [6]. This means we do not have a consistent mapping from inputs to outputs. In the context of safety, traditional NNs also do not provide a measure of uncertainty for their output.
In recent years, however, new improvements have been made to probabilistic NNs. BNNs are a probabilistic network structure that produces a distribution of outputs rather than a single output. With the ability to measure uncertainty, NNs become a viable option for ensemble techniques used in decision making. Ensemble techniques maintain a set of hypotheses from which they choose one as the output. In [7], ensembles of perturbed models are used to perform trajectory optimization that is robust with respect to model uncertainty. The work in [8] demonstrated that a simple ensemble model can effectively approximate the predictive uncertainty of Deep Learning (DL) if the objective function obeys a proper scoring rule; it used multiple NNs with different initializations as the individual models of an ensemble for approximating predictive uncertainty. However, the obtained predictive uncertainty was not directly used to improve performance on the target task.
With knowledge of the uncertainty of each hypothesis, ensemble techniques can be used in safety-critical systems, where failure of the system has tragic results. In this work, we propose a novel ensemble of end-to-end BNNs to provide an elegant solution to sensor failure in safety-critical systems. Our method is applied to the platform seen in Fig. 1, with the task of agile autonomous driving. During aggressive maneuvers on harsh terrain, sensors can fail from damage or become unable to operate effectively under rapidly changing conditions.
The rest of this paper is organized as follows: in Section II, we provide background information for the key ideas used in this paper. In Section III we introduce our ensemble BNNs structures and provide the algorithm for decision making. We discuss the expert used for data collection in Section IV and present results in Section V. Finally, we give our conclusions and discuss future work in Section VI.
II-A Aleatoric and Epistemic Uncertainty
Model uncertainty can be classified into two major categories: aleatoric and epistemic uncertainty. Aleatoric uncertainty is a result of the model's inability to fully describe the environment, while epistemic uncertainty is a result of the inability to acquire unlimited data. In the first case, uncertainty arises when different outcomes are obtained even with the same experimental setup; its source is hidden variables that cannot be perfectly characterized or measured. In the second case, uncertainty arises when the model is presented with data not seen previously; its source is a data set that does not fully cover the sample space. In practice, it is not possible to completely eliminate either form of uncertainty, as we have access to neither a perfect model nor unlimited data.
The origin of aleatoric uncertainty suggests that we should be able to train a model to output this type of uncertainty given data. Meanwhile, we should also be able to measure epistemic uncertainty of a model through some form of sampling. In this paper, the total predictive uncertainty is calculated to be the combination of both uncertainty types.
II-B Bayesian Neural Networks

One approach to building a BNN is to learn an explicit distribution over the network weights [11]. A second approach uses dropout layers to produce a predictive distribution resulting from a probabilistic network structure [12]. The Monte Carlo dropout approach is adopted in this paper, since the alternative approach requires at least doubling the number of parameters in the network, which makes it difficult to run large-scale convolutional NNs with only the computational resources on board the vehicle. Using the existing NN structure with dropout added to every weight layer, weights in the network are randomly dropped with a certain probability. At every forward pass, we sample a dropout mask from a Bernoulli distribution to determine the weights dropped in each layer; during the backward pass, only the remaining weights are updated. The outputs of the network then form a Gaussian distribution, returning the mean and variance of the prediction, which are representative of data uncertainty. The work in [12] shows the mathematical equivalence between an approximated deep Gaussian process and a NN with arbitrary depth and nonlinearities when dropout layers are applied before and after every weight layer. The output distribution is estimated with Monte Carlo sampling, which can be done in parallel to reduce run-time.
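As a concrete illustration, the Monte Carlo dropout procedure above can be sketched in numpy. This is a minimal sketch, not the paper's implementation: the toy two-layer network, its random placeholder weights, and the dropout rate are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer network; the weights are random placeholders,
# not anything trained on the paper's data.
W1, b1 = rng.normal(size=(4, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 1)), np.zeros(1)

def stochastic_forward(x, p_drop=0.1):
    """One forward pass with a freshly sampled Bernoulli dropout mask,
    kept on at test time exactly as during training."""
    h = np.maximum(x @ W1 + b1, 0.0)      # ReLU hidden layer
    keep = rng.random(h.shape) > p_drop   # Bernoulli keep-mask
    h = h * keep / (1.0 - p_drop)         # inverted-dropout rescaling
    return (h @ W2 + b2).item()

def mc_dropout_predict(x, n_samples=10):
    """Monte Carlo dropout: repeat the stochastic pass and summarize
    the samples as a predictive mean and variance."""
    samples = np.array([stochastic_forward(x) for _ in range(n_samples)])
    return samples.mean(), samples.var()

mu, var = mc_dropout_predict(np.ones(4), n_samples=50)
```

In practice the samples can be computed in parallel as a single batched forward pass, which is how the run-time reduction mentioned above is achieved.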
II-C Imitation Learning
Imitation Learning (IL), also called "learning from demonstration", is a type of supervised machine learning. IL is often used when the optimal solution to the task is not easily accessible or is too computationally expensive to run in real time. IL algorithms assume that an oracle policy, or expert, is available. The expert can utilize resources that are unavailable to the imitation learner at test time, such as additional sensory information and computing power. In the case of autonomous driving, the expert can be a sophisticated optimal control algorithm or an experienced human driver. The observation-action or state-action pairs generated by the expert are then used to train the imitation learner. The goal of IL is to mimic the expert's behavior as well as possible. In [1], IL's ability to perform the autonomous driving task with low-cost sensors is demonstrated in real-world experiments.
In the traditional formulation, the goal is often to find a policy $\pi$ that minimizes an expected loss (or maximizes an expected reward) over a discrete finite time horizon $T$:

$$\pi^{*} = \operatorname*{arg\,min}_{\pi} \; \mathbb{E}_{p_{\pi}}\Big[\sum_{t=0}^{T-1} \ell(x_t, u_t)\Big], \quad u_t = \pi(o_t),$$

where $\mathbb{X}$, $\mathbb{O}$, and $\mathbb{U}$ are the state, observation, and admissible control spaces respectively ($x_t \in \mathbb{X}$, $o_t \in \mathbb{O}$, $u_t \in \mathbb{U}$), $\ell$ is the immediate loss function, and $p_{\pi}$ is the joint distribution of $x_t$, $o_t$, and $u_t$ under the policy $\pi$ for $t = 0, \dots, T-1$.

For imitation learning, this equation changes slightly. The goal now becomes to learn a policy $\pi_{\theta}$ that minimizes a loss characterizing the difference between the learned policy and the expert policy $\pi^{*}$, rather than the optimal control objective itself:

$$\theta^{*} = \operatorname*{arg\,min}_{\theta} \; \mathbb{E}_{p_{\pi_{\theta}}}\Big[\sum_{t=0}^{T-1} \ell\big(\pi^{*}(o_t), \pi_{\theta}(o_t)\big)\Big],$$

where $\theta$ denotes the parameters of the chosen neural network policy $\pi_{\theta}$. In our work, we trained our networks with batch imitation learning.
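The batch imitation learning objective above can be sketched with a toy example. This is an illustrative sketch only: the linear "expert", the data sizes, and the learning rate are assumptions, standing in for the optimal-control expert and deep networks used in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical expert: a fixed linear map from observations to actions
# (a stand-in for the optimal-control expert described later).
W_expert = rng.normal(size=(3, 1))
obs = rng.normal(size=(256, 3))          # recorded observations o_t
acts = obs @ W_expert                    # expert actions pi*(o_t)

# Batch IL: fit the learner policy to the expert's observation-action
# pairs by gradient descent on the mean-squared imitation loss.
W = np.zeros((3, 1))
lr = 0.1
for _ in range(500):
    residual = obs @ W - acts            # pi_theta(o_t) - pi*(o_t)
    W -= lr * (obs.T @ residual) / len(obs)

mse = float(np.mean((obs @ W - acts) ** 2))
```

The key property of the batch setting is visible here: the expert's demonstrations are collected once, and the learner is fit to that fixed data set rather than querying the expert during training.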
III Ensemble Bayesian Decision Making
III-A Problem Formulation
The main problem considered in this paper is the autonomous driving task for the AutoRally platform (Fig. 1) [9], a scaled high-speed race car developed for autonomous driving experiments. As mentioned in Section I, many applications of deep end-to-end control do not provide a principled solution to sensor failure. Most self-driving cars today depend on several kinds of sensors, including Lidar, Radar, GPS, and cameras. However, perfect sensors do not exist in the real world, as all of these sensors are vulnerable to noise. For example, differential GPS (dGPS) is widely used in autonomous driving to obtain global positions in the world frame. Despite many advances in GPS technology, there is always a chance that the GPS signal jumps or slightly diverges from the true position, and in areas with obstacles such as tall buildings or indoor parking lots, GPS tends to fail altogether. Cameras, in turn, are sensitive to lighting conditions and interference from external sources. In autonomous driving, even a slight shift of the GPS data or an obscured camera may cause the car to cross the center line of the road, with significant consequences. To avoid these failures, system redundancy is crucial for the safe operation of autonomous driving.
System redundancy is commonly applied in safety-critical applications, where multiple backup systems exist to prevent catastrophic failure from one faulty component, as shown in [13]. Redundancy is usually achieved either by duplicating the same system or by using different systems that perform the same task. It is easy to have duplicate systems available in case of failure, but they are vulnerable to faults of the underlying system. Dissimilar backups, where different hardware, software, and control laws are used in each backup system, can alleviate this problem, but it is hard to determine how the backup systems should be prioritized when a failure occurs.
In this work, an ensemble Bayesian decision-making process is used to provide system redundancy. Multiple BNNs are implemented on the vehicle, each taking in a different sensory input and each capable of performing the task on its own. Each BNN is trained end-to-end, learning the low-level control actions from its sensor input. When one or more of the sensors is compromised, the predictive uncertainty associated with the failed sensor should increase significantly, causing the system to switch to the remaining functional networks.
III-B Ensemble Structure
Our ensemble consists of 3 BNNs, which differ in both their network inputs and their network structures. Each outputs the mean, $\mu$, and the variance, $\sigma^2$, of the prediction due to aleatoric uncertainty. The first BNN is trained on the fully observable state data gathered from the GPS module. Its network structure is fully connected with ReLU activation functions and layers of width 1024, 512, 256, and 128, respectively. The second network is trained on images taken from the camera on the left side of the vehicle, shown in Fig. 2. It uses a VGG 16-like network [14], with modifications to include dropout at each layer and to output the variance stemming from aleatoric uncertainty [10]. The last network is trained on images taken from the camera on the right side of the vehicle, shown in Fig. 2, and has the same network structure as the second network. The overall network structure can be seen in Fig. 3.
To ensure that the outputs of each BNN are the mean and variance due to aleatoric uncertainty, the heteroscedastic loss function defined in [10] is used:

$$L(\theta) = \frac{1}{N}\sum_{i=1}^{N} \frac{\|y_i - \mu_{\theta}(x_i)\|^2}{2\sigma_{\theta}^2(x_i)} + \frac{1}{2}\log \sigma_{\theta}^2(x_i), \tag{3}$$

where $\pi_{\theta}$ is the current policy of the network, $y_i$ is the expert's action for input $x_i$, $\mu_{\theta}(x_i)$ is the aleatoric mean for input $x_i$, and $\sigma_{\theta}^2(x_i)$ is the aleatoric variance for input $x_i$. To see how $\sigma_{\theta}^2$ is a measure of the aleatoric variance, consider how aleatoric variance should behave. Predictions $\mu_{\theta}(x_i)$ that are close to the expert's output $y_i$, resulting in a low residual error, should have low aleatoric variance; predictions that are far from the expert's output, resulting in a high residual error, should have high aleatoric variance. To minimize Eq. 3 when the residual error is high, $\sigma_{\theta}^2$ must increase so that the residual error does not have a strong impact on the loss; when the residual error is small, $\sigma_{\theta}^2$ also needs to be small in order to minimize Eq. 3. Intuitively, since $\sigma_{\theta}^2$ follows the increase or decrease of the residual error to obtain a minimal loss, [10] concludes that $\sigma_{\theta}^2$ is at least an approximation of the aleatoric variance. In practice, the heteroscedastic loss function is modified slightly to:

$$L(\theta) = \frac{1}{N}\sum_{i=1}^{N} \frac{1}{2}\exp(-s_i)\,\|y_i - \mu_{\theta}(x_i)\|^2 + \frac{1}{2}s_i, \tag{4}$$

where $s_i = \log \sigma_{\theta}^2(x_i)$. By regressing with $s_i$, we avoid a potential 'division by zero' error and can still easily recover $\sigma_{\theta}^2 = \exp(s_i)$. These means and variances are then used to find the output of the ensemble network, as described in the next section.
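The behavior of the log-variance form of the loss can be checked numerically. This is a minimal sketch: the toy targets and predictions below are made-up values chosen only to show that matching the predicted variance to a large residual lowers the loss.

```python
import numpy as np

def heteroscedastic_loss(y, mu, s):
    """Log-variance heteroscedastic loss with s = log(sigma^2):
    no division by a raw variance, and sigma^2 = exp(s)."""
    return float(np.mean(0.5 * np.exp(-s) * (y - mu) ** 2 + 0.5 * s))

y  = np.array([1.0, 2.0, 3.0])   # expert actions (illustrative)
mu = np.array([1.1, 1.9, 3.5])   # predicted aleatoric means
s  = np.zeros(3)                 # log-variances: sigma^2 = 1 everywhere

base = heteroscedastic_loss(y, mu, s)

# Matching sigma^2 to the squared residual of the worst prediction
# lowers the loss -- the mechanism by which sigma^2 tracks residual error.
s_adapted = s.copy()
s_adapted[2] = np.log((y[2] - mu[2]) ** 2)
adapted = heteroscedastic_loss(y, mu, s_adapted)
```

Minimizing the per-sample term over $s_i$ in closed form gives exactly $s_i = \log \|y_i - \mu_i\|^2$, which is why the adapted loss is lower here.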
To get a better-calibrated uncertainty measure, we used Concrete Dropout [16], which allows for automatic tuning of the dropout probability in large models. The output of the redundant system structure is calculated in Algorithm 1. As described in Section II-B, we need to sample our networks multiple times in order to generate the output predictive distribution. In Algorithm 1, instead of conducting multiple runs of the network on a single input $x$, we duplicate each input $N$ times to create an input sequence $X = [x, \dots, x]$ and pass this sequence through the network, where $N$ is the number of samples used for Monte Carlo sampling. The two outputs of network $k$ are a vector of control commands $[\mu_{k,1}, \dots, \mu_{k,N}]$ and a vector of aleatoric variances $[\sigma_{k,1}^2, \dots, \sigma_{k,N}^2]$. Using these vector outputs, the overall variance (aleatoric and epistemic combined) of each BNN is calculated with the following equation (line 7 in Algorithm 1):

$$\sigma_k^2 = \underbrace{\frac{1}{N}\sum_{i=1}^{N}\mu_{k,i}^2 - \Big(\frac{1}{N}\sum_{i=1}^{N}\mu_{k,i}\Big)^2}_{\text{epistemic}} + \underbrace{\frac{1}{N}\sum_{i=1}^{N}\sigma_{k,i}^2}_{\text{aleatoric}},$$

where $\mu_{k,i}$ is the $i$-th sampled aleatoric mean of network $k$ and $\sigma_{k,i}^2$ is the $i$-th sampled aleatoric variance of network $k$. The control of each network is calculated as the mean of that network's sampled outputs:

$$u_k = \frac{1}{N}\sum_{i=1}^{N}\mu_{k,i}.$$

Finally, the output is chosen from the network with the lowest variance, $u = u_{k^*}$ with $k^* = \operatorname*{arg\,min}_k \sigma_k^2$, as shown in line 8 in Algorithm 1.
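The selection step can be sketched compactly. This is an illustrative sketch of the decision rule only, not Algorithm 1 itself: the three sets of sampled outputs below are hypothetical values standing in for the GPS, left-camera, and right-camera networks.

```python
import numpy as np

def predictive_variance(mus, ale_vars):
    """Total variance of one BNN from N Monte Carlo samples:
    epistemic (spread of the sampled means) plus aleatoric
    (mean of the sampled variances)."""
    epistemic = np.mean(mus ** 2) - np.mean(mus) ** 2
    aleatoric = np.mean(ale_vars)
    return epistemic + aleatoric

def ensemble_decide(samples):
    """samples: list of (mus, ale_vars) pairs, one per BNN.
    Returns the mean control of the lowest-variance network
    and that network's index."""
    variances = [predictive_variance(m, v) for m, v in samples]
    best = int(np.argmin(variances))
    return float(np.mean(samples[best][0])), best

# Hypothetical sampled outputs for three networks:
gps   = (np.array([0.10, 0.11, 0.09]), np.array([0.01, 0.01, 0.01]))
left  = (np.array([0.50, 0.10, 0.90]), np.array([0.05, 0.05, 0.05]))
right = (np.array([0.20, 0.25, 0.15]), np.array([0.02, 0.02, 0.02]))

u, chosen = ensemble_decide([gps, left, right])
# chosen == 0: the first network has the smallest combined variance here.
```

When a sensor fails, both terms tend to grow for its network (the sampled means scatter and the predicted aleatoric variances rise), so the argmin naturally shifts to the remaining functional networks.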
IV Data Collection
In order to train each learner (BNN) in the ensemble, the iterative Linear Quadratic Gaussian/Model Predictive Control Differential Dynamic Programming (iLQG/MPC-DDP) algorithm was used as an expert. Differential Dynamic Programming (DDP) is an algorithm that uses second-order approximations of the cost function and system dynamics around a nominal trajectory to solve the Bellman equation; the resulting optimal control policy is then used to update the nominal trajectory. Running DDP in Model Predictive Control (MPC) fashion means that at every timestep, only the first control action is executed by the system, and the control policy is re-optimized at the next timestep when new state information is received. For our long-term autonomous driving task, we used the receding-horizon DDP of [17].
Using GPS data, the expert had the state space $[x, y, \psi, \phi, v_x, v_y, \dot{\psi}]$ as input, where $x$ and $y$ are global positions in the world frame, $\psi$ and $\phi$ are the heading and roll angles of the car, $v_x$ and $v_y$ are the body-frame longitudinal and lateral velocities, and $\dot{\psi}$ is the derivative of the heading angle.
The cost function for the optimal controller was composed of an arbitrary state-dependent cost and a quadratic control cost. The state-dependent term was designed to keep the vehicle at the center of the track while maintaining a desired forward velocity $v_{des}$, which we set to 5 m/s during data collection. For the control cost, we used the same weights for both throttle and steering.
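A running cost of this shape can be sketched as follows. This is a hedged illustration only: the weight values, the state layout, and the function name are assumptions, not the paper's actual cost parameters.

```python
import numpy as np

def running_cost(state, control, v_des=5.0,
                 w_track=1.0, w_vel=1.0, w_ctrl=0.5):
    """Illustrative running cost: penalize lateral distance from the
    track center line, deviation from the desired forward velocity,
    and quadratic control effort (same weight for steering and
    throttle, as in the paper's setup). state = [distance from track
    center, forward velocity]; control = [steering, throttle].
    All weights here are made-up placeholders."""
    dist_from_center, v_x = state
    state_cost = w_track * dist_from_center ** 2 + w_vel * (v_x - v_des) ** 2
    control_cost = w_ctrl * float(control @ control)
    return float(state_cost + control_cost)

# On the center line, at the desired speed, with zero control: zero cost.
cost = running_cost(np.array([0.0, 5.0]), np.zeros(2))
```

The DDP expert minimizes the sum of such a cost over its planning horizon, using its second-order approximations along the nominal trajectory.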
The expert drove around the oval track seen in Fig. 1 for 100 laps in one direction to gather data for each learner. As it drove, the GPS data and truncated 64x128x3 RGB images from the left and right cameras were saved in order to train each of the learner models described in Section III. For training of the Dropout VGG 16 Net, we did not use any data augmentation techniques (random flips, rotations, contrast, brightness, saturation, jitter, etc.), but we truncated and cropped the original 4K images to reduce their size to 64x128x3. All of our models were trained in batch with TensorFlow [18] using the Adam optimizer [19] and the heteroscedastic loss in Eq. 4.
All computation was executed on board the vehicle with our NVIDIA GeForce GTX 1060 GPU, and we were able to obtain 10 Monte Carlo samples ($N = 10$) of the ensemble in real time (20 Hz). We injected artificial noise into each sensor, similar to a real-world situation in which a sensor malfunctions. The position noise was sampled from a uniform distribution so that the "new position" appeared to be outside of the track; this mimics the commonly seen failure in which GPS data jumps from one location to another. For images, rows of gray bars were added to simulate periodic noise caused by electro-mechanical interference during the image capturing process (Fig. (b)).
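The two noise models can be sketched as follows. This is a hedged sketch: the track width, jump bounds, bar height, and bar period are illustrative placeholders, not the values used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt_gps(position, track_half_width=5.0):
    """Shift the position by a uniform jump large enough to land
    outside the track (bounds here are illustrative assumptions)."""
    jump = rng.uniform(2 * track_half_width, 4 * track_half_width,
                       size=position.shape)
    sign = np.where(rng.random(position.shape) < 0.5, -1.0, 1.0)
    return position + sign * jump

def add_gray_bars(image, bar_height=4, period=16, gray=128):
    """Overwrite periodic rows with gray bars, simulating the
    electro-mechanical interference pattern described above."""
    noisy = image.copy()
    for row in range(0, noisy.shape[0], period):
        noisy[row:row + bar_height] = gray
    return noisy

clean = np.zeros((64, 128, 3), dtype=np.uint8)   # placeholder camera frame
noisy = add_gray_bars(clean)
jumped = corrupt_gps(np.zeros(2))
```

Applied to the saved 64x128x3 frames and GPS states, perturbations of this kind let the injected failures be switched on and off lap by lap, as in the experiments below.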
V Experimental Results

First, we tested each BNN in the ensemble network without any artificial noise injected. Each BNN was able to drive the vehicle autonomously until the vehicle's batteries ran out, with no failures. In all experiments, we considered it a failure when the vehicle crashed into the boundaries of the track and could not move forward.
Next, each learner in the ensemble was tested individually on the vehicle with artificial noise injected. After 4 laps of normal operation, noise was added to the corresponding sensor and crashes occurred immediately, as shown in Fig. (b), Fig. (c), and Fig. (d). The test was repeated 10 times for each learner, and crashes followed promptly after noise injection every time.
Following this experiment, the Ensemble Bayesian Neural Networks algorithm was tested without noise injection. The vehicle achieved performance similar to the expert, as seen in Fig. (a), and was able to run at high speed with no crashes.
Finally, the Ensemble BNN algorithm was tested with noise injection. The time horizon for testing was set to 17 laps, and the algorithm was tested on the track 3 times for a total of 51 laps. After 4 laps of normal operation (Fig. (a)), frequent noise was added to each sensor, in the order of GPS, left camera, and right camera, for 2 laps each; normal operation resumed for another 2 laps before the next noise injection. As we can see from the normal operation case in Fig. (a), the GPS NN was used most of the time. This is because the structure of the GPS NN and the data it uses (7 states) are not as complex as the structure of the Dropout VGG 16 network and its data (RGB images). With a simpler structure and simpler data, it is reasonable for the probabilistic network to have a smaller variance after training. Moreover, without any injection of artificial noise, GPS data at test time does not vary much compared to the training data, whereas the camera images change slightly due to changing lighting conditions and the environment around the vehicle. Fig. (b) shows that when artificial GPS noise was added, the algorithm opted to use the camera inputs for navigation, as a result of an orders-of-magnitude increase in the prediction uncertainty from the fully connected GPS NN. For both the normal case and the GPS-noise-injected case, we observe that the left camera NN was used more often than the right camera NN. We believe this behavior is task-specific, as the vehicle ran the oval track counterclockwise during both data collection and testing. Since the left camera sees the left track boundary more often than the right camera does, the left camera NN is more confident about its predictions, resulting in smaller variance. Fig. (c) demonstrates a decrease in usage of the left camera input, since image noise caused the uncertainty from the corresponding network to double. Similar results can be found in Fig. (d), where image noise was added to both the left and right cameras: compared to the cases where no noise was injected (Fig. (a)) or noise was injected into a single camera (Fig. (c)), we can see the decreased usage of both cameras. For all noise injection scenarios, the noise was injected frequently, but not continuously, so a noise-injected learner could still be used intermittently when the noise was absent. The usage of each learner with sensor noise injection is listed in Table I. In all cases, the usage of a sensor decreased when artificial noise was injected into it. Even with large noise, which causes immediate task failure for an individual BNN, all laps were completed without any failure. The complete trial can be seen in the video: https://www.youtube.com/watch?v=ZArsGyR6M0g.
VI Conclusion

In this paper, we introduced an Ensemble Bayesian Neural Network structure for system redundancy in the decision making of safety-critical systems. Our algorithm was implemented on an autonomous driving task using end-to-end Imitation Learning. The prediction uncertainty of each BNN within the ensemble, capturing both model imperfection and data insufficiency, was used to switch between the different policy outputs. Experimental results verified the robustness of our proposed method against compromised sensor inputs. Our method can play an important role in any autonomous system using multiple sensors, especially in safety-critical tasks. For future work on Ensemble Bayesian decision making, we will explore smooth Bayesian mixing models in order to eliminate discrete switching between our learners. Furthermore, we would like to extend this Ensemble Bayesian approach to robust filtering and state estimation problems that use multiple sensors or networks.
The authors would like to thank Nolan Wagener for sharing the experiment platform when one of our sensors was malfunctioning.
[1] Y. Pan, C.-A. Cheng, K. Saigol, K. Lee, X. Yan, E. A. Theodorou, and B. Boots, "Agile autonomous driving using end-to-end deep imitation learning," Robotics: Science and Systems, 2018. [Online]. Available: http://www.roboticsproceedings.org/rss14/p56.pdf
[2] S. Levine, C. Finn, T. Darrell, and P. Abbeel, "End-to-end training of deep visuomotor policies," CoRR, vol. abs/1504.00702, 2015. [Online]. Available: http://arxiv.org/abs/1504.00702
[3] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, X. Zhang, J. Zhao, and K. Zieba, "End to End Learning for Self-Driving Cars," arXiv preprint, Apr. 2016. [Online]. Available: https://arxiv.org/abs/1604.07316
[4] J. Su, D. V. Vargas, and K. Sakurai, "One pixel attack for fooling deep neural networks," CoRR, vol. abs/1710.08864, 2017. [Online]. Available: http://arxiv.org/abs/1710.08864
[5] S. Huang, N. Papernot, I. Goodfellow, Y. Duan, and P. Abbeel, "Adversarial attacks on neural network policies," arXiv preprint, 2017. [Online]. Available: https://arxiv.org/abs/1702.02284
[6] A. Nguyen, J. Yosinski, and J. Clune, "Deep neural networks are easily fooled: High confidence predictions for unrecognizable images," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[7] I. Mordatch, K. Lowrey, and E. Todorov, "Ensemble-CIO: Full-body dynamic motion planning that transfers to physical humanoids," in Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, 2015. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/7354126/
[8] B. Lakshminarayanan, A. Pritzel, and C. Blundell, "Simple and scalable predictive uncertainty estimation using deep ensembles," in Advances in Neural Information Processing Systems 30, 2017, pp. 6402–6413. [Online]. Available: https://arxiv.org/pdf/1612.01474.pdf
[9] B. Goldfain, P. Drews, C. You, M. Barulic, O. Velev, P. Tsiotras, and J. M. Rehg, "AutoRally: An open platform for aggressive autonomous driving," arXiv preprint, 2018. [Online]. Available: https://arxiv.org/pdf/1806.00678.pdf
[10] A. Kendall and Y. Gal, "What uncertainties do we need in Bayesian deep learning for computer vision?" arXiv preprint, 2017. [Online]. Available: http://arxiv.org/abs/1703.04977
[11] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra, "Weight uncertainty in neural network," in Proceedings of the 32nd International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, F. Bach and D. Blei, Eds., vol. 37. Lille, France: PMLR, 07–09 Jul 2015, pp. 1613–1622. [Online]. Available: http://proceedings.mlr.press/v37/blundell15.html
[12] Y. Gal and Z. Ghahramani, "Dropout as a Bayesian approximation: Representing model uncertainty in deep learning," in Proceedings of The 33rd International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 48. New York, New York, USA: PMLR, 20–22 Jun 2016, pp. 1050–1059. [Online]. Available: http://proceedings.mlr.press/v48/gal16.html
[13] G. Stein, "Respect the Unstable," IEEE Control Systems Magazine, 2003. [Online]. Available: https://ieeexplore.ieee.org/document/1213600/
[14] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," CoRR, vol. abs/1409.1556, 2014. [Online]. Available: http://arxiv.org/abs/1409.1556
[15] K. Lee, K. Saigol, and E. Theodorou, "Safe end-to-end imitation learning for model predictive control," CoRR, vol. abs/1803.10231, 2018. [Online]. Available: http://arxiv.org/abs/1803.10231
[16] Y. Gal, J. Hron, and A. Kendall, "Concrete dropout," in Advances in Neural Information Processing Systems 30 (NIPS), 2017. [Online]. Available: https://arxiv.org/abs/1705.07832
[17] Y. Tassa, T. Erez, and W. D. Smart, "Receding horizon differential dynamic programming," in Advances in Neural Information Processing Systems, 2008, pp. 1465–1472. [Online]. Available: http://papers.nips.cc/paper/3297-receding-horizon-differential-dynamic-programming.pdf
[18] M. Abadi et al., "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015, software available from tensorflow.org. [Online]. Available: http://tensorflow.org/
[19] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," CoRR, vol. abs/1412.6980, 2014. [Online]. Available: http://arxiv.org/abs/1412.6980