For autonomous systems to operate in uncertain environments, they must be equipped with robust decision-making capabilities that draw on a variety of perceptual modalities, including vision. Recent advancements in Artificial Intelligence and Deep Learning have facilitated the development of algorithms that integrate perception and control in a holistic fashion. The resulting perceptual control policies offer unique capabilities with respect to generalization, representation, and performance in tasks such as vision-based navigation.
Prior work on vision-based navigation relies on image classification through object detection and segmentation. Shelhamer et al., Ren et al., and He et al. introduce ways to improve performance in instance segmentation, including object detection and semantic segmentation. The Region Proposal Network (RPN) determines the Regions of Interest (ROIs) in the image, and for each ROI the network determines the class label of the object by ROI-pooling. Because of the bounding box refinement step in the RPN, the ROI boxes have different sizes. Classifiers do not handle variable input sizes well because they usually require a fixed input size; this is where ROI-pooling comes into play. He et al. introduced the ROIAlign technique to solve the quantized stride problem of ROI-pooling and showed excellent results in instance segmentation. ROIAlign preserves the spatial orientation of features without loss of data.
An alternative methodology for vision-based navigation is Imitation Learning (IL), also referred to as learning from demonstration. In the IL framework, the learning algorithm has access to an expert policy that it can take advantage of. This expert policy can come, for instance, from human demonstrations or from a Model Predictive Control (MPC) controller. For an autonomous driving task, Bojarski et al. proposed an approach for learning to drive a full-size car autonomously directly from vision data. Moreover, Pan et al. demonstrated an online approach to end-to-end IL using DAgger for a high-speed autonomous driving task.
The tremendous success of these methods, however, does not diminish the importance of safety, because conventional deterministic Deep Neural Network policies are fragile to adversarial attacks [9, 23, 16]. Lee et al. address the problem of incorporating safety by using a Bayesian approximation approach to quantify uncertainty in the output of the network control policy. By using a Bayesian neural network, the authors were able to pass control authority to an optimal controller when the network outputs a large variance/uncertainty. One of the major limitations of this approach is that the uncertainty at the output of the Bayesian neural network increases only when the vehicle is very close to the object. As a result, the time horizon within which control authority is passed to an expert or fully observable predictive controller is small. This characteristic may result in abrupt changes in the controls or aggressive maneuvers, which in turn may result in instabilities and ultimately accidents.
In this work, we view perceptual control policies as Information Processing Architectures (IPAs) and propose a novel architecture. The proposed IPA supports a Perceptual Attention-based Predictive Control (PAPC) algorithm that is capable of detecting objects at far distances while performing control using vision. Our approach has the following ingredients: (1) it uses IL to learn a perceptual controller; (2) it builds upon the Bayesian approach; and (3) it incorporates a novel attention mechanism that robustifies the detection of new objects even when these objects are located far from the vehicle. PAPC takes advantage of an MPC expert by using the future state trajectories to determine ROIs on the input image. In PAPC, decision-making and perception are tightly coupled, since predicted state trajectories force perception to focus on the areas of the input visual information that are relevant to the future motion of the vehicle. This attention mechanism enables early detection of unseen situations, such as a new obstacle appearing in the driving lane. When such situations arise, the network policy concedes control to a safe policy or expert, such as a fully observable MPC controller.
In summary, the contributions of this work are provided as follows:
We introduce the Model Prediction-Network (MP-Net) for learning trajectories represented as splines. The MP-Net is trained using input-output pairs that consist of images and state trajectories generated by MPC.
We use MP-Net to determine ROIs in the input visual information. These ROIs are areas smaller than the initial image, with variable resolution determined by the MP-Net predicted trajectory.
We introduce the Macula-Network (Macula-Net), a 3D Convolutional Neural Network (CNN). The Macula-Net uses as input the aforementioned ROIs and generates the corresponding controls as well as estimates of aleatoric and epistemic uncertainty. The Macula-Net is trained in a Bayesian fashion using input-output pairs consisting of ROIs and corresponding control commands generated by MPC.
We integrate all the aforementioned blocks and create the PAPC algorithm. PAPC detects novel objects at far distances from the vehicle while navigating it in an off-road environment. Detection of these objects is performed without any image classification or object detection.
The PAPC algorithm is tested and compared against prior state-of-the-art solutions. Experiments are performed in simulation as well as on real hardware and demonstrate the benefits of PAPC over these solutions.
The remainder of the paper is organized as follows: In Section II, we briefly review some preliminaries used in our work. In Section III, we introduce the Model Prediction Network, which predicts the future location of the vehicle in pixel coordinates used to construct ROI windows. In Section IV, we introduce the Macula-Net, which processes the ROIs and outputs a control mean and variance. We also detail our Perceptual Attention-based Predictive Control (PAPC) algorithm. Section V details simulation and real hardware experiments with analysis and comparisons of the proposed methods. Finally, we conclude and discuss future directions in Section VI and Section VII.
In this section, we provide the building blocks of the proposed Information Processing Architecture (IPA) for perceptual control. The aforementioned building blocks are MPC, Imitation Learning (IL), Bayesian Neural Networks, and B-Splines for trajectory representation.
II-A Model Predictive Optimal Control
MPC-based optimal controllers (e.g. iterative Linear Quadratic Gaussian/Model Predictive Control Differential Dynamic Programming (iLQG/MPC-DDP), MPPI) provide planned control trajectories, given an initial state and a cost function, by solving the optimal control problem. An optimal control problem whose objective is to minimize a task-specific cost function can be formulated as follows:
$$V(x(t_0), t_0) = \min_{u(t)} \left[ \phi(x(t_f), t_f) + \int_{t_0}^{t_f} L(x(t), u(t), t)\, dt \right]$$
subject to dynamics $\dot{x}(t) = F(x(t), u(t), t)$, where $x$ represents the system states, $u$ represents the control, $\phi$ is the state cost at the final time $t_f$, $L$ is the running cost, and $V$ is the value function. By solving this optimization problem, we get the future optimal state trajectories from the optimal control trajectories. In this paper, we take advantage of the optimal state and control trajectories provided by MPC to train perceptual control policies and design an attention mechanism. As will be explained later, the state trajectories are used to train CNNs to predict ROIs from raw images, while the controls are used to train another CNN to predict control using the aforementioned ROIs as input.
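To make the objective concrete, the following minimal Python sketch evaluates a discretized version of this cost (running cost accumulated over the horizon plus a terminal cost) for one candidate control sequence on a scalar system. It is not an MPC solver such as iLQG/MPC-DDP or MPPI; the function name `rollout_cost`, the forward-Euler discretization, and the toy dynamics and costs are illustrative assumptions.

```python
def rollout_cost(x0, controls, dynamics, running_cost, terminal_cost, dt):
    """Simulate a scalar system with forward Euler and accumulate the cost.

    Returns the total cost and the visited state trajectory.
    """
    x = x0
    total = 0.0
    states = [x0]
    for u in controls:
        total += running_cost(x, u) * dt   # approximates the integral of L dt
        x = x + dynamics(x, u) * dt        # forward-Euler step of x_dot = F(x, u)
        states.append(x)
    total += terminal_cost(x)              # state cost at the final time
    return total, states


# Toy problem (assumed for illustration): x_dot = u, L = x^2 + u^2, phi = x^2.
cost, trajectory = rollout_cost(
    x0=1.0,
    controls=[-1.0, -1.0],
    dynamics=lambda x, u: u,
    running_cost=lambda x, u: x * x + u * u,
    terminal_cost=lambda x: x * x,
    dt=0.5,
)
```

An MPC controller would search over `controls` to minimize this quantity and then re-plan at the next timestep; the resulting `trajectory` plays the role of the planned state trajectory used in the rest of the paper.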
II-B End-to-End Control via Imitation Learning
IL is one way to learn how to do a specific task by imitating a teacher’s or an expert’s control policy. In IL settings, it is usually assumed that the expert is perfect and always makes optimal decisions. The IL framework allows us to do end-to-end control since it bypasses all the burdensome steps in navigation (perception, filtering, localization, path planning, etc.) and directly applies control only with given observations.
The goal of IL is to learn a policy that minimizes the difference between the task-specific cost incurred by the expert’s policy and that incurred by the learner. To achieve this goal, the learned policy should converge to the expert’s policy. For deep-learning-based IL, one loss function that can be used to train the learner network for this regression problem is the mean squared error between the network’s predictions and the ground truth control actions labeled by an expert.
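As a minimal illustration of this regression loss, the sketch below computes the mean squared error between a batch of predicted controls and the expert's controls. The function name `mse_loss` and the sample values are assumptions made for illustration only.

```python
def mse_loss(predicted, expert):
    """Mean squared error between learner predictions and expert labels."""
    if len(predicted) != len(expert):
        raise ValueError("predicted and expert must have the same length")
    return sum((p - e) ** 2 for p, e in zip(predicted, expert)) / len(predicted)


# Hypothetical steering commands: learner output vs. expert (e.g. MPC) labels.
loss = mse_loss([0.1, -0.2], [0.0, 0.0])
```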
II-C Bayesian Neural Networks
Assessing how far a DNN's input is from its training set is an essential capability for deploying DNNs in safety-critical applications. To incorporate this capability into perceptual end-to-end control policies, we use Bayesian neural networks. Currently, in the field of Bayesian neural networks, Bayes by backpropagation and Monte Carlo (MC) dropout methods are widely used in a broad range of applications. The MC-dropout method uses the dropout technique to build a probability distribution over network weights, which allows us to obtain the distribution of the prediction at test time. The trained Bayesian network will output a large variance if the input distribution at test time differs substantially from the training input distribution. Kendall and Gal introduced the heteroscedastic loss function, which provides two different notions of uncertainty, aleatoric and epistemic. Aleatoric uncertainty comes from noise inherent in the observations, i.e. incomplete knowledge of the environment, whereas epistemic uncertainty comes from incomplete training data. In this work, we use this heteroscedastic loss function to train our Bayesian network and use the network’s output variance for the early detection of novel inputs.
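The two notions of uncertainty can be estimated from repeated stochastic forward passes. The sketch below follows the approximation commonly used with MC dropout (epistemic uncertainty as the variance of the sampled means, aleatoric uncertainty as the average of the predicted variances). The function name `decompose_uncertainty` and the sample values are assumptions; in practice each pair would come from one dropout-enabled forward pass of the network.

```python
def decompose_uncertainty(samples):
    """Split predictive uncertainty into epistemic and aleatoric parts.

    samples: list of (predicted_mean, predicted_variance) pairs, one per
    stochastic forward pass with dropout kept active at test time.
    """
    t = len(samples)
    means = [m for m, _ in samples]
    mean = sum(means) / t
    # Spread of the sampled means: grows for inputs unlike the training data.
    epistemic = sum(m * m for m in means) / t - mean * mean
    # Average predicted noise level: reflects noise in the observations.
    aleatoric = sum(v for _, v in samples) / t
    return mean, epistemic, aleatoric


# Two hypothetical passes that disagree strongly on the mean.
mean, epistemic, aleatoric = decompose_uncertainty([(0.0, 0.1), (2.0, 0.3)])
```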
The B-spline is a collection of Bezier splines that are defined by a set of knot coordinates around which each spline is centered. This set of splines has the following continuity requirements: i) the end of the previous curve must have the same value as the start of the next; ii) the first and second derivatives must be conserved at the intersecting points. Therefore, a high-degree B-spline can smoothly approximate a curve. The equation for a k-degree B-spline is formulated below:
$$S(x) = \sum_{i=0}^{n} c_i B_{i,k}(x)$$
where $c_i$ ($i = 0, \dots, n$) are control points and $B_{i,k}$ are the basis functions defined using the recursive Cox-de Boor formula. In this work, the B-spline coefficients were used to train the Model Prediction Network described in the next section.
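The Cox-de Boor recursion can be sketched in a few lines of Python. This is a plain, unoptimized illustration rather than the implementation used in the experiments; the function names, the clamped knot vector, and the coefficient values are assumptions.

```python
def bspline_basis(i, k, t, knots):
    """Value of the i-th degree-k B-spline basis function at t (Cox-de Boor)."""
    if k == 0:
        # Degree 0: indicator of the half-open knot span [knots[i], knots[i+1]).
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    left = 0.0
    left_den = knots[i + k] - knots[i]
    if left_den > 0.0:  # skip terms with repeated knots (0/0 treated as 0)
        left = (t - knots[i]) / left_den * bspline_basis(i, k - 1, t, knots)
    right = 0.0
    right_den = knots[i + k + 1] - knots[i + 1]
    if right_den > 0.0:
        right = (knots[i + k + 1] - t) / right_den * bspline_basis(i + 1, k - 1, t, knots)
    return left + right


def bspline_eval(t, coeffs, k, knots):
    """Evaluate the spline as the coefficient-weighted sum of basis functions."""
    if len(knots) != len(coeffs) + k + 1:
        raise ValueError("need len(knots) == len(coeffs) + k + 1")
    return sum(c * bspline_basis(i, k, t, knots) for i, c in enumerate(coeffs))


# Quadratic spline with a clamped knot vector (illustrative values).
knots = [0.0, 0.0, 0.0, 1.0, 2.0, 2.0, 2.0]
value = bspline_eval(0.5, [3.0, 3.0, 3.0, 3.0], 2, knots)
```

Because the basis functions sum to one inside the knot range, constant coefficients reproduce a constant curve, which gives a quick sanity check. With the half-open spans used here, the right end of the knot range itself is excluded.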
III Model Prediction Network
The MPC-based controllers provide the future state (e.g. position, velocity) trajectories. Inspired by MPC, we introduce a network which predicts a robot’s future positions in the image space. Based on the predicted future trajectories, we find ROIs that the robot can focus on.
In our attention mechanism, we find the ROIs on the image by mapping the MPC’s state trajectory from the original state space to a corresponding trajectory in pixel coordinates, as seen in Fig. 1. This allows us to “see” where in the image the car will be at future timesteps. We then pick a specific number of focal points (Fig. 1 (D)) along the obtained pixel trajectory, which we use to create the ROI windows. This attention mechanism allows our safe policy to be sensitive even to unseen obstacles that are far away and small in its image view, as long as they are in the way of its trajectory.
The mapping of the state trajectory from state space to pixel coordinates is implicitly learned with a deep convolutional neural network (CNN) that we refer to as the Model Prediction Network (MP-Net). MP-Net has a network structure similar to VGG16, a widely used deep convolutional neural network structure that learns the mapping between input images and the corresponding class labels. Since MP-Net addresses a regression problem, we train the network with the mean squared error loss. This network takes an image as input and outputs a trajectory in pixel coordinates. In Section III-A, we describe how we obtain the targets to train this network, and in Section III-B, we detail how the MP-Net is trained on those targets.
III-A Targets for MP-Net using Coordinate Transformation
As seen in Fig. 1, MP-Net needs to project the vehicle’s future state trajectory, described in world coordinates, onto a 2D image in a moving frame of reference. This coordinate transformation technique is widely used in 3D computer graphics. The coordinate transformation consists of 4 steps: world to robot, robot to camera, camera to film, and film to pixel coordinates.
In this work, we follow the convention in the computer graphics community and set the Z (optic)-axis as the vehicle’s longitudinal (roll) axis, the Y-axis as the axis normal to the road, the positive direction being upwards, and the X-axis as the axis perpendicular on the vehicle’s longitudinal axis, the positive direction pointing to the left side of vehicle.
Let us define roll, pitch, yaw angles as and the camera (vehicle) position in the world coordinates. The camera focal length is defined as .
Then, we construct the rotation matrices around the U, V, W-axis , the translation matrix , the robot-to-camera coordinate transformation matrix and the projection matrix as:
where the projection matrix projects the point in the camera coordinates into the film coordinates using the perspective projection equations  and the offsets and transform the film coordinates to the pixel coordinates by shifting the origin.
The total rotation matrix is computed as
and the matrix , transforming the world coordinates to the robot coordinates by translation and rotation, is calculated as
Then, after converting the X, Y, Z-axes to follow the convention in the computer vision community, the projection matrix converts the camera coordinates to the pixel coordinates. Finally, we get the matrix, which transforms the world coordinates to the pixel coordinates:
To obtain the vehicle (camera) position in the pixel coordinates (u,v):
However, this coordinate-transformed point in the pixel coordinates has the origin at the top left corner of the image. In our work, as we deal with the state trajectory of the vehicle, we define the new origin at the bottom center of the image , where and represents the height and width of the image, and rotate the axes by switching and . Finally, we subtract from and get the final :
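The chain of transformations (world to robot, robot to camera, perspective projection, pixel offset) can be illustrated with a simplified planar sketch. This is not the paper's exact matrix formulation: it assumes a flat ground plane, ignores roll and pitch, uses a pinhole camera whose frame has x to the right, y downward, and z forward, and leaves the origin at the top left corner of the image. The function name and all parameter values are hypothetical.

```python
import math


def world_to_pixel(point, cam_xy, yaw, cam_height, focal, cx, cy):
    """Project a ground point (world x, y) to pixel coordinates (u, v).

    Returns None when the point lies behind the camera.
    """
    # World -> robot: translate by the camera position, rotate by -yaw.
    dx = point[0] - cam_xy[0]
    dy = point[1] - cam_xy[1]
    forward = math.cos(yaw) * dx + math.sin(yaw) * dy
    left = -math.sin(yaw) * dx + math.cos(yaw) * dy
    if forward <= 0.0:
        return None
    # Robot -> camera (assumed convention: x right, y down, z forward).
    x_cam = -left
    y_cam = cam_height      # the ground lies cam_height below the camera
    z_cam = forward
    # Camera -> film via perspective projection, then film -> pixel via offsets.
    u = focal * x_cam / z_cam + cx
    v = focal * y_cam / z_cam + cy
    return u, v


# A point 10 m straight ahead of a camera mounted 1 m above the ground.
ahead = world_to_pixel((10.0, 0.0), (0.0, 0.0), 0.0, 1.0, 100.0, 64.0, 32.0)
```

Applying this function to every state along a planned trajectory yields the pixel trajectory from which focal points are picked; points farther ahead move toward the image row `cy`, which is why distant obstacles occupy only a few pixels.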
III-B Training MP-Net
Instead of training the MP-Net to predict the entire trajectory in the pixel coordinates, we train it to learn the spline coefficients of the trajectory. This is possible because the MPC trajectories are simple and smooth enough to be represented with splines. This greatly simplifies the regression problem, without jeopardizing performance. To train for spline coefficients, we first fit a spline through the pixel trajectory (obtained as detailed in the previous section) and we regress on the spline coefficients. From predicted spline coefficients, we reconstruct a spline and sample a fixed number of focal points to create ROIs.
Another way to get the focal points is directly regressing them in pixel space. However, this is not flexible to changes in the number of focal points, as the network would have to be re-trained for a different number of points. Our spline-learning approach allows us to generate any number of focal points, which proved to be very useful during experimentation.
We compared the prediction error of the spline-learning and the direct focal points learning methods. For a fair comparison, we used the same CNN architecture for both methods. For spline-learning, we trained the network to predict the eight B-spline coefficients, and for the direct focal points learning, we trained the same model to predict the four focal points in pixel coordinates. Note that although the spline-learning method outputs spline coefficients, we can still evaluate the spline at specific locations to obtain the focal points. Our experiments showed that the spline-learning method required far fewer training epochs and clearly outperformed the direct focal points learning approach, which was trained with 100 times more training epochs. The average testing error (MSE) in pixel space was 0.4 for the spline-learning method and 25.2 for the direct focal points learning method. We argue that, even with the same number of output values, the spline coefficients carry much more information than pixel position values do.
IV Perceptual Attention-based Predictive Control
As described in Fig. 2, once we obtain the focal points with the MP-Net, we construct ROI windows around these focal points and feed them into the Macula-Net, named after the central part of the eye’s retina, where vision is clearest and resolution is highest. The Macula-Net takes these ROIs and outputs a control mean and variance via the Bayesian dropout method [13, 6]. For the architecture of the Macula-Net (Fig. 3), we adopted a 3D version of VGG16, even though researchers have recently developed smaller network structures with better accuracy. We chose it because of the simplicity of the VGG structure, on top of which we can directly apply the Concrete Dropout method. Even though the network is not small compared to other structures, we are still able to get a good approximation of the Bayesian network via Monte Carlo sampling with around 25 samples in real time (20 Hz).
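The Macula-Net is trained using the heteroscedastic loss function to produce a distribution as its output. This loss function is defined as
$$\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \left[ \frac{1}{2\hat{\sigma}_i^{2}} \lVert y_i - \hat{y}_i \rVert^{2} + \frac{1}{2} \log \hat{\sigma}_i^{2} \right]$$
where $y_i$ is the target data for the $i$-th of $N$ samples, and $\hat{y}_i$ and $\hat{\sigma}_i^{2}$ represent the distribution of the prediction.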
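A minimal sketch of a heteroscedastic regression loss is given below, assuming the standard form in which each sample contributes half the squared error scaled by the inverse predicted variance plus half the log of that variance. The network is assumed to output the log variance $s = \log \hat{\sigma}^2$, a common choice for numerical stability; the function name and the sample values are illustrative.

```python
import math


def heteroscedastic_loss(targets, pred_means, pred_log_vars):
    """Average of 0.5 * exp(-s) * (y - mu)^2 + 0.5 * s over the batch."""
    total = 0.0
    for y, mu, s in zip(targets, pred_means, pred_log_vars):
        # A large predicted variance down-weights the squared error but is
        # penalized by the log-variance term.
        total += 0.5 * math.exp(-s) * (y - mu) ** 2 + 0.5 * s
    return total / len(targets)


# For a poor prediction (error of 3), admitting more variance lowers the loss.
confident = heteroscedastic_loss([3.0], [0.0], [0.0])
uncertain = heteroscedastic_loss([3.0], [0.0], [2.0])
```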
The ROIs are constructed as follows:
Define the fovea focal point as the farthest focal point along the spline.
Construct the smallest ROI, referred to as fovea, as a window of size 32x32 with as its center.
For each of the other focal points , construct an ROI with center and a window size of (), where
In this way, each ROI can cover the corresponding focal point and the fovea with some margin, defined manually.
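The construction can be sketched as follows. The exact center and window-size formulas are not reproduced here, so the rule below is a hypothetical stand-in: each ROI is centered at the midpoint between its focal point and the fovea focal point, and its side length is chosen so that it covers both the focal point and the whole fovea window plus a manual margin. The function name `build_rois` and the default margin are assumptions.

```python
def build_rois(focal_points, fovea_size=32, margin=8):
    """Return square ROIs as (center_u, center_v, side_length) tuples.

    focal_points: list of (u, v) pixels ordered from nearest to farthest along
    the predicted trajectory; the last point is used as the fovea focal point.
    """
    if not focal_points:
        raise ValueError("at least one focal point is required")
    fovea_u, fovea_v = focal_points[-1]
    rois = [(fovea_u, fovea_v, fovea_size)]          # smallest ROI: the fovea
    for u, v in focal_points[:-1]:
        center_u = (u + fovea_u) / 2.0               # assumed rule: midpoint
        center_v = (v + fovea_v) / 2.0
        span = max(abs(u - fovea_u), abs(v - fovea_v))
        # Large enough to contain the focal point and the full fovea window.
        rois.append((center_u, center_v, span + fovea_size + 2 * margin))
    return rois


rois = build_rois([(64, 60), (64, 20)])
```

Under this rule, focal points closer to the vehicle produce larger windows, which are later downsampled to the fovea size.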
As we can see in (D) of Fig. 2, unimportant features (e.g. buildings, sky, trees, etc.) have been removed from the input image. This is one of the greatest advantages, wherein the network focuses on the important/task-related regions of the image, while also eliminating irrelevant parts of the image.
We resize all ROIs into the same size as the smallest ROI, which is constructed from the farthest focal point generated by the MP-Net. By this step, with 4 ROIs, the concatenated 3D image has a size (4, 32, 32, 3), two times smaller than the original input (64, 128, 3).
The resizing step is inspired by the Glimpse Sensor , where multiple resolution patches were used to improve classification performance. Unlike the Glimpse Sensor, we do not use a simple fully connected layer after the concatenated multi-resolution 2D images. Rather, we process 3D convolutions to extract attention-based 3D information among the stacked images. Through the resizing step, the smallest ROI from farthest focal point maintains its resolution, but the other ROIs downsample to the fixed size. As a result, bigger ROIs get lower resolution due to the downsampling and this is where the network resembles the parafovea/perifovea area of the macula. When we see something, we can focus on a specific region clearly with a high resolution (fovea) but other regions on its outside are blurred out, thus having a lower resolution (parafovea/perifovea). We can think about this resizing step as putting more weights to the smallest and the most important ROI.
We stacked images in 3D, resulting in another dimension, z, but the number of stacked images is not that big, so we do not want this dimension to be reduced and lose information by a pooling layer. Therefore the 3D max-pooling layers in the network act like 2D max-pooling layers because they do not pool the z-dimension. For 3D filters, we used (3, 3, 3) kernels.
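The effect of pooling only the spatial dimensions can be seen in a small pure-Python sketch. It applies max-pooling with a window of (1, 2, 2) to a single-channel volume indexed as [z][y][x]; the function name and the toy volume are assumptions, and a deep learning framework's 3D pooling layer with an equivalent pool size would be used in practice.

```python
def max_pool_3d(volume, pool=(1, 2, 2)):
    """Non-overlapping max-pooling over a nested-list volume [z][y][x]."""
    pz, py, px = pool
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    pooled = []
    for z in range(0, depth - pz + 1, pz):
        plane = []
        for y in range(0, height - py + 1, py):
            row = []
            for x in range(0, width - px + 1, px):
                row.append(max(
                    volume[z + dz][y + dy][x + dx]
                    for dz in range(pz) for dy in range(py) for dx in range(px)
                ))
            plane.append(row)
        pooled.append(plane)
    return pooled


# Toy 4 x 4 x 4 volume: 4 stacked ROIs, each 4 x 4 pixels.
volume = [[[z * 100 + y * 10 + x for x in range(4)] for y in range(4)] for z in range(4)]
pooled = max_pool_3d(volume)   # z stays at 4, the spatial size halves to 2 x 2
```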
V-A End-to-End Autonomous Driving with Anomaly Detection
First, we collect data from our model predictive controller, driving a 1/5 scale vehicle around an oval track for 100 laps. Then, we produce one more piece of data, the stacked 3D image at every timestep. This is generated from the coordinate transformation of the MPC’s trajectories and the B-spline method as described in Section III-A. We train the MP-Net and the Macula-Net separately with the same set of data but different loss functions. The MP-Net requires pairs of the original image and the spline coefficients of the planned path, and the Macula-Net needs pairs of 3D stacked images and control actions. The control action in our experiments is the vehicle steering command which is a continuous real number between .
We observed in the experiments that, of the two uncertainties we get from our Bayesian network trained with the heteroscedastic loss function, the epistemic uncertainty changed drastically in reaction to novel observations, whereas the aleatoric uncertainty did not change much. This is reasonable because the epistemic uncertainty is the one coming from the lack of data, and it yields a large variance given novel input data. Therefore, we used this epistemic uncertainty as the network’s output variance, the safety signal.
We tested our algorithm in a simulated autonomous driving environment in ROS as well as on our real hardware. We conducted 100 test runs with 5 different obstacles in ROS to evaluate PAPC’s performance compared to the state of the art. In the real hardware experiments, we conducted 10 trials per obstacle and per network for comparison. The experiments were run on an NVIDIA GeForce GTX 1050 Ti GPU for the real hardware experiments and a 1060 GPU for the simulation experiments.
optimizer in TensorFlow. We used Concrete Dropout to find the optimal dropout probability for each layer in our Bayesian network. After every convolutional and fully connected layer, we performed batch normalization to speed up training. No data aggregation was involved except for the 3D stacking, and all models were trained in batch.
We set a threshold for the output variance signal to 3-10 times larger than the maximum value of the usual variance in the normal case, depending on the number of samples we choose for the MC-dropout. The usual output variance in the normal situation without any novel object in the scene was between and , depending on the number of samples as well. We used this safety threshold for an emergency stop of the autonomous vehicle. If the output variance is larger than the threshold, the vehicle will be stopped.
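The thresholding logic can be sketched as a small gate on the variance of the MC-dropout control samples. The function name, the returned labels, and the numeric values are assumptions; the factor of 5 is just one choice inside the 3-10 range described above.

```python
import statistics


def select_controller(control_samples, normal_max_variance, factor=5.0):
    """Decide whether the network keeps control or the vehicle is stopped.

    control_samples: steering outputs from repeated MC-dropout forward passes.
    normal_max_variance: largest variance observed without novel objects.
    """
    variance = statistics.pvariance(control_samples)
    threshold = factor * normal_max_variance
    if variance > threshold:
        return "stop", variance        # novel input suspected: emergency stop
    return "network", variance         # variance within the normal range


# Hypothetical samples: consistent outputs vs. strongly disagreeing outputs.
calm = select_controller([0.1, 0.1, 0.1], normal_max_variance=0.01)
alarm = select_controller([-0.5, 0.5], normal_max_variance=0.01)
```

The same gate could hand control to an expert such as an MPC controller instead of stopping the vehicle.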
The MP-Net takes as input the full view of the image and is trained to predict the planned path by imitating our model predictive controller. Through experiments, we found that the VGG-like network learns to map important and relevant features to the target output, which is the planned trajectory in MP-Net. As we can see in the activated feature maps at each layer, the important features extracted were the lines of the track. In other words, the VGG-like MP-Net is trained to map the track part of the image to the corresponding output (the future path a few seconds ahead). As the MP-Net gets the full view of the image as input, a new obstacle does not affect or fool the network output most of the time, until the obstacle dominates the track in the image. We tested placing different kinds of objects on the track and removing some features like trees or buildings (in simulation), but neither could fool the network because the most important features, the track boundaries, still existed in the image. Therefore, we can trust the MP-Net to predict the correct trajectories.
From the cropped, re-sized, and stacked 3D image data, the model is trained to drive the vehicle autonomously in an end-to-end fashion. Our policy not only performs the original task, but also it is able to quantify uncertainty in its control policy when novel obstacles are put in the track, even far-off from the vehicle.
We place a novel obstacle (Table I), which was never seen in the training data, on the vehicle’s path. Lee et al. showed in their previous work an increased variance signal from the Bayesian neural network when a vehicle saw a novel object on the road. However, as they mentioned in their work, the increase of the variance signal was gradual and not fast enough to take action to avoid new obstacles at high speed. Here, with our PAPC algorithm, we show faster anomaly detection, which gives the vehicle enough time to avoid the obstacle without any collision (Fig. 8).
|New Objects||DropoutVGG ||PAPC [Ours]|
Using only the smallest ROI around the furthest focal point, it is hard for a network to learn a task since the smallest ROI does not contain enough information related to the task. However, even with a low resolution, we have bigger ROIs which have some information related to given tasks. With a combination of this smallest ROI and the largest ROI, the model can learn the task as well as focus on important regions of the image to pay attention to and report safety threats.
We can see that, through the PAPC algorithm, the input image data to the Macula-Net excludes unimportant features (e.g. buildings, trees, sky, etc.) and only focuses on the part where the MPC guides the network to focus.
To stop the high-speed vehicle before a collision or to avoid an obstacle, 1-2 seconds (5-10 m) are required to take proper action for a vehicle driving at around 5 m/s. Within 1-2 seconds, a human expert or a model predictive controller can take control of the vehicle and avoid the obstacle, since model predictive controllers perform optimal path planning and control within 20-50 ms.
We analyze our method by plotting the averaged feature maps after the first two max-pooling layers for each ROI (Fig. 6). ROI 1 (fovea) is activated by the features of a new obstacle more than by the track boundary. Interestingly, however, as the ROI becomes larger and more downsampled, it is activated more by the track boundary. From this combination of multi-resolution 3D inputs, the PAPC algorithm is able to detect a new object while focusing on the task-related features in a normal situation, where abnormal objects do not appear. We can also see in Fig. 6 that the deeper layer (closer to the output layer) tends to focus more on a single feature. For example, in ROI 2, after the first max-pooling layer, the neurons are activated almost equally by both the new object and the track boundary. However, after two more convolutional layers and a max-pooling layer, the neurons are activated only by a single feature, the track boundary. This pattern is also seen in the other ROIs.
In Fig. 6, we can also see that the resolution of the ROIs becomes lower as the ROI becomes larger since it needs more downsampling. From this downsampling, we can observe the loss of information on the bigger ROIs, where the brightness has been changed.
V-B Failure Cases
Our network sometimes fails, depending on the object size or color. PAPC was not able to detect a can in the simulation environment and a detergent container in the real world (Fig. 9). We argue that their distributions were not different enough from the training data, even though the fovea ROI caught the object correctly. Moreover, by the time the vehicle gets closer to the objects, the fovea ROI has already passed them, so no strong attention remains on them. We believe that such smaller objects, or those having a distribution similar to the obstacle-free training data, can be detected by increasing the number of focal points and ROIs, with the fovea having a smaller window. Having more ROIs, however, will require faster GPUs and smaller network structures to run the network in real time.
All of the simulation and real world experiments can be found in the video: https://youtu.be/-Zmi0HCvM9I.
V-C Data Distribution Visualization with t-SNE
Because the Macula-Network uses the Bayesian dropout approach to determine whether an ROI contains a new obstacle, it is useful to analyze how the distributions of the ROIs with and without obstacles differ from one another. We use the t-SNE technique for dimensionality reduction in order to visualize the high-dimensional ROI images that the Macula-Network takes as input. We run t-SNE with perplexity values around 60 on simulated and real-world ROI data. In addition, we run t-SNE not only on the ROI images but also on the output of the first fully-connected layer in the Macula-Net. Using a middle layer allows us to visualize how the Macula-Net itself interprets the ROI images. In Fig. 7, we show that the Macula-Net features of the images with obstacles lie in a noticeably different distribution (colorful cluster) than those of the images with no obstacles. This shows that the Macula-Net was able to capture a change in the distribution of the ROIs when an obstacle is put on the track. Because these image samples with obstacles lie in a different distribution from the training data, the Bayesian network inside the Macula-Net will output control values with high variance, thus indicating to the policy that it is no longer safe to drive. In this manner, the algorithm will be able to tell when to concede control to a safer controller.
For our future work, we would like to explore smaller network structures instead of the VGG-based network. We were able to run our robot in real time (15-20 Hz), but we could only draw around 25 samples from the Monte Carlo dropout. We also tested with more samples at a lower control frequency and saw a less noisy anomaly detection signal from the output variance of our Bayesian network.
While PAPC is able to detect novel objects and increase the predicted output uncertainty, the detection lasts only briefly. This is because the detection depends on the time interval during which the ROI with the highest resolution, at the tip of the MP-Net predicted trajectory, overlaps with the new object. In future work, we plan to combine tracking mechanisms with PAPC so as to maintain an increased predicted uncertainty from the Macula-Net for as long as the new object is in the field of view of the vehicle.
We would like to emphasize here that the proposed IPA, namely the PAPC architecture, can be used in any autonomous system that performs navigation using vision sensors for a safe path planning and control (e.g. visuomotor for manipulation ). In addition, while our initial goal is to use the PAPC architecture as the main system for navigation, its operational role can also be as a secondary safety controller performing anomaly detection on the observation side, while the main controller is running. PAPC can raise an emergency flag when it sees something new. This flag can be used in the decision-making module for designing a robust/adaptive controller. In addition, this anomaly detection approach can be used for improving data aggregation during learning or for intelligent exploration in unknown or partially known environments.
In this work, we view perceptual control policies as Information Processing Architectures, or IPAs for short, and propose a new architecture to support an algorithm for perceptual attention-based control. The architecture for PAPC consists primarily of two CNNs, namely the MP-Net and the Macula-Net, which is introduced for the first time in this paper. The MP-Net is trained to generate predicted state trajectories using vision. These trajectories determine ROIs in the image with variable resolution. Using the aforementioned multi-resolution ROIs as input, the Macula-Net is able to focus on the areas of the input visual information that are relevant to the task, detect novel objects, and control the vehicle under consideration. We validated our proposed attention-based deep Bayesian network both in a ROS simulator and on real hardware for a safety-aware autonomous driving task. PAPC was able to detect different obstacles quickly in comparison with the state-of-the-art approach of the end-to-end Bayesian network.
This research was supported by the Amazon Web Services Machine Learning Research Awards.
- Abadi et al.  Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems, 2015. URL http://tensorflow.org/. Software available from tensorflow.org.
- Blundell et al.  Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight Uncertainty in Neural Network. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1613–1622, Lille, France, 07–09 Jul 2015. PMLR. URL http://proceedings.mlr.press/v37/blundell15.html.
- Bojarski et al.  Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D. Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, Xin Zhang, Jake Zhao, and Karol Zieba. End to End Learning for Self-Driving Cars. Apr 2016. URL https://arxiv.org/abs/1604.07316.
- Boor  Carl De Boor. On calculating with B-splines. Journal of Approximation Theory, 6, 1970. URL https://www.sciencedirect.com/science/article/pii/0021904572900809.
- Boor  Carl De Boor. Package for calculating with B-splines. SIAM Journal on Numerical Analysis, 14:441–472, 1977. URL https://www.jstor.org/stable/2156696?seq=1#page_scan_tab_contents.
- Gal and Ghahramani  Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. In Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1050–1059, New York, New York, USA, 20–22 Jun 2016. PMLR. URL http://proceedings.mlr.press/v48/gal16.html.
- Girshick  Ross Girshick. Fast R-CNN. In IEEE International Conference on Computer Vision (ICCV), 2015. URL https://arxiv.org/abs/1504.08083.
- He et al.  Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask R-CNN. In IEEE International Conference on Computer Vision (ICCV), Oct 2017. URL http://openaccess.thecvf.com/content_ICCV_2017/papers/He_Mask_R-CNN_ICCV_2017_paper.pdf.
- Huang et al.  Sandy Huang, Nicolas Papernot, Ian Goodfellow, Yan Duan, and Pieter Abbeel. Adversarial Attacks on Neural Network Policies. arXiv, 2017. URL https://arxiv.org/abs/1702.02284.
- Ioffe and Szegedy  Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. pages 448–456, 2015. URL http://jmlr.org/proceedings/papers/v37/ioffe15.pdf.
- Kendall and Gal  Alex Kendall and Yarin Gal. What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5574–5584. Curran Associates, Inc., 2017. URL https://arxiv.org/abs/1703.04977.
- Kingma and Ba  Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. Proceedings of the 3rd International Conference on Learning Representations (ICLR), abs/1412.6980, 2014. URL http://arxiv.org/abs/1412.6980.
- Lee et al.  Keuntaek Lee, Kamil Saigol, and Evangelos A. Theodorou. Safe end-to-end imitation learning for model predictive control. CoRR, abs/1803.10231, 2018. URL http://arxiv.org/abs/1803.10231.
- Levine et al.  Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-End Training of Deep Visuomotor Policies. Journal of Machine Learning Research, 17(39):1–40, 2016. URL http://jmlr.org/papers/v17/15-522.html.
- Mnih et al.  Volodymyr Mnih, Nicolas Heess, Alex Graves, and Koray Kavukcuoglu. Recurrent Models of Visual Attention. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2204–2212. Curran Associates, Inc., 2014. URL http://papers.nips.cc/paper/5542-recurrent-models-of-visual-attention.pdf.
- Nguyen et al.  Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, 2015. URL https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7298640.
- Pan et al.  Yunpeng Pan, Ching-An Cheng, Kamil Saigol, Keuntaek Lee, Xinyan Yan, Evangelos A. Theodorou, and Byron Boots. Agile Autonomous Driving using End-to-End Deep Imitation Learning. Robotics: Science and Systems, 2018. URL http://www.roboticsproceedings.org/rss14/p56.pdf.
- Quigley et al.  Morgan Quigley, Ken Conley, Brian P. Gerkey, Josh Faust, Tully Foote, Jeremy Leibs, Rob Wheeler, and Andrew Y. Ng. ROS: an open-source Robot Operating System. In ICRA Workshop on Open Source Software, 2009. URL http://www.willowgarage.com/sites/default/files/icraoss09-ROS.pdf.
- Ren et al.  Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 91–99. Curran Associates, Inc., 2015. URL https://arxiv.org/abs/1506.01497.
- Ross et al.  Stéphane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell. A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, volume 15 of JMLR, Fort Lauderdale, FL, USA, 2011. URL http://proceedings.mlr.press/v15/ross11a/ross11a.pdf.
- Shelhamer et al.  Evan Shelhamer, Jonathan Long, and Trevor Darrell. Fully Convolutional Networks for Semantic Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4):640–651, April 2017. ISSN 0162-8828. doi: 10.1109/TPAMI.2016.2572683. URL https://doi.org/10.1109/TPAMI.2016.2572683.
- Simonyan and Zisserman  Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. In International Conference on Learning Representations, 2015. URL http://arxiv.org/abs/1409.1556.
- Su et al.  Jiawei Su, Danilo Vasconcellos Vargas, and Kouichi Sakurai. One pixel attack for fooling deep neural networks. CoRR, abs/1710.08864, 2017. URL http://arxiv.org/abs/1710.08864.
- Tassa et al.  Yuval Tassa, Tom Erez, and William D. Smart. Receding Horizon Differential Dynamic Programming. Advances in Neural Information Processing Systems 20, pages 1465–1472, 2008. URL http://papers.nips.cc/paper/3297-receding-horizon-differential-dynamic-programming.pdf.
- Trucco and Verri  Emanuele Trucco and Alessandro Verri. Introductory Techniques for 3-D Computer Vision. Prentice Hall PTR, Upper Saddle River, NJ, USA, 1998. ISBN 0132611082.
- Williams et al.  Grady Williams, Paul Drews, Brian Goldfain, James M. Rehg, and Evangelos A. Theodorou. Aggressive driving with model predictive path integral control. 2016 IEEE International Conference on Robotics and Automation (ICRA), 2016. URL https://ieeexplore.ieee.org/abstract/document/7487277.
- Gal et al.  Yarin Gal, Jiri Hron, and Alex Kendall. Concrete Dropout. In Advances in Neural Information Processing Systems 30, 2017. URL https://arxiv.org/abs/1705.07832.