End-to-end visual-based driving receives various interest from both deep reinforcement learning [1, 2] and imitation learning. In this paper, we mainly consider visual-based imitation learning, where a model is trained to guide the vehicle to behave similarly to a human demonstrator based on visual information. As a model-free method, raw visual information and other related measurements are taken as the input to a deep model, commonly a deep convolutional neural network (CNN). The deep model then outputs control commands directly, such as steering and acceleration. It has been successfully applied in both indoor navigation and outdoor autonomous driving.
Even though learning-based methods have achieved many breakthroughs in autonomous driving and mobile robot navigation, uncertainty is rarely considered when deploying the trained policy. However, uncertainty is critical for robotic decision making. Unlike pure perception scenarios, where higher prediction uncertainty may merely degrade the accuracy of a segmentation mask or yield an incorrect classification result, an unconfident decision in autonomous driving endangers the safety of vehicles or even human lives. Thus, we should not always assume that the output of a deep model is accurate. Knowing what a model does not understand is essential, especially for autonomous driving in dynamic environments, interacting with pedestrians and vehicles.
One promising direction is translating real-world images back to the training domain through generative adversarial networks (GAN). Most previous work focused on image-to-image transfer through a deterministic generator. However, the imitation learning policy is usually trained in a multi-domain environment with various conditions for better generalization. Thus, for a deterministic translation, the problem is which training scenario the real-world image should be transferred to. In this paper, we extend this pipeline to generate various translated images with training-data styles through multimodal cross-domain mapping. To generate the transferred images, we can either randomly sample style codes from a normal distribution or directionally encode style images provided from the training domain. The content code is extracted in real time from the real-world image collected by the mounted sensor. A decoder takes the content code and style codes as input to generate various stylized images.
Naturally, we can predict the actions and uncertainties for all the translated images through the proposed uncertainty-aware imitation learning network. Among those generated images, the most certain one is chosen for deployment to the agent.
We list the main contributions of our work as follows:
We transfer the real driving image back to diverse images stylized under the familiar training environment through a stochastic generator, so that the decision is made through multiple alternative options.
The uncertainty-aware imitation learning network provides a considered way to make driving decisions, which improves the safety of autonomous driving, especially in dynamic environments.
We explain the aleatoric uncertainty from the viewpoint of noisily labelled data samples.
II Related Works
In this section, we mainly review related works in end-to-end driving, uncertainty-aware decision making and visual domain adaptation.
II-A End-to-end Driving
For visual-based strategies in autonomous driving and robot navigation, traditional methods first recognise related objects from visual inputs, including pedestrians, traffic lights, lanes, cars and so on. That information is then used to make the final driving decisions based on manually designed rules. Benefiting from the great approximation ability of deep neural networks, end-to-end methods have recently become more and more popular in vision-based navigation.
Tai et al.  used deep convolutional neural networks to map depth images to steering commands, so that the agent can make meaningful decisions like the human demonstrator in an indoor corridor environment. A similar framework was also successfully applied in a forest trail scenario to navigate a flying platform for obstacle avoidance . They also considered softly combining all the discrete commands based on the weighted outputs of the softmax structure. Codevilla et al.  designed a deep structure with multiple branches for end-to-end driving through imitation learning. Based on the high-level command from the global path planner, the outputs from the specific branch are applied to the mobile agent.
Reinforcement learning (RL) algorithms have also shown impressive results in end-to-end navigation. Zhang et al.  explored the target-arriving ability of a mobile robot through reinforcement learning based on a single depth image. Their policy can also quickly adapt to new situations through successor features. For autonomous driving, RL algorithms are also used to train an intelligent agent through interaction with simulated environments like Carla . Liang et al.  used the model weights trained through imitation learning as the initialization of their reinforcement learning policy. Tai et al.  proposed to solve the socially compliant navigation problem through inverse reinforcement learning.
However, all of the methods above directly deploy the learned policy on related platforms. None of them considered the uncertainty of the decision.
II-B Uncertainty in learning-based decision making
The uncertainty in deep learning derives from Bayesian deep learning  approaches, where aleatoric uncertainty and epistemic uncertainty are extracted through specific learning structures. Recently, computer vision researchers have started to leverage those uncertainties in related applications, such as balancing the weights of different loss terms in multi-task visual perception. The uncertainty estimation helps those deep Bayesian models achieve state-of-the-art results on various computer vision benchmarks, including semantic segmentation and depth regression.
In terms of decision making in robotics, Kahn et al.  proposed an uncertainty-aware model-based reinforcement learning method that iteratively updates the confidence for a specific obstacle. During the training phase, the agent behaves more carefully in unfamiliar scenarios at the beginning. Based on this work, Lütjens et al.  explored more complex pedestrian-rich environments. The uncertainty was further considered for the exploitation-exploration balance in their implementation . Henaff et al.  focused on out-of-distribution data, where an uncertainty cost was used to represent the divergence of the test sample from the training states. However, all of the methods above follow the pipeline of using multiple stochastic forward passes through Dropout to estimate the epistemic uncertainty . The time-consuming computation potentially limits the application of those methods in scenarios that require real-time deployment.
A highly related work is that of Choi et al. , who proposed a novel uncertainty estimation method where a single feedforward pass is enough for uncertainty acquisition. However, they only tested their method in state space. In this paper, we tackle the much more difficult visual-based navigation problem.
II-C Visual domain adaptation
For a policy trained in simulated environments, or on datasets collected from simulated environments, the gap to the testing world (e.g. the real world) is always an essential problem. In the following, we mainly review policy transfer methods based on image translation.
One probable solution is the so-called sim-to-real approach, where synthetic images are translated to the realistic domain . With an additional adaptation step for each training iteration, the whole training-deploying procedure is inevitably slowed down.
Another direction is real-to-sim, where real-world images are translated back to the simulated environment. Zhang et al.  extended the CycleGAN  framework with a shift loss, which improves the consistency of the generated image streams. They achieved great improvements on the Carla  navigation benchmark. Muller et al.  first perceive a real-world RGB image as a segmentation mask, which is then used to generate path points through a learned policy network.
For the pure unsupervised image-to-image translation problem, unlike the previous deterministic translation models, multimodal mapping receives lots of attention from computer vision researchers [21, 22, 23]. The goal is to translate an image from the source domain to a conditional distribution of related images in the target domain. This is naturally applicable to robotic tasks, because the training domain of the policy network always contains data collected under various conditions (e.g. different weather conditions [3, 2]) for better generalization ability.
III An explanation of aleatoric uncertainty
As mentioned before, two types of uncertainty in deep learning are introduced in [11, 12]: the aleatoric uncertainty and the epistemic uncertainty. The epistemic uncertainty is the model uncertainty, which can be reduced by adding enough data. However, it is commonly estimated through stochastic Dropout forward passes, which cost too much time to be applied in real time. In this paper, we mainly consider the aleatoric uncertainty, i.e. the data uncertainty.
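As a hedged illustration of this runtime cost (a toy stand-in model, not the actual driving network), estimating epistemic uncertainty via Monte-Carlo Dropout requires T stochastic forward passes per input frame:

```python
import numpy as np

rng = np.random.default_rng(1)

def forward(x, drop=0.5):
    # Toy stand-in for a network with Dropout kept active at test time:
    # each pass randomly masks units, so repeated passes disagree.
    mask = rng.random(x.shape) > drop
    return float((x * mask / (1.0 - drop)).sum())

x = np.ones(64)
T = 50                                    # number of stochastic passes
samples = [forward(x) for _ in range(T)]  # T full forward passes per frame
prediction = np.mean(samples)             # MC estimate of the output
epistemic = np.var(samples)               # spread across passes = model uncertainty
```

The T-fold forward cost per frame is exactly the overhead that a single-pass aleatoric estimate avoids.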
Following the heteroscedastic aleatoric uncertainty setup in , a regression task can be represented as

$[\hat{y}, \hat{\sigma}^2] = f^{W}(x), \qquad \mathcal{L}(W) = \frac{1}{2\hat{\sigma}^2}\|y - \hat{y}\|^2 + \frac{1}{2}\log\hat{\sigma}^2.$

Here, $x$ is the input data, and $y$ and $\hat{y}$ are the ground-truth regression target and the predicted result. $\hat{\sigma}^2$ is another output of the model and can represent the variance of the data. $W$ is the model weight of the regression model.
We provide an explanation for $\hat{\sigma}^2$ to show why it can represent the variance, or the uncertainty, of $\hat{y}$. Suppose that there is a subset $D_s$ of the training dataset, with size $N_s$. For the prediction and the uncertainty, $[\hat{y}_i, \hat{\sigma}_i^2] = f^{W_s}(x_i)$. The optimization target on this subset is

$\mathcal{L}(W_s) = \sum_{i=1}^{N_s} \frac{1}{2\hat{\sigma}_i^2}\|y_i - \hat{y}_i\|^2 + \frac{1}{2}\log\hat{\sigma}_i^2,$

where $W_s$ is the model weight to optimize for this subset $D_s$. Assume that all the inputs $x_i$ in this subset are exactly the same, as $x_s$. Because of the limitations of human labelling, they may be labelled with conflicting ground truths $y_i$ (like the noisy labels around object boundaries in ). Then, the model would output the same prediction and uncertainty for all of them, as $\hat{y}_s$ and $\hat{\sigma}_s^2$. The minimization target turns into

$\mathcal{L}(W_s) = \frac{1}{2\hat{\sigma}_s^2}\sum_{i=1}^{N_s}\|y_i - \hat{y}_s\|^2 + \frac{N_s}{2}\log\hat{\sigma}_s^2.$

Considering that $\hat{y}_s$ and $\hat{\sigma}_s^2$ are conditionally independent given $x_s$, the optimum of $\hat{\sigma}_s^2$ can be derived through the first-order derivative as

$\hat{\sigma}_s^2 = \frac{1}{N_s}\sum_{i=1}^{N_s}\|y_i - \hat{y}_s\|^2.$

For the model $f^{W_s}$, it makes sense to output $\hat{y}_s$ as the mean of the $y_i$. And that is why $\hat{\sigma}_s^2$, as the prediction variance over the $y_i$, can be regarded as the uncertainty of $\hat{y}_s$. For a decision-making task, even though at some point the model cannot predict a good enough command, it should at least know that this prediction is uncertain rather than directly deploying it.
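This derivation can be checked numerically. The sketch below (plain NumPy, using a brute-force grid search rather than gradient descent) minimizes the aleatoric loss for a set of conflicting labels attached to one identical input, and recovers their mean and variance:

```python
import numpy as np

# Conflicting ground truths y_i for the same input x_s,
# e.g. noisy human labels around an object boundary.
y = np.array([0.0, 1.0, 1.0, 0.0])

def loss(y_hat, log_var):
    # L = sum_i ||y_i - y_hat||^2 / (2 sigma^2) + (N/2) log sigma^2
    var = np.exp(log_var)
    return np.sum((y - y_hat) ** 2) / (2.0 * var) + 0.5 * len(y) * log_var

# Brute-force search over (y_hat, log sigma^2).
y_grid = np.linspace(-1.0, 2.0, 301)
s_grid = np.linspace(-5.0, 2.0, 701)
values = np.array([[loss(yh, s) for s in s_grid] for yh in y_grid])
iy, isv = np.unravel_index(np.argmin(values), values.shape)

best_y, best_var = y_grid[iy], np.exp(s_grid[isv])
# best_y ≈ mean(y) = 0.5 and best_var ≈ var(y) = 0.25
```

The minimizer lands at the sample mean and the sample variance of the conflicting labels, as the first-order conditions predict.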
IV-A Carla navigation dataset and benchmark
As mentioned in , it is difficult to evaluate an autonomous driving policy under a common benchmark in the real world. Thus, we use the Carla driving dataset (https://github.com/carla-simulator/imitation-learning) to train the visual-based navigation policy. Then, for the evaluation, we can naturally deploy it through the Carla navigation benchmark  under an unseen extreme weather condition. The distribution of the Carla dataset  and the benchmark details are available in .
The collected expert dataset of Carla includes four different weather conditions (daytime, daytime after rain, clear sunset and daytime hard rain). The original experiments  test their policy under cloudy daytime and soft rain at sunset. However, considering that these two weather conditions are not available in the provided dataset for domain adaptation, we resplit the Carla driving dataset into a training domain (daytime, daytime after rain, clear sunset) and a testing domain (daytime hard rain), as shown in Fig. 2, following the setup in . The vehicle speed, ground-truth actions and related measurements are also provided by the dataset  and considered by our policy model.
The final testing environment under daytime hard rain is extremely challenging. We believe that the difficulty of deploying the policy through visual domain transformation from the training domain to the testing domain (train-to-test) in this paper can be regarded as comparable to the previous real-to-sim experiments [2, 5].
IV-B Uncertainty-aware Imitation Learning
We first introduce the framework of the policy network, the proposed uncertainty-aware imitation learning network, as shown in Fig. 3. The backbone of our policy network is based on the conditional imitation learning network  for visual-based navigation.
For this framework, the training dataset includes all three kinds of weather in the training domain mentioned in Section IV-A. In each forward step, an RGB image from the training dataset and the related vehicle speed are taken as the input to the network. The extracted features are passed to four different branches with the same structure. A high-level command (straight, left, right, follow lane) from the global path planner decides which branch's output is chosen as the final prediction. The output consists of the predicted action $a$ and its estimated uncertainty $\sigma^2$. In practice, since $\sigma^2 \geq 0$, we let the network predict the log variance $s = \log\sigma^2$. The action $a$ is actually a vector including acceleration $a_a$, steering $a_s$ and braking $a_b$, with their related uncertainties $\sigma_a^2$, $\sigma_s^2$ and $\sigma_b^2$ respectively.
We use $\theta$ to represent the weight of the policy network, and $a_{gt}$ to represent the ground-truth actions from the collected dataset. The policy prediction process and our uncertainty-aware loss function are as follows.
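As a minimal sketch (plain NumPy rather than the actual PyTorch implementation; the function and variable names are ours, not from the released code), the uncertainty-aware loss with a log-variance head looks like:

```python
import numpy as np

def uail_loss(a_pred, log_var, a_gt):
    """Uncertainty-aware imitation loss.

    a_pred, a_gt: predicted / ground-truth action vectors
                  (acceleration, steering, braking).
    log_var:      predicted s = log(sigma^2) per action dimension;
                  predicting the log keeps sigma^2 positive and the
                  loss numerically stable.
    """
    a_pred, log_var, a_gt = map(np.asarray, (a_pred, log_var, a_gt))
    precision = np.exp(-log_var)                # 1 / sigma^2
    per_dim = 0.5 * precision * (a_gt - a_pred) ** 2 + 0.5 * log_var
    return float(per_dim.sum())

# With sigma^2 = 1 (log_var = 0) the loss reduces to 0.5 * squared error:
l = uail_loss([0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [1.0, 0.0, 0.0])  # = 0.5
```

The log-variance term penalizes the network for claiming high uncertainty everywhere, so it can only down-weight residuals on genuinely ambiguous samples.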
IV-C Stochastic train-to-test transformation
For the unsupervised real-to-sim pipeline, previous works  are mainly based on deterministic structures like CycleGAN . In this paper, we mainly consider stochastic multimodal translation  through generative adversarial networks. The training domain, with three different kinds of weather, is represented as $X_1$, and the testing domain, with the single weather condition, is represented as $X_2$, as shown in Fig. 4. The two domains are assumed to maintain their own distinguishable style spaces ($S_1$ and $S_2$) but share a common content space $C$. The stochastic model contains an encoder ($E_1$, $E_2$) and a decoder ($G_1$, $G_2$) for each of the two domains. The training procedure follows the setup of . For example, an image $x_1$ sampled from the training domain can be encoded to its style code $s_1$ and content code $c_1$ by the training-domain encoder $E_1$. The training-domain decoder $G_1$ can also combine these two codes to generate $\hat{x}_1$ as the reconstruction of $x_1$.

For the cross-domain translation, the testing-domain decoder $G_2$ combines a random style code $s_2$ from the testing domain and the content code $c_1$ to generate the translated image $x_{1\to2}$. The testing-domain encoder $E_2$ takes this translated image as input and generates $\hat{c}_1$ and $\hat{s}_2$ as the reconstructions of $c_1$ and $s_2$.

($x_1$, $\hat{x}_1$), ($c_1$, $\hat{c}_1$) and ($s_2$, $\hat{s}_2$) are constrained by L1 losses. A discriminator of the GAN structure is used to distinguish $x_{1\to2}$ from the original testing-domain images. We skip the reconstruction of the testing-domain image and the translation procedure from the testing domain to the training domain. The content code is a 2D matrix with a size corresponding to the input image. The style code is a vector of eight numbers individually sampled from the normal distribution. Notice that the whole pipeline is unsupervised: images from the two domains do not need to be paired for training.
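The encode/decode cycle can be sketched with placeholder codecs (identity-style toy functions standing in for the trained encoder/decoder networks of the two domains; all names here are illustrative, and in the toy the L1-constrained reconstructions become exact identities):

```python
import numpy as np

rng = np.random.default_rng(0)
CONTENT, STYLE = 16, 8   # the style code is 8 numbers drawn from N(0, 1)

def make_codec():
    # Toy "networks": encode splits a vector into (content, style);
    # decode concatenates them back. The real encoders/decoders are
    # CNNs trained with GAN + L1 reconstruction losses.
    def encode(x):
        return x[:CONTENT], x[CONTENT:]
    def decode(content, style):
        return np.concatenate([content, style])
    return encode, decode

E_train, G_train = make_codec()   # training-domain encoder / decoder
E_test, G_test = make_codec()     # testing-domain encoder / decoder

x_train = rng.normal(size=CONTENT + STYLE)  # a training-domain "image"
c, s = E_train(x_train)                     # disentangle content / style
x_rec = G_train(c, s)                       # within-domain reconstruction
s_test = rng.normal(size=STYLE)             # random testing-domain style
x_trans = G_test(c, s_test)                 # cross-domain translation
c_rec, s_test_rec = E_test(x_trans)         # re-encode the translation
```

The three reconstruction pairs in the toy correspond exactly to the three L1-constrained pairs above: image, content code and style code.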
IV-D Deploying phase
In the deploying phase, the final forward pipeline is shown in Fig. 1, and we list all the steps in Algorithm 1. After training all the model weights, including the policy network (Section IV-B) and the encoders and decoders of the two domains (Section IV-C), the whole pipeline can be deployed in the testing environment. In each time step, an image collected from the sensor mounted on the vehicle in the Carla environment under the test weather (daytime hard rain) is first passed to the testing-domain encoder to extract the content code. The style codes can be encoded from sampled training-domain images by the training-domain encoder, or directly sampled from a normal distribution. Through the training-domain decoder, the original input image is translated to various generated images under different training-domain styles. Those generated images are processed by the pretrained uncertainty-aware imitation learning policy network. Thus, we get the actions and uncertainties corresponding to all the translated images. Among those actions, the one with the lowest uncertainty is finally deployed to the mobile agent. We skip the details of the individual actions (acceleration, steering and braking) here; each is decided by its own uncertainty individually.
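The per-dimension selection at deployment time can be sketched as follows (a NumPy sketch with illustrative names; the real pipeline obtains these candidate actions and uncertainties by running the translated images through the policy network):

```python
import numpy as np

def select_action(actions, uncertainties):
    """actions, uncertainties: (n_translations, 3) arrays over the
    candidate translated images, one column per control dimension
    (acceleration, steering, braking). For each dimension, deploy the
    candidate with the lowest predicted uncertainty."""
    actions = np.asarray(actions)
    uncertainties = np.asarray(uncertainties)
    best = np.argmin(uncertainties, axis=0)          # per-dimension winner
    return actions[best, np.arange(actions.shape[1])]

a = select_action(
    actions=[[0.9, 0.10, 0.0],     # candidate from style 1
             [0.7, 0.02, 0.0]],    # candidate from style 2
    uncertainties=[[0.2, 0.5, 0.1],
                   [0.4, 0.1, 0.3]],
)
# a == [0.9, 0.02, 0.0]: acceleration and braking come from the first
# candidate, steering from the second.
```

Because each control dimension is selected independently, a single translated image does not need to win on all three commands at once.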
V-A Model Training
For the stochastic translation model training, we follow the setup of MUNIT (https://github.com/NVlabs/MUNIT). As mentioned before, the training domain consists of three weather conditions, and the testing domain only contains images under daytime hard rain. To maintain enough information in the content code, the original images in the Carla dataset are resized for the stochastic image translation. After that, we obtain the trained encoders and decoders, which are used in the final forward pipeline as described in Section IV-D.
For the uncertainty-aware imitation learning network, we train on the whole training-domain dataset (481600 images under three kinds of weather) for 90 epochs with a batch size of 1000. As in the original setup, we also try several different network structures for the uncertainty estimation. Experimental experience shows that the current structure is the most effective: the features of the image and the velocity are processed through another four branches, outputting the uncertainties of the actions corresponding to the four high-level commands. The code for the imitation learning policy training is available online (https://github.com/onlytailei/carla_cil_pytorch). We implement all the code in PyTorch, and all training is done on an Nvidia 1080Ti GPU.
V-B Model Evaluation
| Visual trans. model | Direct | CycleGAN | Direct | CycleGAN | Stoc.-Single | Stoc.-Random | Stoc.-Cross |
| Ave. distance to goal travelled (%, avg/max) | 5.2/- | 55.7/- | 25.8/29.5 | 66.4/78.0 | 73.0/76.3 | 62.1/67.8 | 75.8/79.3 |
Finally, we conduct experiments on the Carla navigation benchmark mentioned in Section IV-A. We compare different strategies both for the policy model and the visual domain transformation methods as follows:
For the policy model, we compare two different setups:
CIL: Map the state to the action without considering the uncertainty, as in the original conditional imitation learning structure . The output is directly deployed to the vehicle.
UAIL: Take the uncertainty as an output of the network as described in Section IV-B. When using multiple visual inputs, the action with the lowest uncertainty would be chosen as the final command.
For the visual domain adaptation methods, five strategies are compared:
Direct: Directly deploy the control policy in the testing environment without any visual domain adaptations.
CycleGAN: Transfer the real-time image to a specific training condition through CycleGAN  deterministically.
Stochastic-Single: Directionally transfer the input image to a specific training weather condition based on the style image from the training domain.
Stochastic-Random: Randomly sample three style codes to decode the translated images.
Stochastic-Cross: Directionally transfer the real-time image to all the three training weather conditions based on the style images from the training domain.
For the Stochastic-Single and Stochastic-Cross transfer methods, the style codes are encoded from style images in the training domain. We prepare ten images for each of the training weather conditions, randomly sampled from the training dataset under the related weather condition. In each step, Stochastic-Single samples one style image from the related weather condition, while Stochastic-Cross samples three style images, one from each of the training conditions respectively.
The Carla navigation benchmark consists of four tasks: Straight, One turn, Navigation and Navigation with dynamic obstacles, where the vehicle needs to finish 25 different navigation routes in each task. Since the first three tasks do not consider any pedestrians or vehicles in the environment, previous methods  have achieved considerable generalization results on those tasks. In this paper, we mainly consider the most challenging one, Navigation with dynamic obstacles, under the testing weather condition. (The uncertainty-aware policy is a little conservative, so we relax the time limit for each trial. However, this does not affect the infraction results in Table I.)
We run each of the setups three times and show the average/max results on the related benchmarks in Table I. In particular, as CycleGAN is a deterministic transfer method, we build three transfer models, one between each training weather condition and the test weather condition. In each benchmark trial, one specific transfer model is used. The stochastic model can generate various stylized images through a single batch operation; to achieve the same with CycleGAN, we would need to feed the real-time image to each deterministic transfer model one by one, which is both time- and resource-consuming. So the three benchmark experiments of CycleGAN are each run under one specific training weather condition. The same holds for the Stochastic-Single method, except that the specific training style image is sampled from the prepared subset of ten images mentioned before.
We do not show results for combining the CIL policy model with the stochastic transfer models, because without uncertainties there is no reason to choose a specific action among those generated from the various input images. The results of the CIL policy are taken from , where their multi-domain policy is what we refer to as CIL here. The results in  do not provide the max values of their trials.
Among the different policy models under Direct deployment and transformation with CycleGAN, our proposed UAIL shows great improvements in all of the metrics of the Carla navigation benchmark. Comparing the different transformation methods, Stochastic-Cross shows the best generalization under the testing weather condition.
To understand the selection mechanism of our proposed UAIL and Stochastic-Cross pipeline, we show two typical uncertainty estimation examples during testing in Fig. 5. As shown in Fig. 5-I, the outputs of actions and uncertainties for the different translated images are quite close to each other, even though we choose the actions of the last image based on the lowest uncertainty. This is expected, since the straight-line scenario is the most common one in the training dataset and the decision is relatively simple to make. The dynamic and challenging turning scenario in Fig. 5-II is what we particularly aim to solve. There, the second transferred image, under Clear Sunset, outputs a tiny steering command that would potentially cause a collision with the car in front.
We proposed a deployment pipeline for visual-based navigation policies under the real-to-sim structure. By considering the aleatoric data uncertainty and the stochastic transformation when translating the testing image back to the training domain, a safer action selection mechanism is constructed for end-to-end driving. Experiments deploying the pretrained policy in an unseen extreme weather condition through the Carla navigation benchmark show that our proposed pipeline provides a more confident and robust solution.
For future work, transferring the trained policy to real-world autonomous driving in a challenging environment would be exciting. Considering the consistency of image streams as in  could be another direction. Furthermore, this alternative-choices decision-making pipeline may also guide improvements in model training for challenging samples, and the related model augmentation towards more robust generalization can also be expected.
-  A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, “CARLA: An open urban driving simulator,” in CoRL, vol. 78. PMLR, 13–15 Nov 2017, pp. 1–16.
-  J. Zhang, L. Tai, P. Yun, Y. Xiong, M. Liu, J. Boedecker, and W. Burgard, “Vr-goggles for robots: Real-to-sim domain adaptation for visual control,” IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 1148–1155, April 2019.
-  F. Codevilla, M. Miiller, A. López, V. Koltun, and A. Dosovitskiy, “End-to-end driving via conditional imitation learning,” in 2018 IEEE International Conference on Robotics and Automation (ICRA), May 2018, pp. 1–9.
-  L. Tai, S. Li, and M. Liu, “A deep-network solution towards model-less obstacle avoidance,” in 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Oct 2016, pp. 2759–2764.
-  L. Yang, X. Liang, T. Wang, and E. Xing, “Real-to-virtual domain unification for end-to-end autonomous driving,” in ECCV. Cham: Springer International Publishing, 2018, pp. 553–570.
-  C. Chen, A. Seff, A. Kornhauser, and J. Xiao, “Deepdriving: Learning affordance for direct perception in autonomous driving,” in The IEEE International Conference on Computer Vision (ICCV), December 2015.
-  A. Giusti, J. Guzzi, D. C. Cireşan, F. He, J. P. Rodríguez, F. Fontana, M. Faessler, C. Forster, J. Schmidhuber, G. D. Caro, D. Scaramuzza, and L. M. Gambardella, “A machine learning approach to visual perception of forest trails for mobile robots,” IEEE Robotics and Automation Letters, vol. 1, no. 2, pp. 661–667, July 2016.
-  J. Zhang, J. T. Springenberg, J. Boedecker, and W. Burgard, “Deep reinforcement learning with successor features for navigation across similar environments,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Sep 2017, pp. 2371–2378.
-  X. Liang, T. Wang, L. Yang, and E. Xing, “Cirl: Controllable imitative reinforcement learning for vision-based self-driving,” in The European Conference on Computer Vision (ECCV), September 2018.
-  L. Tai, J. Zhang, M. Liu, and W. Burgard, “Socially compliant navigation through raw depth inputs with generative adversarial imitation learning,” in 2018 IEEE International Conference on Robotics and Automation (ICRA), May 2018, pp. 1111–1117.
-  Y. Gal, “Uncertainty in deep learning,” Ph.D. dissertation, University of Cambridge, 2016.
-  A. Kendall and Y. Gal, “What uncertainties do we need in bayesian deep learning for computer vision?” in Advances in neural information processing systems, 2017, pp. 5574–5584.
-  A. Kendall, Y. Gal, and R. Cipolla, “Multi-task learning using uncertainty to weigh losses for scene geometry and semantics,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7482–7491.
-  G. Kahn, A. Villaflor, V. Pong, P. Abbeel, and S. Levine, “Uncertainty-aware reinforcement learning for collision avoidance,” arXiv preprint arXiv:1702.01182, 2017.
-  B. Lütjens, M. Everett, and J. P. How, “Safe reinforcement learning with model uncertainty estimates,” arXiv preprint arXiv:1810.08700, 2018.
-  M. Henaff, A. Canziani, and Y. LeCun, “Model-predictive policy learning with uncertainty regularization for driving in dense traffic,” arXiv preprint arXiv:1901.02705, 2019.
-  S. Choi, K. Lee, S. Lim, and S. Oh, “Uncertainty-aware learning from demonstration using mixture density networks with sampling-free variance modeling,” in 2018 IEEE International Conference on Robotics and Automation (ICRA), May 2018, pp. 6915–6922.
-  X. Pan, Y. You, Z. Wang, and C. Lu, “Virtual to real reinforcement learning for autonomous driving,” in Proceedings of the British Machine Vision Conference (BMVC), 2017.
-  J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2223–2232.
-  M. Mueller, A. Dosovitskiy, B. Ghanem, and V. Koltun, “Driving policy transfer via modularity and abstraction,” in Proceedings of The 2nd Conference on Robot Learning, ser. Proceedings of Machine Learning Research, A. Billard, A. Dragan, J. Peters, and J. Morimoto, Eds., vol. 87. PMLR, 29–31 Oct 2018, pp. 1–15. [Online]. Available: http://proceedings.mlr.press/v87/mueller18a.html
-  X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz, “Multimodal unsupervised image-to-image translation,” in The European Conference on Computer Vision (ECCV), September 2018.
-  H.-Y. Lee, H.-Y. Tseng, J.-B. Huang, M. Singh, and M.-H. Yang, “Diverse image-to-image translation via disentangled representations,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 35–51.
-  A. Almahairi, S. Rajeswar, A. Sordoni, P. Bachman, and A. Courville, “Augmented cyclegan: Learning many-to-many mappings from unpaired data,” arXiv preprint arXiv:1802.10151, 2018.
-  J. Zhang, L. Tai, Y. Xiong, M. Liu, J. Boedecker, and W. Burgard, “Supplement file of VR-Goggles for robots: Real-to-sim domain adaptation for visual control,” Tech. Rep., 2018. [Online]. Available: https://ram-lab.com/file/tailei/vr_goggles/supplement.pdf